summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2021-08-26lockd: don't attempt blocking locks on nfs reexportsJ. Bruce Fields1-3/+10
As in the v4 case, it doesn't work well to block waiting for a lock on an nfs filesystem. As in the v4 case, that means we're depending on the client to poll. It's probably incorrect to depend on that, but I *think* clients do poll in practice. In any case, it's an improvement over hanging the lockd thread indefinitely as we currently are. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-26nfs: don't atempt blocking locks on nfs reexportsJ. Bruce Fields2-3/+7
NFS implements blocking locks by blocking inside its lock method. In the reexport case, this blocks the nfs server thread, which could lead to deadlocks since an nfs server thread might be required to unlock the conflicting lock. It also causes a crash, since the nfs server thread assumes it can free the lock when its lm_notify lock callback is called. Ideal would be to make the nfs lock method return without blocking in this case, but for now it works just not to attempt blocking locks. The difference is just that the original client will have to poll (as it does in the v4.0 case) instead of getting a callback when the lock's available. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-24Keep read and write fds with each nlm_fileJ. Bruce Fields5-41/+102
We shouldn't really be using a read-only file descriptor to take a write lock. Most filesystems will put up with it. But NFS, for example, won't. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-23lockd: update nlm_lookup_file reexport commentJ. Bruce Fields1-2/+3
Update comment to reflect that we *do* allow reexport, whether it's a good idea or not.... Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-23nlm: minor refactoringJ. Bruce Fields2-10/+10
Make this lookup slightly more concise, and prepare for changing how we look this up in a following patch. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-23nlm: minor nlm_lookup_file argument changeJ. Bruce Fields3-9/+11
It'll come in handy to get the whole nlm_lock. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-21lockd: lockd server-side shouldn't set fl_opsJ. Bruce Fields1-18/+12
Locks have two sets of op arrays, fl_lmops for the lock manager (lockd or nfsd), fl_ops for the filesystem. The server-side lockd code has been setting its own fl_ops, which leads to confusion (and crashes) in the reexport case, where the filesystem expects to be the only one setting fl_ops. And there's no reason for it that I can see-the lm_get/put_owner ops do the same job. Reported-by: Daire Byrne <daire@dneg.com> Tested-by: Daire Byrne <daire@dneg.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-17nfsd4: Fix forced-expiry lockingJ. Bruce Fields1-2/+2
This should use the network-namespace-wide client_lock, not the per-client cl_lock. You shouldn't see any bugs unless you're actually using the forced-expiry interface introduced by 89c905beccbb. Fixes: 89c905beccbb "nfsd: allow forced expiration of NFSv4 clients" Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-17lockd: change the proc_handler for nsm_use_hostnamesJia He1-1/+1
nsm_use_hostnames is a module parameter and it will be exported to sysctl procfs. This is to let user sometimes change it from userspace. But the minimal unit for sysctl procfs read/write it sizeof(int). In big endian system, the converting from/to bool to/from int will cause error for proc items. This patch use a new proc_handler proc_dobool to fix it. Signed-off-by: Jia He <hejianet@gmail.com> Reviewed-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> [thuth: Fix typo in commit message] Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-17NFSD: remove vanity commentsNeilBrown1-1/+0
Including one's name in copyright claims is appropriate. Including it in random comments is just vanity. After 2 decades, it is time for these to be gone. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-17lockd: Fix invalid lockowner cast after vfs_test_lockBenjamin Coddington1-1/+1
After calling vfs_test_lock() the pointer to a conflicting lock can be returned, and that lock is not guarunteed to be owned by nlm. In that case, we cannot cast it to struct nlm_lockowner. Instead return the pid of that conflicting lock. Fixes: 646d73e91b42 ("lockd: Show pid of lockd for remote locks") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-17NFSD: Use new __string_len C macros for nfsd_clid_classChuck Lever1-3/+2
Clean up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-17NFSD: Use new __string_len C macros for the nfs_dirent tracepointChuck Lever1-7/+5
Clean up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-17NFSD: Batch release pages during splice readChuck Lever1-7/+2
Large splice reads call put_page() repeatedly. put_page() is relatively expensive to call, so replace it with the new svc_rqst_replace_page() helper to help amortize that cost. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: NeilBrown <neilb@suse.de>
2021-08-17NFSD: Clean up splice actorChuck Lever1-8/+3
A few useful observations: - The value in @size is never modified. - splice_desc.len is an unsigned int, and so is xdr_buf.page_len. An implicit cast to size_t is unnecessary. - The computation of .page_len is the same in all three arms of the "if" statement, so hoist it out to make it clear that the operation is an unconditional invariant. The resulting function is 18 bytes shorter on my system (-Os). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: NeilBrown <neilb@suse.de>
2021-08-15Merge tag 'libnvdimm-fixes-5.14-rc6' of ↵Linus Torvalds2-5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fixes from Dan Williams: "A couple of fixes for long standing bugs, a warning fixup, and some miscellaneous dax cleanups. The bugs were recently found due to new platforms looking to use the ACPI NFIT "virtual" device definition, and new error injection capabilities to trigger error responses to label area requests. Ira's cleanups have been long pending, I neglected to send them earlier, and see no harm in including them now. This has all appeared in -next with no reported issues. Summary: - Fix support for NFIT "virtual" ranges (BIOS-defined memory disks) - Fix recovery from failed label storage areas on NVDIMM devices - Miscellaneous cleanups from Ira's investigation of dax_direct_access paths preparing for stray-write protection" * tag 'libnvdimm-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: tools/testing/nvdimm: Fix missing 'fallthrough' warning libnvdimm/region: Fix label activation vs errors ACPI: NFIT: Fix support for virtual SPA ranges dax: Ensure errno is returned from dax_direct_access fs/dax: Clarify nr_pages to dax_direct_access() fs/fuse: Remove unneeded kaddr parameter
2021-08-14Merge tag 'configfs-5.14' of git://git.infradead.org/users/hch/configfsLinus Torvalds1-12/+6
Pull configfs fix from Christoph Hellwig: - fix to revert to the historic write behavior (Bart Van Assche) * tag 'configfs-5.14' of git://git.infradead.org/users/hch/configfs: configfs: restore the kernel v5.13 text attribute write behavior
2021-08-14Merge tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds6-33/+80
Pull cifs fixes from Steve French: "Four CIFS/SMB3 Fixes, all for stable, two relating to deferred close, and one for the 'modefromsid' mount option (when 'idsfromsid' not specified)" * tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Call close synchronously during unlink/rename/lease break. cifs: Handle race conditions during rename cifs: use the correct max-length for dentry_path_raw() cifs: create sd context must be a multiple of 8
2021-08-14Merge tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-blockLinus Torvalds2-36/+48
Pull io_uring fixes from Jens Axboe: "A bit bigger than the previous weeks, but mostly just a few stable bound fixes. In detail: - Followup fixes to patches from last week for io-wq, turns out they weren't complete (Hao) - Two lockdep reported fixes out of the RT camp (me) - Sync the io_uring-cp example with liburing, as a few bug fixes never made it to the kernel carried version (me) - SQPOLL related TIF_NOTIFY_SIGNAL fix (Nadav) - Use WRITE_ONCE() when writing sq flags (Nadav) - io_rsrc_put_work() deadlock fix (Pavel)" * tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-block: tools/io_uring/io_uring-cp: sync with liburing example io_uring: fix ctx-exit io_rsrc_put_work() deadlock io_uring: drop ctx->uring_lock before flushing work item io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker() io-wq: fix bug of creating io-wokers unconditionally io_uring: rsrc ref lock needs to be IRQ safe io_uring: Use WRITE_ONCE() when writing to sq_flags io_uring: clear TIF_NOTIFY_SIGNAL when running task work
2021-08-13Merge tag 'ceph-for-5.14-rc6' of git://github.com/ceph/ceph-clientLinus Torvalds4-28/+50
Pull ceph fixes from Ilya Dryomov: "A patch to avoid a soft lockup in ceph_check_delayed_caps() from Luis and a reference handling fix from Jeff that should address some memory corruption reports in the snaprealm area. Both marked for stable" * tag 'ceph-for-5.14-rc6' of git://github.com/ceph/ceph-client: ceph: take snap_empty_lock atomically with snaprealm refcount change ceph: reduce contention in ceph_check_delayed_caps()
2021-08-12Merge branch 'for-v5.14' of ↵Linus Torvalds2-12/+22
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ucounts fix from Eric Biederman: "This fixes the ucount sysctls on big endian architectures. The counts were expanded to be longs instead of ints, and the sysctl code was overlooked, so only the low 32bit were being processed. On litte endian just processing the low 32bits is fine, but on 64bit big endian processing just the low 32bits results in the high order bits instead of the low order bits being processed and nothing works proper. This change took a little bit to mature as we have the SYSCTL_ZERO, and SYSCTL_INT_MAX macros that are only usable for sysctls operating on ints, but unfortunately are not obviously broken. Which resulted in the versions of this change working on big endian and not on little endian, because the int SYSCTL_ZERO when extended 64bit wound up being 0x100000000. So we only allowed values greater than 0x100000000 and less than 0faff. Which unfortunately broken everything that tried to set the sysctls. (First reported with the windows subsystem for linux). I have tested this on x86_64 64bit after first reproducing the problems with the earlier version of this change, and then verifying the problems do not exist when we use appropriate long min and max values for extra1 and extra2" * 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucounts: add missing data type changes
2021-08-12cifs: Call close synchronously during unlink/rename/lease break.Rohith Surabattula3-30/+56
During unlink/rename/lease break, deferred work for close is scheduled immediately but in an asynchronous manner which might lead to race with actual(unlink/rename) commands. This change will schedule close synchronously which will avoid the race conditions with other commands. Signed-off-by: Rohith Surabattula <rohiths@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org # 5.13 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-12cifs: Handle race conditions during renameRohith Surabattula2-7/+28
When rename is executed on directory which has files for which close is deferred, then rename will fail with EACCES. This patch will try to close all deferred files when EACCES is received and retry rename on a directory. Signed-off-by: Rohith Surabattula <rohiths@microsoft.com> Cc: stable@vger.kernel.org # 5.13 Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-11Merge branch 'for-5.14/dax' into libnvdimm-fixesDan Williams2-5/+3
Pick up some small dax cleanups that make some of Ira's follow on work easier.
2021-08-10Merge tag 'ovl-fixes-5.14-rc6-v2' of ↵Linus Torvalds4-16/+80
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Fix several bugs in overlayfs" * tag 'ovl-fixes-5.14-rc6-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: prevent private clone if bind mount is not allowed ovl: fix uninitialized pointer read in ovl_lookup_real_one() ovl: fix deadlock in splice write ovl: skip stale entries in merge dir cache iteration
2021-08-10cifs: use the correct max-length for dentry_path_raw()Ronnie Sahlberg1-1/+1
RHBZ: 1972502 PATH_MAX is 4096 but PAGE_SIZE can be >4096 on some architectures such as ppc and would thus write beyond the end of the actual object. Cc: <stable@vger.kernel.org> Reported-by: Xiaoli Feng <xifeng@redhat.com> Suggested-by: Brian foster <bfoster@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-10ovl: prevent private clone if bind mount is not allowedMiklos Szeredi1-14/+28
Add the following checks from __do_loopback() to clone_private_mount() as well: - verify that the mount is in the current namespace - verify that there are no locked children Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de> Fixes: c771d683a62e ("vfs: introduce clone_private_mount()") Cc: <stable@vger.kernel.org> # v3.18 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10ovl: fix uninitialized pointer read in ovl_lookup_real_one()Miklos Szeredi1-1/+1
One error path can result in release_dentry_name_snapshot() being called before "name" was initialized by take_dentry_name_snapshot(). Fix by moving the release_dentry_name_snapshot() to immediately after the only use. Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10ovl: fix deadlock in splice writeMiklos Szeredi1-1/+46
There's possibility of an ABBA deadlock in case of a splice write to an overlayfs file and a concurrent splice write to a corresponding real file. The call chain for splice to an overlay file: -> do_splice [takes sb_writers on overlay file] -> do_splice_from -> iter_file_splice_write [takes pipe->mutex] -> vfs_iter_write ... -> ovl_write_iter [takes sb_writers on real file] And the call chain for splice to a real file: -> do_splice [takes sb_writers on real file] -> do_splice_from -> iter_file_splice_write [takes pipe->mutex] Syzbot successfully bisected this to commit 82a763e61e2b ("ovl: simplify file splice"). Fix by reverting the write part of the above commit and by adding missing bits from ovl_write_iter() into ovl_splice_write(). Fixes: 82a763e61e2b ("ovl: simplify file splice") Reported-and-tested-by: syzbot+579885d1a9a833336209@syzkaller.appspotmail.com Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10ovl: skip stale entries in merge dir cache iterationAmir Goldstein1-0/+5
On the first getdents call, ovl_iterate() populates the readdir cache with a list of entries, but for upper entries with origin lower inode, p->ino remains zero. Following getdents calls traverse the readdir cache list and call ovl_cache_update_ino() for entries with zero p->ino to lookup the entry in the overlay and return d_ino that is consistent with st_ino. If the upper file was unlinked between the first getdents call and the getdents call that lists the file entry, ovl_cache_update_ino() will not find the entry and fall back to setting d_ino to the upper real st_ino, which is inconsistent with how this object was presented to users. Instead of listing a stale entry with inconsistent d_ino, simply skip the stale entry, which is better for users. xfstest overlay/077 is failing without this patch. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/fstests/CAOQ4uxgR_cLnC_vdU5=seP3fwqVkuZM_-WfD6maFTMbMYq=a9w@mail.gmail.com/ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10io_uring: fix ctx-exit io_rsrc_put_work() deadlockPavel Begunkov1-7/+8
__io_rsrc_put_work() might need ->uring_lock, so nobody should wait for rsrc nodes holding the mutex. However, that's exactly what io_ring_ctx_free() does with io_wait_rsrc_data(). Split it into rsrc wait + dealloc, and move the first one out of the lock. Cc: stable@vger.kernel.org Fixes: b60c8dce33895 ("io_uring: preparation for rsrc tagging") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0130c5c2693468173ec1afab714e0885d2c9c363.1628559783.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-10io_uring: drop ctx->uring_lock before flushing work itemJens Axboe1-2/+4
Ammar reports that he's seeing a lockdep splat on running test/rsrc_tags from the regression suite: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Tainted: G OE ------------------------------------------------------ kworker/2:4/2684 is trying to acquire lock: ffff88814bb1c0a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_rsrc_put_work+0x13d/0x1a0 but task is already holding lock: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}: __flush_work+0x31b/0x490 io_rsrc_ref_quiesce.part.0.constprop.0+0x35/0xb0 __do_sys_io_uring_register+0x45b/0x1060 do_syscall_64+0x35/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #0 (&ctx->uring_lock){+.+.}-{3:3}: __lock_acquire+0x119a/0x1e10 lock_acquire+0xc8/0x2f0 __mutex_lock+0x86/0x740 io_rsrc_put_work+0x13d/0x1a0 process_one_work+0x236/0x530 worker_thread+0x52/0x3b0 kthread+0x135/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock((work_completion)(&(&ctx->rsrc_put_work)->work)); lock(&ctx->uring_lock); lock((work_completion)(&(&ctx->rsrc_put_work)->work)); lock(&ctx->uring_lock); *** DEADLOCK *** 2 locks held by kworker/2:4/2684: #0: ffff88810004d938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 #1: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 stack backtrace: CPU: 2 PID: 2684 Comm: kworker/2:4 Tainted: G OE 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Hardware name: Acer Aspire ES1-421/OLVIA_BE, BIOS V1.05 07/02/2015 Workqueue: events io_rsrc_put_work Call Trace: dump_stack_lvl+0x6a/0x9a check_noncircular+0xfe/0x110 __lock_acquire+0x119a/0x1e10 lock_acquire+0xc8/0x2f0 ? io_rsrc_put_work+0x13d/0x1a0 __mutex_lock+0x86/0x740 ? io_rsrc_put_work+0x13d/0x1a0 ? io_rsrc_put_work+0x13d/0x1a0 ? io_rsrc_put_work+0x13d/0x1a0 ? process_one_work+0x1ce/0x530 io_rsrc_put_work+0x13d/0x1a0 process_one_work+0x236/0x530 worker_thread+0x52/0x3b0 ? process_one_work+0x530/0x530 kthread+0x135/0x160 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 which is due to holding the ctx->uring_lock when flushing existing pending work, while the pending work flushing may need to grab the uring lock if we're using IOPOLL. Fix this by dropping the uring_lock a bit earlier as part of the flush. Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/404 Tested-by: Ammar Faizi <ammarfaizi2@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-10io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker()Hao Xu1-7/+11
There may be cases like: A B spin_lock(wqe->lock) nr_workers is 0 nr_workers++ spin_unlock(wqe->lock) spin_lock(wqe->lock) nr_wokers is 1 nr_workers++ spin_unlock(wqe->lock) create_io_worker() acct->worker is 1 create_io_worker() acct->worker is 1 There should be one worker marked IO_WORKER_F_FIXED, but no one is. Fix this by introduce a new agrument for create_io_worker() to indicate if it is the first worker. Fixes: 3d4e4face9c1 ("io-wq: fix no lock protection of acct->nr_worker") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210808135434.68667-3-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-10io-wq: fix bug of creating io-wokers unconditionallyHao Xu1-2/+10
The former patch to add check between nr_workers and max_workers has a bug, which will cause unconditionally creating io-workers. That's because the result of the check doesn't affect the call of create_io_worker(), fix it by bringing in a boolean value for it. Fixes: 21698274da5b ("io-wq: fix lack of acct->nr_workers < acct->max_workers judgement") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210808135434.68667-2-haoxu@linux.alibaba.com [axboe: drop hunk that isn't strictly needed] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-10io_uring: rsrc ref lock needs to be IRQ safeJens Axboe1-14/+5
Nadav reports running into the below splat on re-enabling softirqs: WARNING: CPU: 2 PID: 1777 at kernel/softirq.c:364 __local_bh_enable_ip+0xaa/0xe0 Modules linked in: CPU: 2 PID: 1777 Comm: umem Not tainted 5.13.1+ #161 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020 RIP: 0010:__local_bh_enable_ip+0xaa/0xe0 Code: a9 00 ff ff 00 74 38 65 ff 0d a2 21 8c 7a e8 ed 1a 20 00 fb 66 0f 1f 44 00 00 5b 41 5c 5d c3 65 8b 05 e6 2d 8c 7a 85 c0 75 9a <0f> 0b eb 96 e8 2d 1f 20 00 eb a5 4c 89 e7 e8 73 4f 0c 00 eb ae 65 RSP: 0018:ffff88812e58fcc8 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000201 RCX: dffffc0000000000 RDX: 0000000000000007 RSI: 0000000000000201 RDI: ffffffff8898c5ac RBP: ffff88812e58fcd8 R08: ffffffff8575dbbf R09: ffffed1028ef14f9 R10: ffff88814778a7c3 R11: ffffed1028ef14f8 R12: ffffffff85c9e9ae R13: ffff88814778a000 R14: ffff88814778a7b0 R15: ffff8881086db890 FS: 00007fbcfee17700(0000) GS:ffff8881e0300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000c0402a5008 CR3: 000000011c1ac003 CR4: 00000000003706e0 Call Trace: _raw_spin_unlock_bh+0x31/0x40 io_rsrc_node_ref_zero+0x13e/0x190 io_dismantle_req+0x215/0x220 io_req_complete_post+0x1b8/0x720 __io_complete_rw.isra.0+0x16b/0x1f0 io_complete_rw+0x10/0x20 where it's clear we end up calling the percpu count release directly from the completion path, as it's in atomic mode and we drop the last ref. For file/block IO, this can be from IRQ context already, and the softirq locking for rsrc isn't enough. Just make the lock fully IRQ safe, and ensure we correctly safe state from the release path as we don't know the full context there. Reported-by: Nadav Amit <nadav.amit@gmail.com> Tested-by: Nadav Amit <nadav.amit@gmail.com> Link: https://lore.kernel.org/io-uring/C187C836-E78B-4A31-B24C-D16919ACA093@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09ucounts: add missing data type changesSven Schnelle2-12/+22
commit f9c82a4ea89c3 ("Increase size of ucounts to atomic_long_t") changed the data type of ucounts/ucounts_max to long, but missed to adjust a few other places. This is noticeable on big endian platforms from user space because the /proc/sys/user/max_*_names files all contain 0. v4 - Made the min and max constants long so the sysctl values are actually settable on little endian machines. -- EWB Fixes: f9c82a4ea89c ("Increase size of ucounts to atomic_long_t") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Acked-by: Alexey Gladkov <legion@kernel.org> v1: https://lkml.kernel.org/r/20210721115800.910778-1-svens@linux.ibm.com v2: https://lkml.kernel.org/r/20210721125233.1041429-1-svens@linux.ibm.com v3: https://lkml.kernel.org/r/20210730062854.3601635-1-svens@linux.ibm.com Link: https://lkml.kernel.org/r/8735rijqlv.fsf_-_@disp2133 Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-08-09configfs: restore the kernel v5.13 text attribute write behaviorBart Van Assche1-12/+6
Instead of appending new text attribute data at the offset specified by the write() system call, only pass the newly written data to the .store() callback. Reported-by: Bodo Stroesser <bostroesser@gmail.com> Tested-by: Bodo Stroesser <bostroesser@gmail.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-09io_uring: Use WRITE_ONCE() when writing to sq_flagsNadav Amit1-4/+9
The compiler should be forbidden from any strange optimization for async writes to user visible data-structures. Without proper protection, the compiler can cause write-tearing or invent writes that would confuse the userspace. However, there are writes to sq_flags which are not protected by WRITE_ONCE(). Use WRITE_ONCE() for these writes. This is purely a theoretical issue. Presumably, any compiler is very unlikely to do such optimizations. Fixes: 75b28affdd6a ("io_uring: allocate the two rings together") Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Nadav Amit <namit@vmware.com> Link: https://lore.kernel.org/r/20210808001342.964634-3-namit@vmware.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09io_uring: clear TIF_NOTIFY_SIGNAL when running task workNadav Amit1-2/+3
When using SQPOLL, the submission queue polling thread calls task_work_run() to run queued work. However, when work is added with TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains set afterwards and is never cleared. Consequently, when the submission queue polling thread checks whether signal_pending(), it may always find a pending signal, if task_work_add() was ever called before. The impact of this bug might be different on different kernel versions. It appears that on 5.14 it would only cause unnecessary calculation and prevent the polling thread from sleeping. On 5.13, where the bug was found, it stops the polling thread from finding newly submitted work. Instead of task_work_run(), use tracehook_notify_signal() that clears TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to current->task_works to avoid a race in which task_works is cleared but the TIF_NOTIFY_SIGNAL is set. Fixes: 685fe7feedb96 ("io-wq: eliminate the need for a manager thread") Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Nadav Amit <namit@vmware.com> Link: https://lore.kernel.org/r/20210808001342.964634-2-namit@vmware.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-07Merge tag 'io_uring-5.14-2021-08-07' of git://git.kernel.dk/linux-blockLinus Torvalds1-26/+45
Pull io_uring from Jens Axboe: "A few io-wq related fixes: - Fix potential nr_worker race and missing max_workers check from one path (Hao) - Fix race between worker exiting and new work queue (me)" * tag 'io_uring-5.14-2021-08-07' of git://git.kernel.dk/linux-block: io-wq: fix lack of acct->nr_workers < acct->max_workers judgement io-wq: fix no lock protection of acct->nr_worker io-wq: fix race between worker exiting and activating free worker
2021-08-06Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds3-5/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "A regression fix, bug fix, and a comment cleanup for ext4" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix potential htree corruption when growing large_dir directories ext4: remove conflicting comment from __ext4_forget ext4: fix potential uninitialized access to retval in kmmpd
2021-08-06ext4: fix potential htree corruption when growing large_dir directoriesTheodore Ts'o1-1/+1
Commit b5776e7524af ("ext4: fix potential htree index checksum corruption) removed a required restart when multiple levels of index nodes need to be split. Fix this to avoid directory htree corruptions when using the large_dir feature. Cc: stable@kernel.org # v5.11 Cc: Благодаренко Артём <artem.blagodarenko@gmail.com> Fixes: b5776e7524af ("ext4: fix potential htree index checksum corruption) Reported-by: Denis <denis@voxelsoft.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-06io-wq: fix lack of acct->nr_workers < acct->max_workers judgementHao Xu1-1/+9
There should be this judgement before we create an io-worker Fixes: 685fe7feedb9 ("io-wq: eliminate the need for a manager thread") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-06io-wq: fix no lock protection of acct->nr_workerHao Xu1-6/+17
There is an acct->nr_worker visit without lock protection. Think about the case: two callers call io_wqe_wake_worker(), one is the original context and the other one is an io-worker(by calling io_wqe_enqueue(wqe, linked)), on two cpus paralelly, this may cause nr_worker to be larger than max_worker. Let's fix it by adding lock for it, and let's do nr_workers++ before create_io_worker. There may be a edge cause that the first caller fails to create an io-worker, but the second caller doesn't know it and then quit creating io-worker as well: say nr_worker = max_worker - 1 cpu 0 cpu 1 io_wqe_wake_worker() io_wqe_wake_worker() nr_worker < max_worker nr_worker++ create_io_worker() nr_worker == max_worker failed return return But the chance of this case is very slim. Fixes: 685fe7feedb9 ("io-wq: eliminate the need for a manager thread") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> [axboe: fix unconditional create_io_worker() call] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-05cifs: create sd context must be a multiple of 8Shyam Prasad N1-1/+1
We used to follow the rule earlier that the create SD context always be a multiple of 8. However, with the change: cifs: refactor create_sd_buf() and and avoid corrupting the buffer ...we recompute the length, and we failed that rule. Fixing that with this change. Cc: <stable@vger.kernel.org> # v5.10+ Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-05pipe: increase minimum default pipe size to 2 pagesAlex Xu (Hello71)1-2/+17
This program always prints 4096 and hangs before the patch, and always prints 8192 and exits successfully after: int main() { int pipefd[2]; for (int i = 0; i < 1025; i++) if (pipe(pipefd) == -1) return 1; size_t bufsz = fcntl(pipefd[1], F_GETPIPE_SZ); printf("%zd\n", bufsz); char *buf = calloc(bufsz, 1); write(pipefd[1], buf, bufsz); read(pipefd[0], buf, bufsz-1); write(pipefd[1], buf, 1); } Note that you may need to increase your RLIMIT_NOFILE before running the program. Fixes: 759c01142a ("pipe: limit the per-user amount of pages allocated in pipes") Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/lkml/1628086770.5rn8p04n6j.none@localhost/ Link: https://lore.kernel.org/lkml/1628127094.lxxn016tj7.none@localhost/ Signed-off-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-04io-wq: fix race between worker exiting and activating free workerJens Axboe1-19/+19
Nadav correctly reports that we have a race between a worker exiting, and new work being queued. This can lead to work being queued behind an existing worker that could be sleeping on an event before it can run to completion, and hence introducing potential big latency gaps if we hit this race condition: cpu0 cpu1 ---- ---- io_wqe_worker() schedule_timeout() // timed out io_wqe_enqueue() io_wqe_wake_worker() // work_flags & IO_WQ_WORK_CONCURRENT io_wqe_activate_free_worker() io_worker_exit() Fix this by having the exiting worker go through the normal decrement of a running worker, which will spawn a new one if needed. The free worker activation is modified to only return success if we were able to find a sleeping worker - if not, we keep looking through the list. If we fail, we create a new worker as per usual. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/BFF746C0-FEDE-4646-A253-3021C57C26C9@gmail.com/ Reported-by: Nadav Amit <nadav.amit@gmail.com> Tested-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-04ceph: take snap_empty_lock atomically with snaprealm refcount changeJeff Layton1-17/+17
There is a race in ceph_put_snap_realm. The change to the nref and the spinlock acquisition are not done atomically, so you could decrement nref, and before you take the spinlock, the nref is incremented again. At that point, you end up putting it on the empty list when it shouldn't be there. Eventually __cleanup_empty_realms runs and frees it when it's still in-use. Fix this by protecting the 1->0 transition with atomic_dec_and_lock, and just drop the spinlock if we can get the rwsem. Because these objects can also undergo a 0->1 refcount transition, we must protect that change as well with the spinlock. Increment locklessly unless the value is at 0, in which case we take the spinlock, increment and then take it off the empty list if it did the 0->1 transition. With these changes, I'm removing the dout() messages from these functions, as well as in __put_snap_realm. They've always been racy, and it's better to not print values that may be misleading. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46419 Reported-by: Mark Nelson <mnelson@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-04ceph: reduce contention in ceph_check_delayed_caps()Luis Henriques3-11/+33
Function ceph_check_delayed_caps() is called from the mdsc->delayed_work workqueue and it can be kept looping for quite some time if caps keep being added back to the mdsc->cap_delay_list. This may result in the watchdog tainting the kernel with the softlockup flag. This patch breaks this loop if the caps have been recently (i.e. during the loop execution). Any new caps added to the list will be handled in the next run. Also, allow schedule_delayed() callers to explicitly set the delay value instead of defaulting to 5s, so we can ensure that it runs soon afterward if it looks like there is more work. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46284 Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-01Merge tag 'xfs-5.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds7-106/+244
Pull xfs fixes from Darrick Wong: "This contains a bunch of bug fixes in XFS. Dave and I have been busy the last couple of weeks to find and fix as many log recovery bugs as we can find; here are the results so far. Go fstests -g recoveryloop! ;) - Fix a number of coordination bugs relating to cache flushes for metadata writeback, cache flushes for multi-buffer log writes, and FUA writes for single-buffer log writes - Fix a bug with incorrect replay of attr3 blocks - Fix unnecessary stalls when flushing logs to disk - Fix spoofing problems when recovering realtime bitmap blocks" * tag 'xfs-5.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: prevent spoofing of rtbitmap blocks when recovering buffers xfs: limit iclog tail updates xfs: need to see iclog flags in tracing xfs: Enforce attr3 buffer recovery order xfs: logging the on disk inode LSN can make it go backwards xfs: avoid unnecessary waits in xfs_log_force_lsn() xfs: log forces imply data device cache flushes xfs: factor out forced iclog flushes xfs: fix ordering violation between cache flushes and tail updates xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog xfs: external logs need to flush data device xfs: flush data dev on external log write