summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2024-06-21NFS: add barriers when testing for NFS_FSDATA_BLOCKEDNeilBrown1-15/+32
[ Upstream commit 99bc9f2eb3f79a2b4296d9bf43153e1d10ca50d3 ] dentry->d_fsdata is set to NFS_FSDATA_BLOCKED while unlinking or renaming-over a file to ensure that no open succeeds while the NFS operation progressed on the server. Setting dentry->d_fsdata to NFS_FSDATA_BLOCKED is done under ->d_lock after checking the refcount is not elevated. Any attempt to open the file (through that name) will go through lookp_open() which will take ->d_lock while incrementing the refcount, we can be sure that once the new value is set, __nfs_lookup_revalidate() *will* see the new value and will block. We don't have any locking guarantee that when we set ->d_fsdata to NULL, the wait_var_event() in __nfs_lookup_revalidate() will notice. wait/wake primitives do NOT provide barriers to guarantee order. We must use smp_load_acquire() in wait_var_event() to ensure we look at an up-to-date value, and must use smp_store_release() before wake_up_var(). This patch adds those barrier functions and factors out block_revalidate() and unblock_revalidate() far clarity. There is also a hypothetical bug in that if memory allocation fails (which never happens in practice) we might leave ->d_fsdata locked. This patch adds the missing call to unblock_revalidate(). Reported-and-tested-by: Richard Kojedzinszky <richard+debian+bugreport@kojedz.in> Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501 Fixes: 3c59366c207e ("NFS: don't unhash dentry during unlink/rename") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-21NFSv4.1 enforce rootpath check in fs_location queryOlga Kornievskaia1-1/+22
[ Upstream commit 28568c906c1bb5f7560e18082ed7d6295860f1c2 ] In commit 4ca9f31a2be66 ("NFSv4.1 test and add 4.1 trunking transport"), we introduce the ability to query the NFS server for possible trunking locations of the existing filesystem. However, we never checked the returned file system path for these alternative locations. According to the RFC, the server can say that the filesystem currently known under "fs_root" of fs_location also resides under these server locations under the following "rootpath" pathname. The client cannot handle trunking a filesystem that reside under different location under different paths other than what the main path is. This patch enforces the check that fs_root path and rootpath path in fs_location reply is the same. Fixes: 4ca9f31a2be6 ("NFSv4.1 test and add 4.1 trunking transport") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-21cachefiles: flush all requests after setting CACHEFILES_DEADBaokun Li2-1/+4
[ Upstream commit 85e833cd7243bda7285492b0653c3abb1e2e757b ] In ondemand mode, when the daemon is processing an open request, if the kernel flags the cache as CACHEFILES_DEAD, the cachefiles_daemon_write() will always return -EIO, so the daemon can't pass the copen to the kernel. Then the kernel process that is waiting for the copen triggers a hung_task. Since the DEAD state is irreversible, it can only be exited by closing /dev/cachefiles. Therefore, after calling cachefiles_io_error() to mark the cache as CACHEFILES_DEAD, if in ondemand mode, flush all requests to avoid the above hungtask. We may still be able to read some of the cached data before closing the fd of /dev/cachefiles. Note that this relies on the patch that adds reference counting to the req, otherwise it may UAF. Fixes: c8383054506c ("cachefiles: notify the user daemon when looking up cookie") Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240522114308.2402121-12-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-21cachefiles: defer exposing anon_fd until after copy_to_user() succeedsBaokun Li1-20/+33
[ Upstream commit 4b4391e77a6bf24cba2ef1590e113d9b73b11039 ] After installing the anonymous fd, we can now see it in userland and close it. However, at this point we may not have gotten the reference count of the cache, but we will put it during colse fd, so this may cause a cache UAF. So grab the cache reference count before fd_install(). In addition, by kernel convention, fd is taken over by the user land after fd_install(), and the kernel should not call close_fd() after that, i.e., it should call fd_install() after everything is ready, thus fd_install() is called after copy_to_user() succeeds. Fixes: c8383054506c ("cachefiles: notify the user daemon when looking up cookie") Suggested-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240522114308.2402121-10-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-21cachefiles: never get a new anonymous fd if ondemand_id is validBaokun Li1-6/+28
[ Upstream commit 4988e35e95fc938bdde0e15880fe72042fc86acf ] Now every time the daemon reads an open request, it gets a new anonymous fd and ondemand_id. With the introduction of "restore", it is possible to read the same open request more than once, and therefore an object can have more than one anonymous fd. If the anonymous fd is not unique, the following concurrencies will result in an fd leak: t1 | t2 | t3 ------------------------------------------------------------ cachefiles_ondemand_init_object cachefiles_ondemand_send_req REQ_A = kzalloc(sizeof(*req) + data_len) wait_for_completion(&REQ_A->done) cachefiles_daemon_read cachefiles_ondemand_daemon_read REQ_A = cachefiles_ondemand_select_req cachefiles_ondemand_get_fd load->fd = fd0 ondemand_id = object_id0 ------ restore ------ cachefiles_ondemand_restore // restore REQ_A cachefiles_daemon_read cachefiles_ondemand_daemon_read REQ_A = cachefiles_ondemand_select_req cachefiles_ondemand_get_fd load->fd = fd1 ondemand_id = object_id1 process_open_req(REQ_A) write(devfd, ("copen %u,%llu", msg->msg_id, size)) cachefiles_ondemand_copen xa_erase(&cache->reqs, id) complete(&REQ_A->done) kfree(REQ_A) process_open_req(REQ_A) // copen fails due to no req // daemon close(fd1) cachefiles_ondemand_fd_release // set object closed -- umount -- cachefiles_withdraw_cookie cachefiles_ondemand_clean_object cachefiles_ondemand_init_close_req if (!cachefiles_ondemand_object_is_open(object)) return -ENOENT; // The fd0 is not closed until the daemon exits. However, the anonymous fd holds the reference count of the object and the object holds the reference count of the cookie. So even though the cookie has been relinquished, it will not be unhashed and freed until the daemon exits. In fscache_hash_cookie(), when the same cookie is found in the hash list, if the cookie is set with the FSCACHE_COOKIE_RELINQUISHED bit, then the new cookie waits for the old cookie to be unhashed, while the old cookie is waiting for the leaked fd to be closed, if the daemon does not exit in time it will trigger a hung task. To avoid this, allocate a new anonymous fd only if no anonymous fd has been allocated (ondemand_id == 0) or if the previously allocated anonymous fd has been closed (ondemand_id == -1). Moreover, returns an error if ondemand_id is valid, letting the daemon know that the current userland restore logic is abnormal and needs to be checked. Fixes: c8383054506c ("cachefiles: notify the user daemon when looking up cookie") Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240522114308.2402121-9-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-21cachefiles: remove err_put_fd label in cachefiles_ondemand_daemon_read()Baokun Li1-29/+16
[ Upstream commit 3e6d704f02aa4c50c7bc5fe91a4401df249a137b ] The err_put_fd label is only used once, so remove it to make the code more readable. In addition, the logic for deleting error request and CLOSE request is merged to simplify the code. Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240522114308.2402121-6-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: 4988e35e95fc ("cachefiles: never get a new anonymous fd if ondemand_id is valid") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-21cachefiles: add spin_lock for cachefiles_ondemand_infoBaokun Li2-1/+35
[ Upstream commit 0a790040838c736495d5afd6b2d636f159f817f1 ] The following concurrency may cause a read request to fail to be completed and result in a hung: t1 | t2 --------------------------------------------------------- cachefiles_ondemand_copen req = xa_erase(&cache->reqs, id) // Anon fd is maliciously closed. cachefiles_ondemand_fd_release xa_lock(&cache->reqs) cachefiles_ondemand_set_object_close(object) xa_unlock(&cache->reqs) cachefiles_ondemand_set_object_open // No one will ever close it again. cachefiles_ondemand_daemon_read cachefiles_ondemand_select_req // Get a read req but its fd is already closed. // The daemon can't issue a cread ioctl with an closed fd, then hung. So add spin_lock for cachefiles_ondemand_info to protect ondemand_id and state, thus we can avoid the above problem in cachefiles_ondemand_copen() by using ondemand_id to determine if fd has been closed. Fixes: c8383054506c ("cachefiles: notify the user daemon when looking up cookie") Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240522114308.2402121-8-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-21cachefiles: fix slab-use-after-free in cachefiles_ondemand_daemon_read()Baokun Li1-0/+3
[ Upstream commit da4a827416066191aafeeccee50a8836a826ba10 ] We got the following issue in a fuzz test of randomly issuing the restore command: ================================================================== BUG: KASAN: slab-use-after-free in cachefiles_ondemand_daemon_read+0xb41/0xb60 Read of size 8 at addr ffff888122e84088 by task ondemand-04-dae/963 CPU: 13 PID: 963 Comm: ondemand-04-dae Not tainted 6.8.0-dirty #564 Call Trace: kasan_report+0x93/0xc0 cachefiles_ondemand_daemon_read+0xb41/0xb60 vfs_read+0x169/0xb50 ksys_read+0xf5/0x1e0 Allocated by task 116: kmem_cache_alloc+0x140/0x3a0 cachefiles_lookup_cookie+0x140/0xcd0 fscache_cookie_state_machine+0x43c/0x1230 [...] Freed by task 792: kmem_cache_free+0xfe/0x390 cachefiles_put_object+0x241/0x480 fscache_cookie_state_machine+0x5c8/0x1230 [...] ================================================================== Following is the process that triggers the issue: mount | daemon_thread1 | daemon_thread2 ------------------------------------------------------------ cachefiles_withdraw_cookie cachefiles_ondemand_clean_object(object) cachefiles_ondemand_send_req REQ_A = kzalloc(sizeof(*req) + data_len) wait_for_completion(&REQ_A->done) cachefiles_daemon_read cachefiles_ondemand_daemon_read REQ_A = cachefiles_ondemand_select_req msg->object_id = req->object->ondemand->ondemand_id ------ restore ------ cachefiles_ondemand_restore xas_for_each(&xas, req, ULONG_MAX) xas_set_mark(&xas, CACHEFILES_REQ_NEW) cachefiles_daemon_read cachefiles_ondemand_daemon_read REQ_A = cachefiles_ondemand_select_req copy_to_user(_buffer, msg, n) xa_erase(&cache->reqs, id) complete(&REQ_A->done) ------ close(fd) ------ cachefiles_ondemand_fd_release cachefiles_put_object cachefiles_put_object kmem_cache_free(cachefiles_object_jar, object) REQ_A->object->ondemand->ondemand_id // object UAF !!! When we see the request within xa_lock, req->object must not have been freed yet, so grab the reference count of object before xa_unlock to avoid the above issue. Fixes: 0a7e54c1959c ("cachefiles: resend an open request if the read request's object is closed") Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240522114308.2402121-5-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-21cachefiles: fix slab-use-after-free in cachefiles_ondemand_get_fd()Baokun Li2-4/+20
[ Upstream commit de3e26f9e5b76fc628077578c001c4a51bf54d06 ] We got the following issue in a fuzz test of randomly issuing the restore command: ================================================================== BUG: KASAN: slab-use-after-free in cachefiles_ondemand_daemon_read+0x609/0xab0 Write of size 4 at addr ffff888109164a80 by task ondemand-04-dae/4962 CPU: 11 PID: 4962 Comm: ondemand-04-dae Not tainted 6.8.0-rc7-dirty #542 Call Trace: kasan_report+0x94/0xc0 cachefiles_ondemand_daemon_read+0x609/0xab0 vfs_read+0x169/0xb50 ksys_read+0xf5/0x1e0 Allocated by task 626: __kmalloc+0x1df/0x4b0 cachefiles_ondemand_send_req+0x24d/0x690 cachefiles_create_tmpfile+0x249/0xb30 cachefiles_create_file+0x6f/0x140 cachefiles_look_up_object+0x29c/0xa60 cachefiles_lookup_cookie+0x37d/0xca0 fscache_cookie_state_machine+0x43c/0x1230 [...] Freed by task 626: kfree+0xf1/0x2c0 cachefiles_ondemand_send_req+0x568/0x690 cachefiles_create_tmpfile+0x249/0xb30 cachefiles_create_file+0x6f/0x140 cachefiles_look_up_object+0x29c/0xa60 cachefiles_lookup_cookie+0x37d/0xca0 fscache_cookie_state_machine+0x43c/0x1230 [...] ================================================================== Following is the process that triggers the issue: mount | daemon_thread1 | daemon_thread2 ------------------------------------------------------------ cachefiles_ondemand_init_object cachefiles_ondemand_send_req REQ_A = kzalloc(sizeof(*req) + data_len) wait_for_completion(&REQ_A->done) cachefiles_daemon_read cachefiles_ondemand_daemon_read REQ_A = cachefiles_ondemand_select_req cachefiles_ondemand_get_fd copy_to_user(_buffer, msg, n) process_open_req(REQ_A) ------ restore ------ cachefiles_ondemand_restore xas_for_each(&xas, req, ULONG_MAX) xas_set_mark(&xas, CACHEFILES_REQ_NEW); cachefiles_daemon_read cachefiles_ondemand_daemon_read REQ_A = cachefiles_ondemand_select_req write(devfd, ("copen %u,%llu", msg->msg_id, size)); cachefiles_ondemand_copen xa_erase(&cache->reqs, id) complete(&REQ_A->done) kfree(REQ_A) cachefiles_ondemand_get_fd(REQ_A) fd = get_unused_fd_flags file = anon_inode_getfile fd_install(fd, file) load = (void *)REQ_A->msg.data; load->fd = fd; // load UAF !!! This issue is caused by issuing a restore command when the daemon is still alive, which results in a request being processed multiple times thus triggering a UAF. So to avoid this problem, add an additional reference count to cachefiles_req, which is held while waiting and reading, and then released when the waiting and reading is over. Note that since there is only one reference count for waiting, we need to avoid the same request being completed multiple times, so we can only complete the request if it is successfully removed from the xarray. Fixes: e73fa11a356c ("cachefiles: add restore command to recover inflight ondemand read requests") Suggested-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240522114308.2402121-4-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-21cachefiles: remove requests from xarray during flushing requestsBaokun Li1-0/+1
[ Upstream commit 0fc75c5940fa634d84e64c93bfc388e1274ed013 ] Even with CACHEFILES_DEAD set, we can still read the requests, so in the following concurrency the request may be used after it has been freed: mount | daemon_thread1 | daemon_thread2 ------------------------------------------------------------ cachefiles_ondemand_init_object cachefiles_ondemand_send_req REQ_A = kzalloc(sizeof(*req) + data_len) wait_for_completion(&REQ_A->done) cachefiles_daemon_read cachefiles_ondemand_daemon_read // close dev fd cachefiles_flush_reqs complete(&REQ_A->done) kfree(REQ_A) xa_lock(&cache->reqs); cachefiles_ondemand_select_req req->msg.opcode != CACHEFILES_OP_READ // req use-after-free !!! xa_unlock(&cache->reqs); xa_destroy(&cache->reqs) Hence remove requests from cache->reqs when flushing them to avoid accessing freed requests. Fixes: c8383054506c ("cachefiles: notify the user daemon when looking up cookie") Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240522114308.2402121-3-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-21jfs: xattr: fix buffer overflow for invalid xattrGreg Kroah-Hartman1-1/+3
commit 7c55b78818cfb732680c4a72ab270cc2d2ee3d0f upstream. When an xattr size is not what is expected, it is printed out to the kernel log in hex format as a form of debugging. But when that xattr size is bigger than the expected size, printing it out can cause an access off the end of the buffer. Fix this all up by properly restricting the size of the debug hex dump in the kernel log. Reported-by: syzbot+9dfe490c8176301c1d06@syzkaller.appspotmail.com Cc: Dave Kleikamp <shaggy@kernel.org> Link: https://lore.kernel.org/r/2024051433-slider-cloning-98f9@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-21eventfs: Update all the eventfs_inodes from the events descriptorSteven Rostedt (Google)1-12/+39
[ Upstream commit 340f0c7067a95281ad13734f8225f49c6cf52067 ] The change to update the permissions of the eventfs_inode had the misconception that using the tracefs_inode would find all the eventfs_inodes that have been updated and reset them on remount. The problem with this approach is that the eventfs_inodes are freed when they are no longer used (basically the reason the eventfs system exists). When they are freed, the updated eventfs_inodes are not reset on a remount because their tracefs_inodes have been freed. Instead, since the events directory eventfs_inode always has a tracefs_inode pointing to it (it is not freed when finished), and the events directory has a link to all its children, have the eventfs_remount() function only operate on the events eventfs_inode and have it descend into its children updating their uid and gids. Link: https://lore.kernel.org/all/CAK7LNARXgaWw3kH9JgrnH4vK6fr8LDkNKf3wq8NhMWJrVwJyVQ@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240523051539.754424703@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Reported-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-16smb: client: fix deadlock in smb2_find_smb_tcon()Enzo Matsumiya1-1/+1
commit 02c418774f76a0a36a6195c9dbf8971eb4130a15 upstream. Unlock cifs_tcp_ses_lock before calling cifs_put_smb_ses() to avoid such deadlock. Cc: stable@vger.kernel.org Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16nilfs2: fix nilfs_empty_dir() misjudgment and long loop on I/O errorsRyusuke Konishi1-1/+1
commit 7373a51e7998b508af7136530f3a997b286ce81c upstream. The error handling in nilfs_empty_dir() when a directory folio/page read fails is incorrect, as in the old ext2 implementation, and if the folio/page cannot be read or nilfs_check_folio() fails, it will falsely determine the directory as empty and corrupt the file system. In addition, since nilfs_empty_dir() does not immediately return on a failed folio/page read, but continues to loop, this can cause a long loop with I/O if i_size of the directory's inode is also corrupted, causing the log writer thread to wait and hang, as reported by syzbot. Fix these issues by making nilfs_empty_dir() immediately return a false value (0) if it fails to get a directory folio/page. Link: https://lkml.kernel.org/r/20240604134255.7165-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+c8166c541d3971bf6c87@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c8166c541d3971bf6c87 Fixes: 2ba466d74ed7 ("nilfs2: directory entry operations") Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16nilfs2: fix potential kernel bug due to lack of writeback flag waitingRyusuke Konishi1-0/+3
commit a4ca369ca221bb7e06c725792ac107f0e48e82e7 upstream. Destructive writes to a block device on which nilfs2 is mounted can cause a kernel bug in the folio/page writeback start routine or writeback end routine (__folio_start_writeback in the log below): kernel BUG at mm/page-writeback.c:3070! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI ... RIP: 0010:__folio_start_writeback+0xbaa/0x10e0 Code: 25 ff 0f 00 00 0f 84 18 01 00 00 e8 40 ca c6 ff e9 17 f6 ff ff e8 36 ca c6 ff 4c 89 f7 48 c7 c6 80 c0 12 84 e8 e7 b3 0f 00 90 <0f> 0b e8 1f ca c6 ff 4c 89 f7 48 c7 c6 a0 c6 12 84 e8 d0 b3 0f 00 ... Call Trace: <TASK> nilfs_segctor_do_construct+0x4654/0x69d0 [nilfs2] nilfs_segctor_construct+0x181/0x6b0 [nilfs2] nilfs_segctor_thread+0x548/0x11c0 [nilfs2] kthread+0x2f0/0x390 ret_from_fork+0x4b/0x80 ret_from_fork_asm+0x1a/0x30 </TASK> This is because when the log writer starts a writeback for segment summary blocks or a super root block that use the backing device's page cache, it does not wait for the ongoing folio/page writeback, resulting in an inconsistent writeback state. Fix this issue by waiting for ongoing writebacks when putting folios/pages on the backing device into writeback state. Link: https://lkml.kernel.org/r/20240530141556.4411-1-konishi.ryusuke@gmail.com Fixes: 9ff05123e3bf ("nilfs2: segment constructor") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16btrfs: re-introduce 'norecovery' mount optionQu Wenruo1-0/+8
commit 440861b1a03c72cc7be4a307e178dcaa6894479b upstream. Although 'norecovery' mount option was marked as deprecated for a long time and a warning message was printed during the deprecation window, it's still actively utilized by several projects that need a safer way to mount a btrfs without any writes. Furthermore this 'norecovery' mount option is supported by other major filesystems, which makes it less clear what's our motivation to remove it. Re-introduce the 'norecovery' mount option, and output a message to recommend 'rescue=nologreplay' option. Link: https://lore.kernel.org/linux-btrfs/ZkxZT0J-z0GYvfy8@gardel-login/#t Link: https://github.com/systemd/systemd/pull/32892 Link: https://bugzilla.suse.com/show_bug.cgi?id=1222429 Reported-by: Lennart Poettering <lennart@poettering.net> Reported-by: Jiri Slaby <jslaby@suse.com> Fixes: a1912f712188 ("btrfs: remove code for inode_cache and recovery mount options") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16btrfs: fix leak of qgroup extent records after transaction abortFilipe Manana1-9/+1
commit fb33eb2ef0d88e75564983ef057b44c5b7e4fded upstream. Qgroup extent records are created when delayed ref heads are created and then released after accounting extents at btrfs_qgroup_account_extents(), called during the transaction commit path. If a transaction is aborted we free the qgroup records by calling btrfs_qgroup_destroy_extent_records() at btrfs_destroy_delayed_refs(), unless we don't have delayed references. We are incorrectly assuming that no delayed references means we don't have qgroup extents records. We can currently have no delayed references because we ran them all during a transaction commit and the transaction was aborted after that due to some error in the commit path. So fix this by ensuring we btrfs_qgroup_destroy_extent_records() at btrfs_destroy_delayed_refs() even if we don't have any delayed references. Reported-by: syzbot+0fecc032fa134afd49df@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/0000000000004e7f980619f91835@google.com/ Fixes: 81f7eb00ff5b ("btrfs: destroy qgroup extent records on transaction abort") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16btrfs: fix crash on racing fsync and size-extending write into preallocOmar Sandoval1-6/+11
commit 9d274c19a71b3a276949933859610721a453946b upstream. We have been seeing crashes on duplicate keys in btrfs_set_item_key_safe(): BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:2620! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs] With the following stack trace: #0 btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4) #1 btrfs_drop_extents (fs/btrfs/file.c:411:4) #2 log_one_extent (fs/btrfs/tree-log.c:4732:9) #3 btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9) #4 btrfs_log_inode (fs/btrfs/tree-log.c:6626:9) #5 btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8) #6 btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8) #7 btrfs_sync_file (fs/btrfs/file.c:1933:8) #8 vfs_fsync_range (fs/sync.c:188:9) #9 vfs_fsync (fs/sync.c:202:9) #10 do_fsync (fs/sync.c:212:9) #11 __do_sys_fdatasync (fs/sync.c:225:9) #12 __se_sys_fdatasync (fs/sync.c:223:1) #13 __x64_sys_fdatasync (fs/sync.c:223:1) #14 do_syscall_x64 (arch/x86/entry/common.c:52:14) #15 do_syscall_64 (arch/x86/entry/common.c:83:7) #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121) So we're logging a changed extent from fsync, which is splitting an extent in the log tree. But this split part already exists in the tree, triggering the BUG(). This is the state of the log tree at the time of the crash, dumped with drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py) to get more details than btrfs_print_leaf() gives us: >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"]) leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610 leaf 33439744 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 7 transid 9 size 8192 nbytes 8473563889606862198 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 204 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417704.983333333 (2024-05-22 15:41:44) mtime 1716417704.983333333 (2024-05-22 15:41:44) otime 17592186044416.000000000 (559444-03-08 01:40:16) item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13 index 195 namelen 3 name: 193 item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 4096 ram 12288 extent compression 0 (none) item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 4096 nr 8192 item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 ... So the real problem happened earlier: notice that items 4 (4k-12k) and 5 (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and item 5 starts at i_size. Here is the state of the filesystem tree at the time of the crash: >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0)) >>> print_extent_buffer(nodes[0]) leaf 30425088 level 0 items 184 generation 9 owner 5 leaf 30425088 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da ... item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160 generation 7 transid 7 size 4096 nbytes 12288 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 6 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417703.220000000 (2024-05-22 15:41:43) mtime 1716417703.220000000 (2024-05-22 15:41:43) otime 1716417703.220000000 (2024-05-22 15:41:43) item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13 index 195 namelen 3 name: 193 item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 8192 ram 12288 extent compression 0 (none) item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 Item 5 in the log tree corresponds to item 183 in the filesystem tree, but nothing matches item 4. Furthermore, item 183 is the last item in the leaf. btrfs_log_prealloc_extents() is responsible for logging prealloc extents beyond i_size. It first truncates any previously logged prealloc extents that start beyond i_size. Then, it walks the filesystem tree and copies the prealloc extent items to the log tree. If it hits the end of a leaf, then it calls btrfs_next_leaf(), which unlocks the tree and does another search. However, while the filesystem tree is unlocked, an ordered extent completion may modify the tree. In particular, it may insert an extent item that overlaps with an extent item that was already copied to the log tree. This may manifest in several ways depending on the exact scenario, including an EEXIST error that is silently translated to a full sync, overlapping items in the log tree, or this crash. This particular crash is triggered by the following sequence of events: - Initially, the file has i_size=4k, a regular extent from 0-4k, and a prealloc extent beyond i_size from 4k-12k. The prealloc extent item is the last item in its B-tree leaf. - The file is fsync'd, which copies its inode item and both extent items to the log tree. - An xattr is set on the file, which sets the BTRFS_INODE_COPY_EVERYTHING flag. - The range 4k-8k in the file is written using direct I/O. i_size is extended to 8k, but the ordered extent is still in flight. - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this calls copy_inode_items_to_log(), which calls btrfs_log_prealloc_extents(). - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the filesystem tree. Since it starts before i_size, it skips it. Since it is the last item in its B-tree leaf, it calls btrfs_next_leaf(). - btrfs_next_leaf() unlocks the path. - The ordered extent completion runs, which converts the 4k-8k part of the prealloc extent to written and inserts the remaining prealloc part from 8k-12k. - btrfs_next_leaf() does a search and finds the new prealloc extent 8k-12k. - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into the log tree. Note that it overlaps with the 4k-12k prealloc extent that was copied to the log tree by the first fsync. - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k extent that was written. - This tries to drop the range 4k-8k in the log tree, which requires adjusting the start of the 4k-12k prealloc extent in the log tree to 8k. - btrfs_set_item_key_safe() sees that there is already an extent starting at 8k in the log tree and calls BUG(). Fix this by detecting when we're about to insert an overlapping file extent item in the log tree and truncating the part that would overlap. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16btrfs: protect folio::private when attaching extent buffer foliosQu Wenruo1-29/+31
commit f3a5367c679d31473d3fbb391675055b4792c309 upstream. [BUG] Since v6.8 there are rare kernel crashes reported by various people, the common factor is bad page status error messages like this: BUG: Bad page state in process kswapd0 pfn:d6e840 page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c pfn:0xd6e840 aops:btree_aops ino:1 flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff) page_type: 0xffffffff() raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0 raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: non-NULL mapping [CAUSE] Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") changes the sequence when allocating a new extent buffer. Previously we always called grab_extent_buffer() under mapping->i_private_lock, to ensure the safety on modification on folio::private (which is a pointer to extent buffer for regular sectorsize). This can lead to the following race: Thread A is trying to allocate an extent buffer at bytenr X, with 4 4K pages, meanwhile thread B is trying to release the page at X + 4K (the second page of the extent buffer at X). Thread A | Thread B -----------------------------------+------------------------------------- | btree_release_folio() | | This is for the page at X + 4K, | | Not page X. | | alloc_extent_buffer() | |- release_extent_buffer() |- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs) | page at bytenr X (the first | | | | page). | | | | Which returned -EEXIST. | | | | | | | |- filemap_lock_folio() | | | | Returned the first page locked. | | | | | | | |- grab_extent_buffer() | | | | |- atomic_inc_not_zero() | | | | | Returned false | | | | |- folio_detach_private() | | |- folio_detach_private() for X | |- folio_test_private() | | |- folio_test_private() | Returned true | | | Returned true |- folio_put() | |- folio_put() Now there are two puts on the same folio at folio X, leading to refcount underflow of the folio X, and eventually causing the BUG_ON() on the page->mapping. The condition is not that easy to hit: - The release must be triggered for the middle page of an eb If the release is on the same first page of an eb, page lock would kick in and prevent the race. - folio_detach_private() has a very small race window It's only between folio_test_private() and folio_clear_private(). That's exactly when mapping->i_private_lock is used to prevent such race, and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") screwed that up. At that time, I thought the page lock would kick in as filemap_release_folio() also requires the page to be locked, but forgot the filemap_release_folio() only locks one page, not all pages of an extent buffer. [FIX] Move all the code requiring i_private_lock into attach_eb_folio_to_filemap(), so that everything is done with proper lock protection. Furthermore to prevent future problems, add an extra lockdep_assert_locked() to ensure we're holding the proper lock. To reproducer that is able to hit the race (takes a few minutes with instrumented code inserting delays to alloc_extent_buffer()): #!/bin/sh drop_caches () { while(true); do echo 3 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/compact_memory done } run_tar () { while(true); do for x in `seq 1 80` ; do tar cf /dev/zero /mnt > /dev/null & done wait done } mkfs.btrfs -f -d single -m single /dev/vda mount -o noatime /dev/vda /mnt # create 200,000 files, 1K each ./simoop -n 200000 -E -f 1k /mnt drop_caches & (run_tar) Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/ Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/ Reported-by: Toralf Förster <toralf.foerster@gmx.de> Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/ Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") CC: stable@vger.kernel.org # 6.8+ CC: Chris Mason <clm@fb.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16btrfs: qgroup: fix qgroup id collision across mountsBoris Burkov1-0/+20
commit 2b8aa78cf1279ec5e418baa26bfed5df682568d8 upstream. If we delete subvolumes whose ID is the largest in the filesystem, then unmount and mount again, then btrfs_init_root_free_objectid on the tree_root will select a subvolid smaller than that one and thus allow reusing it. If we are also using qgroups (and particularly squotas) it is possible to delete the subvol without deleting the qgroup. In that case, we will be able to create a new subvol whose id already has a level 0 qgroup. This will result in re-using that qgroup which would then lead to incorrect accounting. Fixes: 6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation") CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16btrfs: qgroup: update rescan message levels and error codesDavid Sterba1-7/+5
commit 1fa7603d569b9e738e9581937ba8725cd7d39b48 upstream. On filesystems without enabled quotas there's still a warning message in the logs when rescan is called. In that case it's not a problem that should be reported, rescan can be called unconditionally. Change the error code to ENOTCONN which is used for 'quotas not enabled' elsewhere. Remove message (also a warning) when rescan is called during an ongoing rescan, this brings no useful information and the error code is sufficient. Change message levels to debug for now, they can be removed eventually. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16tracefs: Clear EVENT_INODE flag in tracefs_drop_inode()Steven Rostedt (Google)1-16/+17
commit 0bcfd9aa4dafa03b88d68bf66b694df2a3e76cf3 upstream. When the inode is being dropped from the dentry, the TRACEFS_EVENT_INODE flag needs to be cleared to prevent a remount from calling eventfs_remount() on the tracefs_inode private data. There's a race between the inode is dropped (and the dentry freed) to where the inode is actually freed. If a remount happens between the two, the eventfs_inode could be accessed after it is freed (only the dentry keeps a ref count on it). Currently the TRACEFS_EVENT_INODE flag is cleared from the dentry iput() function. But this is incorrect, as it is possible that the inode has another reference to it. The flag should only be cleared when the inode is really being dropped and has no more references. That happens in the drop_inode callback of the inode, as that gets called when the last reference of the inode is released. Remove the tracefs_d_iput() function and move its logic to the more appropriate tracefs_drop_inode() callback function. Link: https://lore.kernel.org/linux-trace-kernel/20240523051539.908205106@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16eventfs: Keep the directories from having the same inode number as filesSteven Rostedt (Google)1-1/+5
commit 8898e7f288c47d450a3cf1511c791a03550c0789 upstream. The directories require unique inode numbers but all the eventfs files have the same inode number. Prevent the directories from having the same inode numbers as the files as that can confuse some tooling. Link: https://lore.kernel.org/linux-trace-kernel/20240523051539.428826685@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Fixes: 834bf76add3e6 ("eventfs: Save directory inodes in the eventfs_inode structure") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16eventfs: Fix a possible null pointer dereference in eventfs_find_events()Hao Ge1-4/+3
commit d4e9a968738bf66d3bb852dd5588d4c7afd6d7f4 upstream. In function eventfs_find_events,there is a potential null pointer that may be caused by calling update_events_attr which will perform some operations on the members of the ei struct when ei is NULL. Hence,When ei->is_freed is set,return NULL directly. Link: https://lore.kernel.org/linux-trace-kernel/20240513053338.63017-1-hao.ge@linux.dev Cc: stable@vger.kernel.org Fixes: 8186fff7ab64 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Hao Ge <gehao@kylinos.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16NFS: Fix READ_PLUS when server doesn't support OP_READ_PLUSAnna Schumaker1-1/+1
commit f06d1b10cb016d5aaecdb1804fefca025387bd10 upstream. Olga showed me a case where the client was sending multiple READ_PLUS calls to the server in parallel, and the server replied NFS4ERR_OPNOTSUPP to each. The client would fall back to READ for the first reply, but fail to retry the other calls. I fix this by removing the test for NFS_CAP_READ_PLUS in nfs4_read_plus_not_supported(). This allows us to reschedule any READ_PLUS call that has a NFS4ERR_OPNOTSUPP return value, even after the capability has been cleared. Reported-by: Olga Kornievskaia <kolga@netapp.com> Fixes: c567552612ec ("NFS: Add READ_PLUS data segment support") Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16nfs: fix undefined behavior in nfs_block_bits()Sergey Shtylyov1-2/+2
commit 3c0a2e0b0ae661457c8505fecc7be5501aa7a715 upstream. Shifting *signed int* typed constant 1 left by 31 bits causes undefined behavior. Specify the correct *unsigned long* type by using 1UL instead. Found by Linux Verification Center (linuxtesting.org) with the Svace static analysis tool. Cc: stable@vger.kernel.org Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16cifs: fix creating sockets when using sfu mount optionsSteve French3-1/+8
commit 518549c120e671c4906f77d1802b97e9b23f673a upstream. When running fstest generic/423 with sfu mount option, it was being skipped due to inability to create sockets: generic/423 [not run] cifs does not support mknod/mkfifo which can also be easily reproduced with their af_unix tool: ./src/af_unix /mnt1/socket-two bind: Operation not permitted Fix sfu mount option to allow creating and reporting sockets. Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16ext4: fix mb_cache_entry's e_refcnt leak in ext4_xattr_block_cache_find()Baokun Li1-1/+3
commit 0c0b4a49d3e7f49690a6827a41faeffad5df7e21 upstream. Syzbot reports a warning as follows: ============================================ WARNING: CPU: 0 PID: 5075 at fs/mbcache.c:419 mb_cache_destroy+0x224/0x290 Modules linked in: CPU: 0 PID: 5075 Comm: syz-executor199 Not tainted 6.9.0-rc6-gb947cc5bf6d7 RIP: 0010:mb_cache_destroy+0x224/0x290 fs/mbcache.c:419 Call Trace: <TASK> ext4_put_super+0x6d4/0xcd0 fs/ext4/super.c:1375 generic_shutdown_super+0x136/0x2d0 fs/super.c:641 kill_block_super+0x44/0x90 fs/super.c:1675 ext4_kill_sb+0x68/0xa0 fs/ext4/super.c:7327 [...] ============================================ This is because when finding an entry in ext4_xattr_block_cache_find(), if ext4_sb_bread() returns -ENOMEM, the ce's e_refcnt, which has already grown in the __entry_find(), won't be put away, and eventually trigger the above issue in mb_cache_destroy() due to reference count leakage. So call mb_cache_entry_put() on the -ENOMEM error branch as a quick fix. Reported-by: syzbot+dd43bd0f7474512edc47@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=dd43bd0f7474512edc47 Fixes: fb265c9cb49e ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240504075526.2254349-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16ext4: set type of ac_groups_linear_remaining to __u32 to avoid overflowBaokun Li1-1/+1
commit 9a9f3a9842927e4af7ca10c19c94dad83bebd713 upstream. Now ac_groups_linear_remaining is of type __u16 and s_mb_max_linear_groups is of type unsigned int, so an overflow occurs when setting a value above 65535 through the mb_max_linear_groups sysfs interface. Therefore, the type of ac_groups_linear_remaining is set to __u32 to avoid overflow. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-8-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16ext4: Fixes len calculation in mpage_journal_page_buffersRitesh Harjani (IBM)1-1/+1
commit c2a09f3d782de952f09a3962d03b939e7fa7ffa4 upstream. Truncate operation can race with writeback, in which inode->i_size can get truncated and therefore size - folio_pos() can be negative. This fixes the len calculation. However this path doesn't get easily triggered even with data journaling. Cc: stable@kernel.org # v6.5 Fixes: 80be8c5cc925 ("Fixes: ext4: Make mpage_journal_page_buffers use folio") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/cff4953b5c9306aba71e944ab176a5d396b9a1b7.1709182250.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16mm: /proc/pid/smaps_rollup: avoid skipping vma after getting mmap_lock againYuanyuan Zhong1-2/+7
commit 6d065f507d82307d6161ac75c025111fb8b08a46 upstream. After switching smaps_rollup to use VMA iterator, searching for next entry is part of the condition expression of the do-while loop. So the current VMA needs to be addressed before the continue statement. Otherwise, with some VMAs skipped, userspace observed memory consumption from /proc/pid/smaps_rollup will be smaller than the sum of the corresponding fields from /proc/pid/smaps. Link: https://lkml.kernel.org/r/20240523183531.2535436-1-yzhong@purestorage.com Fixes: c4c84f06285e ("fs/proc/task_mmu: stop using linked list and highest_vm_end") Signed-off-by: Yuanyuan Zhong <yzhong@purestorage.com> Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16mm/ksm: fix ksm_zero_pages accountingChengming Zhou1-1/+1
commit c2dc78b86e0821ecf9a9d0c35dba2618279a5bb6 upstream. We normally ksm_zero_pages++ in ksmd when page is merged with zero page, but ksm_zero_pages-- is done from page tables side, where there is no any accessing protection of ksm_zero_pages. So we can read very exceptional value of ksm_zero_pages in rare cases, such as -1, which is very confusing to users. Fix it by changing to use atomic_long_t, and the same case with the mm->ksm_zero_pages. Link: https://lkml.kernel.org/r/20240528-b4-ksm-counters-v3-2-34bb358fdc13@linux.dev Fixes: e2942062e01d ("ksm: count all zero pages placed by KSM") Fixes: 6080d19f0704 ("ksm: add ksm zero pages for each process") Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Stefan Roesch <shr@devkernel.io> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16iomap: fault in smaller chunks for non-large folio mappingsXu Yang1-1/+1
commit 4e527d5841e24623181edc7fd6f6598ffa810e10 upstream. Since commit (5d8edfb900d5 "iomap: Copy larger chunks from userspace"), iomap will try to copy in larger chunks than PAGE_SIZE. However, if the mapping doesn't support large folio, only one page of maximum 4KB will be created and 4KB data will be writen to pagecache each time. Then, next 4KB will be handled in next iteration. This will cause potential write performance problem. If chunk is 2MB, total 512 pages need to be handled finally. During this period, fault_in_iov_iter_readable() is called to check iov_iter readable validity. Since only 4KB will be handled each time, below address space will be checked over and over again: start end - buf, buf+2MB buf+4KB, buf+2MB buf+8KB, buf+2MB ... buf+2044KB buf+2MB Obviously the checking size is wrong since only 4KB will be handled each time. So this will get a correct chunk to let iomap work well in non-large folio case. With this change, the write speed will be stable. Tested on ARM64 device. Before: - dd if=/dev/zero of=/dev/sda bs=400K count=10485 (334 MB/s) - dd if=/dev/zero of=/dev/sda bs=800K count=5242 (278 MB/s) - dd if=/dev/zero of=/dev/sda bs=1600K count=2621 (204 MB/s) - dd if=/dev/zero of=/dev/sda bs=2200K count=1906 (170 MB/s) - dd if=/dev/zero of=/dev/sda bs=3000K count=1398 (150 MB/s) - dd if=/dev/zero of=/dev/sda bs=4500K count=932 (139 MB/s) After: - dd if=/dev/zero of=/dev/sda bs=400K count=10485 (339 MB/s) - dd if=/dev/zero of=/dev/sda bs=800K count=5242 (330 MB/s) - dd if=/dev/zero of=/dev/sda bs=1600K count=2621 (332 MB/s) - dd if=/dev/zero of=/dev/sda bs=2200K count=1906 (333 MB/s) - dd if=/dev/zero of=/dev/sda bs=3000K count=1398 (333 MB/s) - dd if=/dev/zero of=/dev/sda bs=4500K count=932 (333 MB/s) Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") Cc: stable@vger.kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Link: https://lore.kernel.org/r/20240521114939.2541461-2-xu.yang_2@nxp.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-169p: add missing locking around taking dentry fid listDominique Martinet1-2/+7
commit c898afdc15645efb555acb6d85b484eb40a45409 upstream. Fix a use-after-free on dentry's d_fsdata fid list when a thread looks up a fid through dentry while another thread unlinks it: UAF thread: refcount_t: addition on 0; use-after-free. p9_fid_get linux/./include/net/9p/client.h:262 v9fs_fid_find+0x236/0x280 linux/fs/9p/fid.c:129 v9fs_fid_lookup_with_uid linux/fs/9p/fid.c:181 v9fs_fid_lookup+0xbf/0xc20 linux/fs/9p/fid.c:314 v9fs_vfs_getattr_dotl+0xf9/0x360 linux/fs/9p/vfs_inode_dotl.c:400 vfs_statx+0xdd/0x4d0 linux/fs/stat.c:248 Freed by: p9_fid_destroy (inlined) p9_client_clunk+0xb0/0xe0 linux/net/9p/client.c:1456 p9_fid_put linux/./include/net/9p/client.h:278 v9fs_dentry_release+0xb5/0x140 linux/fs/9p/vfs_dentry.c:55 v9fs_remove+0x38f/0x620 linux/fs/9p/vfs_inode.c:518 vfs_unlink+0x29a/0x810 linux/fs/namei.c:4335 The problem is that d_fsdata was not accessed under d_lock, because d_release() normally is only called once the dentry is otherwise no longer accessible but since we also call it explicitly in v9fs_remove that lock is required: move the hlist out of the dentry under lock then unref its fids once they are no longer accessible. Fixes: 154372e67d40 ("fs/9p: fix create-unlink-getattr idiom") Cc: stable@vger.kernel.org Reported-by: Meysam Firouzi Reported-by: Amirmohammad Eftekhar Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Message-ID: <20240521122947.1080227-1-asmadeus@codewreck.org> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operationTyler Hicks (Microsoft)1-22/+20
commit 0a960ba49869ebe8ff859d000351504dd6b93b68 upstream. The following commits loosened the permissions of /proc/<PID>/fdinfo/ directory, as well as the files within it, from 0500 to 0555 while also introducing a PTRACE_MODE_READ check between the current task and <PID>'s task: - commit 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ") - commit 1927e498aee1 ("procfs: prevent unprivileged processes accessing fdinfo dir") Before those changes, inode based system calls like inotify_add_watch(2) would fail when the current task didn't have sufficient read permissions: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE| IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = -1 EACCES (Permission denied) [...] This matches the documented behavior in the inotify_add_watch(2) man page: ERRORS EACCES Read access to the given file is not permitted. After those changes, inotify_add_watch(2) started succeeding despite the current task not having PTRACE_MODE_READ privileges on the target task: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE| IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = 1757 openat(AT_FDCWD, "/proc/1/task/1/fdinfo", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 EACCES (Permission denied) [...] This change in behavior broke .NET prior to v7. See the github link below for the v7 commit that inadvertently/quietly (?) fixed .NET after the kernel changes mentioned above. Return to the old behavior by moving the PTRACE_MODE_READ check out of the file .open operation and into the inode .permission operation: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE| IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = -1 EACCES (Permission denied) [...] Reported-by: Kevin Parsons (Microsoft) <parsonskev@gmail.com> Link: https://github.com/dotnet/runtime/commit/89e5469ac591b82d38510fe7de98346cce74ad4f Link: https://stackoverflow.com/questions/75379065/start-self-contained-net6-build-exe-as-service-on-raspbian-system-unauthorizeda Fixes: 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ") Cc: stable@vger.kernel.org Cc: Christian Brauner <brauner@kernel.org> Cc: Christian König <christian.koenig@amd.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Hardik Garg <hargar@linux.microsoft.com> Cc: Allen Pais <apais@linux.microsoft.com> Signed-off-by: Tyler Hicks (Microsoft) <code@tyhicks.com> Link: https://lore.kernel.org/r/20240501005646.745089-1-code@tyhicks.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16fsverity: use register_sysctl_init() to avoid kmemleak warningEric Biggers1-6/+1
commit ee5814dddefbaa181cb247a75676dd5103775db1 upstream. Since the fsverity sysctl registration runs as a builtin initcall, there is no corresponding sysctl deregistration and the resulting struct ctl_table_header is not used. This can cause a kmemleak warning just after the system boots up. (A pointer to the ctl_table_header is stored in the fsverity_sysctl_header static variable, which kmemleak should detect; however, the compiler can optimize out that variable.) Avoid the kmemleak warning by using register_sysctl_init() which is intended for use by builtin initcalls and uses kmemleak_not_leak(). Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/r/CAHj4cs8DTSvR698UE040rs_pX1k-WVe7aR6N2OoXXuhXJPDC-w@mail.gmail.com Cc: stable@vger.kernel.org Reviewed-by: Joel Granados <j.granados@samsung.com> Link: https://lore.kernel.org/r/20240501025331.594183-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16btrfs: qgroup: fix initialization of auto inherit arrayDan Carpenter1-1/+1
commit 0e39c9e524479b85c1b83134df0cfc6e3cb5353a upstream. The "i++" was accidentally left out so it just sets qgids[0] over and over. This can lead to unexpected problems, as the groups[1:] would be all 0, leading to later find_qgroup_rb() unable to find a qgroup and cause snapshot creation failure. Fixes: 5343cd9364ea ("btrfs: qgroup: simple quota auto hierarchy for nested subvolumes") CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16f2fs: fix to do sanity check on i_xattr_nid in sanity_check_inode()Chao Yu1-0/+6
commit 20faaf30e55522bba2b56d9c46689233205d7717 upstream. syzbot reports a kernel bug as below: F2FS-fs (loop0): Mounted with checkpoint version = 48b305e4 ================================================================== BUG: KASAN: slab-out-of-bounds in f2fs_test_bit fs/f2fs/f2fs.h:2933 [inline] BUG: KASAN: slab-out-of-bounds in current_nat_addr fs/f2fs/node.h:213 [inline] BUG: KASAN: slab-out-of-bounds in f2fs_get_node_info+0xece/0x1200 fs/f2fs/node.c:600 Read of size 1 at addr ffff88807a58c76c by task syz-executor280/5076 CPU: 1 PID: 5076 Comm: syz-executor280 Not tainted 6.9.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 f2fs_test_bit fs/f2fs/f2fs.h:2933 [inline] current_nat_addr fs/f2fs/node.h:213 [inline] f2fs_get_node_info+0xece/0x1200 fs/f2fs/node.c:600 f2fs_xattr_fiemap fs/f2fs/data.c:1848 [inline] f2fs_fiemap+0x55d/0x1ee0 fs/f2fs/data.c:1925 ioctl_fiemap fs/ioctl.c:220 [inline] do_vfs_ioctl+0x1c07/0x2e50 fs/ioctl.c:838 __do_sys_ioctl fs/ioctl.c:902 [inline] __se_sys_ioctl+0x81/0x170 fs/ioctl.c:890 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f The root cause is we missed to do sanity check on i_xattr_nid during f2fs_iget(), so that in fiemap() path, current_nat_addr() will access nat_bitmap w/ offset from invalid i_xattr_nid, result in triggering kasan bug report, fix it. Reported-and-tested-by: syzbot+3694e283cf5c40df6d14@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/00000000000094036c0616e72a1d@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16erofs: avoid allocating DEFLATE streams before mountingGao Xiang1-26/+29
commit 80eb4f62056d6ae709bdd0636ab96ce660f494b2 upstream. Currently, each DEFLATE stream takes one 32 KiB permanent internal window buffer even if there is no running instance which uses DEFLATE algorithm. It's unexpected and wasteful on embedded devices with limited resources and servers with hundreds of CPU cores if DEFLATE is enabled but unused. Fixes: ffa09b3bd024 ("erofs: DEFLATE compression support") Cc: <stable@vger.kernel.org> # 6.6+ Reviewed-by: Sandeep Dhavale <dhavale@google.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240520090106.2898681-1-hsiangkao@linux.alibaba.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16afs: Don't cross .backup mountpoint from backup volumeMarc Dionne1-0/+5
commit 29be9100aca2915fab54b5693309bc42956542e5 upstream. Don't cross a mountpoint that explicitly specifies a backup volume (target is <vol>.backup) when starting from a backup volume. It it not uncommon to mount a volume's backup directly in the volume itself. This can cause tools that are not paying attention to get into a loop mounting the volume onto itself as they attempt to traverse the tree, leading to a variety of problems. This doesn't prevent the general case of loops in a sequence of mountpoints, but addresses a common special case in the same way as other afs clients. Reported-by: Jan Henrik Sylvester <jan.henrik.sylvester@uni-hamburg.de> Link: http://lists.infradead.org/pipermail/linux-afs/2024-May/008454.html Reported-by: Markus Suvanto <markus.suvanto@gmail.com> Link: http://lists.infradead.org/pipermail/linux-afs/2024-February/008074.html Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/768760.1716567475@warthog.procyon.org.uk Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-12cifs: Fix missing set of remote_i_sizeDavid Howells2-3/+4
[ Upstream commit 93a43155127fec0f8cc942d63b76668c2f8f69fa ] Occasionally, the generic/001 xfstest will fail indicating corruption in one of the copy chains when run on cifs against a server that supports FSCTL_DUPLICATE_EXTENTS_TO_FILE (eg. Samba with a share on btrfs). The problem is that the remote_i_size value isn't updated by cifs_setsize() when called by smb2_duplicate_extents(), but i_size *is*. This may cause cifs_remap_file_range() to then skip the bit after calling ->duplicate_extents() that sets sizes. Fix this by calling netfs_resize_file() in smb2_duplicate_extents() before calling cifs_setsize() to set i_size. This means we don't then need to call netfs_resize_file() upon return from ->duplicate_extents(), but we also fix the test to compare against the pre-dup inode size. [Note that this goes back before the addition of remote_i_size with the netfs_inode struct. It should probably have been setting cifsi->server_eof previously.] Fixes: cfc63fc8126a ("smb3: fix cached file size problems in duplicate extents (reflink)") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12cifs: Set zero_point in the copy_file_range() and remap_file_range()David Howells1-0/+6
[ Upstream commit 3758c485f6c9124d8ad76b88382004cbc28a0892 ] Set zero_point in the copy_file_range() and remap_file_range() implementations so that we don't skip reading data modified on a server-side copy. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Stable-dep-of: 93a43155127f ("cifs: Fix missing set of remote_i_size") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12netfs: Fix setting of BDP_ASYNC from iocb flagsDavid Howells1-1/+1
[ Upstream commit c596bea1452ddf172ec9b588e4597228e9a1f4d5 ] Fix netfs_perform_write() to set BDP_ASYNC if IOCB_NOWAIT is set rather than if IOCB_SYNC is not set. It reflects asynchronicity in the sense of not waiting rather than synchronicity in the sense of not returning until the op is complete. Without this, generic/590 fails on cifs in strict caching mode with a complaint that one of the writes fails with EAGAIN. The test can be distilled down to: mount -t cifs /my/share /mnt -ostuff xfs_io -i -c 'falloc 0 8191M -c fsync -f /mnt/file xfs_io -i -c 'pwrite -b 1M -W 0 8191M' /mnt/file Fixes: c38f4e96e605 ("netfs: Provide func to copy data to pagecache for buffered write") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/316306.1716306586@warthog.procyon.org.uk Reviewed-by: Jens Axboe <axboe@kernel.dk> cc: Jeff Layton <jlayton@kernel.org> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12pNFS/filelayout: fixup pNfs allocation modesOlga Kornievskaia1-2/+2
[ Upstream commit 3ebcb24646f8c5bfad2866892d3f3cff05514452 ] Change left over allocation flags. Fixes: a245832aaa99 ("pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12nfs: keep server info for remountsMartin Kaiser1-3/+6
[ Upstream commit b322bf9e983addedff0894c55e92d58f4d16d92a ] With newer kernels that use fs_context for nfs mounts, remounts fail with -EINVAL. $ mount -t nfs -o nolock 10.0.0.1:/tmp/test /mnt/test/ $ mount -t nfs -o remount /mnt/test/ mount: mounting 10.0.0.1:/tmp/test on /mnt/test failed: Invalid argument For remounts, the nfs server address and port are populated by nfs_init_fs_context and later overwritten with 0x00 bytes by nfs23_parse_monolithic. The remount then fails as the server address is invalid. Fix this by not overwriting nfs server info in nfs23_parse_monolithic if we're doing a remount. Fixes: f2aedb713c28 ("NFS: Add fs_context support.") Signed-off-by: Martin Kaiser <martin@kaiser.cx> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12NFSv4: Fixup smatch warning for ambiguous returnBenjamin Coddington1-7/+5
[ Upstream commit 37ffe06537af3e3ec212e7cbe941046fce0a822f ] Dan Carpenter reports smatch warning for nfs4_try_migration() when a memory allocation failure results in a zero return value. In this case, a transient allocation failure error will likely be retried the next time the server responds with NFS4ERR_MOVED. We can fixup the smatch warning with a small refactor: attempt all three allocations before testing and returning on a failure. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: c3ed222745d9 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12fs/ntfs3: Use variable length array instead of fixed sizeKonstantin Komarov1-1/+1
[ Upstream commit 1997cdc3e727526aa5d84b32f7cbb3f56459b7ef ] Should fix smatch warning: ntfs_set_label() error: __builtin_memcpy() 'uni->name' too small (20 vs 256) Fixes: 4534a70b7056f ("fs/ntfs3: Add headers and misc files") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202401091421.3RJ24Mn3-lkp@intel.com/ Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12fs/ntfs3: Use 64 bit variable to avoid 32 bit overflowKonstantin Komarov1-1/+2
[ Upstream commit e931f6b630ffb22d66caab202a52aa8cbb10c649 ] For example, in the expression: vbo = 2 * vbo + skip Fixes: b46acd6a6a627 ("fs/ntfs3: Add NTFS journal") Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12fs/ntfs3: Check 'folio' pointer for NULLKonstantin Komarov1-6/+11
[ Upstream commit 1cd6c96219c429ebcfa8e79a865277376c563803 ] It can be NULL if bmap is called. Fixes: 82cae269cfa95 ("fs/ntfs3: Add initialization of super block") Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12ocfs2: correctly use ocfs2_find_next_zero_bit()Joseph Qi3-18/+9
[ Upstream commit 30dd3478c3cd7d01cc5afc4952e885ba4eefb730 ] If no bits are zero, ocfs2_find_next_zero_bit() will return max size, so check the return value with -1 is meaningless. Correct this usage and cleanup the code. Link: https://lkml.kernel.org/r/20240314021713.240796-1-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 28d2188709d9 ("selftests/harness: use 1024 in place of LINE_MAX") Signed-off-by: Sasha Levin <sashal@kernel.org>