summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-11-20eventfs: Check for NULL ef in eventfs_set_attr()Steven Rostedt (Google)1-2/+2
The top level events directory dentry does not have a d_fsdata set to a eventfs_file pointer. This dentry is still passed to eventfs_set_attr(). It can not assume that the d_fsdata is set. Check for that. Link: https://lore.kernel.org/all/20231112104158.6638-1-milian.wolff@kdab.com/ Fixes: 9aaee3eebc91 ("eventfs: Save ownership and mode") Reported-by: Milian Wolff <milian.wolff@kdab.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-20fs: dlm: Simplify buffer size computation in dlm_create_debug_file()Christophe JAILLET1-5/+5
[ Upstream commit 19b3102c0b5350621e7492281f2be0f071fb7e31 ] Use sizeof(name) instead of the equivalent, but hard coded, DLM_LOCKSPACE_LEN + 8. This is less verbose and more future proof. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20f2fs: fix to initialize map.m_pblk in f2fs_precache_extents()Chao Yu1-0/+1
[ Upstream commit 8b07c1fb0f1ad139373c8253f2fad8bc43fab07d ] Otherwise, it may print random physical block address in tracepoint of f2fs_map_blocks() as below: f2fs_map_blocks: dev = (253,16), ino = 2297, file offset = 0, start blkaddr = 0xa356c421, len = 0x0, flags = 0 Fixes: c4020b2da4c9 ("f2fs: support F2FS_IOC_PRECACHE_EXTENTS") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20debugfs: Fix __rcu type comparison warningMike Tipton1-1/+1
[ Upstream commit 7360a48bd0f5e62b2d00c387d5d3f2821eb290ce ] Sparse reports the following: fs/debugfs/file.c:942:9: sparse: sparse: incompatible types in comparison expression (different address spaces): fs/debugfs/file.c:942:9: sparse: char [noderef] __rcu * fs/debugfs/file.c:942:9: sparse: char * rcu_assign_pointer() expects that it's assigning to pointers annotated with __rcu. We can't annotate the generic struct file::private_data, so cast it instead. Fixes: 86b5488121db ("debugfs: Add write support to debugfs_create_str()") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202309091933.BRWlSnCq-lkp@intel.com/ Signed-off-by: Mike Tipton <quic_mdtipton@quicinc.com> Link: https://lore.kernel.org/r/20230922134512.5126-1-quic_mdtipton@quicinc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20f2fs: fix to drop meta_inode's page cache in f2fs_put_super()Chao Yu1-1/+1
[ Upstream commit a4639380bbe66172df329f8b54aa7d2e943f0f64 ] syzbot reports a kernel bug as below: F2FS-fs (loop1): detect filesystem reference count leak during umount, type: 10, count: 1 kernel BUG at fs/f2fs/super.c:1639! CPU: 0 PID: 15451 Comm: syz-executor.1 Not tainted 6.5.0-syzkaller-09338-ge0152e7481c6 #0 RIP: 0010:f2fs_put_super+0xce1/0xed0 fs/f2fs/super.c:1639 Call Trace: generic_shutdown_super+0x161/0x3c0 fs/super.c:693 kill_block_super+0x3b/0x70 fs/super.c:1646 kill_f2fs_super+0x2b7/0x3d0 fs/f2fs/super.c:4879 deactivate_locked_super+0x9a/0x170 fs/super.c:481 deactivate_super+0xde/0x100 fs/super.c:514 cleanup_mnt+0x222/0x3d0 fs/namespace.c:1254 task_work_run+0x14d/0x240 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x210/0x240 kernel/entry/common.c:204 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x1d/0x60 kernel/entry/common.c:296 do_syscall_64+0x44/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x63/0xcd In f2fs_put_super(), it tries to do sanity check on dirty and IO reference count of f2fs, once there is any reference count leak, it will trigger panic. The root case is, during f2fs_put_super(), if there is any IO error in f2fs_wait_on_all_pages(), we missed to truncate meta_inode's page cache later, result in panic, fix this case. Fixes: 20872584b8c0 ("f2fs: fix to drop all dirty meta/node pages during umount()") Reported-by: syzbot+ebd7072191e2eddd7d6e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/000000000000a14f020604a62a98@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20f2fs: compress: fix to avoid redundant compress extensionChao Yu1-0/+33
[ Upstream commit 7e1b150fece033703a824df1bbc03df091ea53cc ] With below script, redundant compress extension will be parsed and added by parse_options(), because parse_options() doesn't check whether the extension is existed or not, fix it. 1. mount -t f2fs -o compress_extension=so /dev/vdb /mnt/f2fs 2. mount -t f2fs -o remount,compress_extension=so /mnt/f2fs 3. mount|grep f2fs /dev/vdb on /mnt/f2fs type f2fs (...,compress_extension=so,compress_extension=so,...) Fixes: 4c8ff7095bef ("f2fs: support data compression") Fixes: 151b1982be5d ("f2fs: compress: add nocompress extensions support") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20f2fs: compress: fix to avoid use-after-free on dicChao Yu1-1/+3
[ Upstream commit b0327c84e91a0f4f0abced8cb83ec86a7083f086 ] Call trace: __memcpy+0x128/0x250 f2fs_read_multi_pages+0x940/0xf7c f2fs_mpage_readpages+0x5a8/0x624 f2fs_readahead+0x5c/0x110 page_cache_ra_unbounded+0x1b8/0x590 do_sync_mmap_readahead+0x1dc/0x2e4 filemap_fault+0x254/0xa8c f2fs_filemap_fault+0x2c/0x104 __do_fault+0x7c/0x238 do_handle_mm_fault+0x11bc/0x2d14 do_mem_abort+0x3a8/0x1004 el0_da+0x3c/0xa0 el0t_64_sync_handler+0xc4/0xec el0t_64_sync+0x1b4/0x1b8 In f2fs_read_multi_pages(), once f2fs_decompress_cluster() was called if we hit cached page in compress_inode's cache, dic may be released, it needs break the loop rather than continuing it, in order to avoid accessing invalid dic pointer. Fixes: 6ce19aff0b8c ("f2fs: compress: add compress_inode to cache compressed blocks") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20f2fs: compress: fix deadloop in f2fs_write_cache_pages()Chao Yu1-2/+18
[ Upstream commit c5d3f9b7649abb20aa5ab3ebff9421a171eaeb22 ] With below mount option and testcase, it hangs kernel. 1. mount -t f2fs -o compress_log_size=5 /dev/vdb /mnt/f2fs 2. touch /mnt/f2fs/file 3. chattr +c /mnt/f2fs/file 4. dd if=/dev/zero of=/mnt/f2fs/file bs=1MB count=1 5. sync 6. dd if=/dev/zero of=/mnt/f2fs/file bs=111 count=11 conv=notrunc 7. sync INFO: task sync:4788 blocked for more than 120 seconds. Not tainted 6.5.0-rc1+ #322 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:sync state:D stack:0 pid:4788 ppid:509 flags:0x00000002 Call Trace: <TASK> __schedule+0x335/0xf80 schedule+0x6f/0xf0 wb_wait_for_completion+0x5e/0x90 sync_inodes_sb+0xd8/0x2a0 sync_inodes_one_sb+0x1d/0x30 iterate_supers+0x99/0xf0 ksys_sync+0x46/0xb0 __do_sys_sync+0x12/0x20 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 The reason is f2fs_all_cluster_page_ready() assumes that pages array should cover at least one cluster, otherwise, it will always return false, result in deadloop. By default, pages array size is 16, and it can cover the case cluster_size is equal or less than 16, for the case cluster_size is larger than 16, let's allocate memory of pages array dynamically. Fixes: 4c8ff7095bef ("f2fs: support data compression") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20erofs: fix erofs_insert_workgroup() lockref usageGao Xiang2-7/+2
[ Upstream commit 1a0ac8bd7a4fa5b2f4ef14c3b1e9d6e5a5faae06 ] As Linus pointed out [1], lockref_put_return() is fundamentally designed to be something that can fail. It behaves as a fastpath-only thing, and the failure case needs to be handled anyway. Actually, since the new pcluster was just allocated without being populated, it won't be accessed by others until it is inserted into XArray, so lockref helpers are actually unneeded here. Let's just set the proper reference count on initializing. [1] https://lore.kernel.org/r/CAHk-=whCga8BeQnJ3ZBh_Hfm9ctba_wpF444LpwRybVNMzO6Dw@mail.gmail.com Fixes: 7674a42f35ea ("erofs: use struct lockref to replace handcrafted approach") Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20231031060524.1103921-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20dlm: fix no ack after final messageAlexander Aring1-3/+3
[ Upstream commit 6212e4528b248a4bc9b4fe68e029a84689c67461 ] In case of an final DLM message we can't should not send an ack out after the final message. This patch moves the ack message before the messages will be transmitted. If it's the final message and the receiving node turns into DLM_CLOSED state another ack messages will being received and turning the receiving node into DLM_ESTABLISHED again. Fixes: 1696c75f1864 ("fs: dlm: add send ack threshold and append acks to msgs") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20dlm: be sure we reset all nodes at forced shutdownAlexander Aring1-2/+8
[ Upstream commit e759eb3e27e5b624930548f1c0eda90da6e26ee9 ] In case we running in a force shutdown in either midcomms or lowcomms implementation we will make sure we reset all per midcomms node information. Fixes: 63e711b08160 ("fs: dlm: create midcomms nodes when configure") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20dlm: fix remove member after close callAlexander Aring1-1/+12
[ Upstream commit 2776635edc7fcd62e03cb2efb93c31f685887460 ] The idea of commit 63e711b08160 ("fs: dlm: create midcomms nodes when configure") is to set the midcomms node lifetime when a node joins or leaves the cluster. Currently we can hit the following warning: [10844.611495] ------------[ cut here ]------------ [10844.615913] WARNING: CPU: 4 PID: 84304 at fs/dlm/midcomms.c:1263 dlm_midcomms_remove_member+0x13f/0x180 [dlm] or running in a state where we hit a midcomms node usage count in a negative value: [ 260.830782] node 2 users dec count -1 The first warning happens when the a specific node does not exists and it was probably removed but dlm_midcomms_close() which is called when a node leaves the cluster. The second kernel log message is probably in a case when dlm_midcomms_addr() is called when a joined the cluster but due fencing a node leaved the cluster without getting removed from the lockspace. If the node joins the cluster and it was removed from the cluster due fencing the first call is to remove the node from lockspaces triggered by the user space. In both cases if the node wasn't found or the user count is zero, we should ignore any additional midcomms handling of dlm_midcomms_remove_member(). Fixes: 63e711b08160 ("fs: dlm: create midcomms nodes when configure") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20dlm: fix creating multiple node structuresAlexander Aring1-1/+9
[ Upstream commit fe9b619e6e94acf0b068fb1a8f658f5a96b8fad7 ] This patch will lookup existing nodes instead of always creating them when dlm_midcomms_addr() is called. The idea is here to create midcomms nodes when user space getting informed that nodes joins the cluster. This is the case when dlm_midcomms_addr() is called, however it can be called multiple times by user space to add several address configurations to one node e.g. when using SCTP. Those multiple times need to be filtered out and we doing that by looking up if the node exists before. Due configfs entry it is safe that this function gets only called once at a time. Fixes: 63e711b08160 ("fs: dlm: create midcomms nodes when configure") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20fs: dlm: Fix the size of a buffer in dlm_create_debug_file()Christophe JAILLET1-1/+2
[ Upstream commit b859e01054354033f480d9df41b0ebc2c7537379 ] 8 is not the maximum size of the suffix used when creating debugfs files. Let the compiler compute the correct size, and only give a hint about the longest possible string that is used. When building with W=1, this fixes the following warnings: fs/dlm/debug_fs.c: In function ‘dlm_create_debug_file’: fs/dlm/debug_fs.c:1020:58: error: ‘snprintf’ output may be truncated before the last format character [-Werror=format-truncation=] 1020 | snprintf(name, DLM_LOCKSPACE_LEN + 8, "%s_waiters", ls->ls_name); | ^ fs/dlm/debug_fs.c:1020:9: note: ‘snprintf’ output between 9 and 73 bytes into a destination of size 72 1020 | snprintf(name, DLM_LOCKSPACE_LEN + 8, "%s_waiters", ls->ls_name); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/dlm/debug_fs.c:1031:50: error: ‘_queued_asts’ directive output may be truncated writing 12 bytes into a region of size between 8 and 72 [-Werror=format-truncation=] 1031 | snprintf(name, DLM_LOCKSPACE_LEN + 8, "%s_queued_asts", ls->ls_name); | ^~~~~~~~~~~~ fs/dlm/debug_fs.c:1031:9: note: ‘snprintf’ output between 13 and 77 bytes into a destination of size 72 1031 | snprintf(name, DLM_LOCKSPACE_LEN + 8, "%s_queued_asts", ls->ls_name); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: 541adb0d4d10b ("fs: dlm: debugfs for queued callbacks") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20ext4: move 'ix' sanity check to corrent positionGou Hao1-5/+5
[ Upstream commit af90a8f4a09ec4a3de20142e37f37205d4687f28 ] Check 'ix' before it is used. Fixes: 80e675f906db ("ext4: optimize memmmove lengths in extent/index insertions") Signed-off-by: Gou Hao <gouhao@uniontech.com> Link: https://lore.kernel.org/r/20230906013341.7199-1-gouhao@uniontech.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20ext4: add missing initialization of call_notify_error in update_super_work()Theodore Ts'o1-1/+2
[ Upstream commit ee6a12d0d4d85f3833d177cd382cd417f0ef011b ] Fixes: ff0722de896e ("ext4: add periodic superblock update check") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20pstore/platform: Add check for kstrdupJiasheng Jiang1-1/+8
[ Upstream commit a19d48f7c5d57c0f0405a7d4334d1d38fe9d3c1c ] Add check for the return value of kstrdup() and return the error if it fails in order to avoid NULL pointer dereference. Fixes: 563ca40ddf40 ("pstore/platform: Switch pstore_info::name to const") Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn> Link: https://lore.kernel.org/r/20230623022706.32125-1-jiasheng@iscas.ac.cn Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbsJingbo Xu1-12/+29
[ Upstream commit 6654408a33e6297d8e1d2773409431d487399b95 ] The cgwb cleanup routine will try to release the dying cgwb by switching the attached inodes. It fetches the attached inodes from wb->b_attached list, omitting the fact that inodes only with dirty timestamps reside in wb->b_dirty_time list, which is the case when lazytime is enabled. This causes enormous zombie memory cgroup when lazytime is enabled, as inodes with dirty timestamps can not be switched to a live cgwb for a long time. It is reasonable not to switch cgwb for inodes with dirty data, as otherwise it may break the bandwidth restrictions. However since the writeback of inode metadata is not accounted for, let's also switch inodes with dirty timestamps to avoid zombie memory and block cgroups when laztytime is enabled. Fixes: c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20231014125511.102978-1-jefflexu@linux.alibaba.com Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20nfsd: Handle EOPENSTALE correctly in the filecacheTrond Myklebust3-25/+34
[ Upstream commit d59b3515ab021e010fdc58a8f445ea62dd2f7f4c ] The nfsd_open code handles EOPENSTALE correctly, by retrying the call to fh_verify() and __nfsd_open(). However the filecache just drops the error on the floor, and immediately returns nfserr_stale to the caller. This patch ensures that we propagate the EOPENSTALE code back to nfsd_file_do_acquire, and that we handle it correctly. Fixes: 65294c1f2c5e ("nfsd: add a new struct file caching facility to nfsd") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230911183027.11372-1-trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-08eventfs: Use simple_recursive_removal() to clean up dentriesSteven Rostedt (Google)1-38/+33
commit 407c6726ca71b33330d2d6345d9ea7ebc02575e9 upstream Looking at how dentry is removed via the tracefs system, I found that eventfs does not do everything that it did under tracefs. The tracefs removal of a dentry calls simple_recursive_removal() that does a lot more than a simple d_invalidate(). As it should be a requirement that any eventfs_inode that has a dentry, so does its parent. When removing a eventfs_inode, if it has a dentry, a call to simple_recursive_removal() on that dentry should clean up all the dentries underneath it. Add WARN_ON_ONCE() to check for the parent having a dentry if any children do. Link: https://lore.kernel.org/all/20231101022553.GE1957730@ZenIV/ Link: https://lkml.kernel.org/r/20231101172650.552471568@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Fixes: 5bdcd5f5331a2 ("eventfs: Implement removal of meta data from eventfs") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-08eventfs: Delete eventfs_inode when the last dentry is freedSteven Rostedt (Google)1-76/+74
commit 020010fbfa202aa528a52743eba4ab0da3400a4e upstream There exists a race between holding a reference of an eventfs_inode dentry and the freeing of the eventfs_inode. If user space has a dentry held long enough, it may still be able to access the dentry's eventfs_inode after it has been freed. To prevent this, have he eventfs_inode freed via the last dput() (or via RCU if the eventfs_inode does not have a dentry). This means reintroducing the eventfs_inode del_list field at a temporary place to put the eventfs_inode. It needs to mark it as freed (via the list) but also must invalidate the dentry immediately as the return from eventfs_remove_dir() expects that they are. But the dentry invalidation must not be called under the eventfs_mutex, so it must be done after the eventfs_inode is marked as free (put on a deletion list). Link: https://lkml.kernel.org/r/20231101172650.123479767@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Fixes: 5bdcd5f5331a2 ("eventfs: Implement removal of meta data from eventfs") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-08eventfs: Save ownership and modeSteven Rostedt (Google)1-16/+91
commit 28e12c09f5aa081b2d13d1340e3610070b6c624d upstream Now that inodes and dentries are created on the fly, they are also reclaimed on memory pressure. Since the ownership and file mode are saved in the inode, if they are freed, any changes to the ownership and mode will be lost. To counter this, if the user changes the permissions or ownership, save them, and when creating the inodes again, restore those changes. Link: https://lkml.kernel.org/r/20231101172649.691841445@goodmis.org Cc: stable@vger.kernel.org Cc: Ajay Kaher <akaher@vmware.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 63940449555e7 ("eventfs: Implement eventfs lookup, read, open functions") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-08eventfs: Remove "is_freed" union with rcu headSteven Rostedt (Google)1-3/+5
commit f2f496370afcbc5227d7002da28c74b91fed12ff upstream The eventfs_inode->is_freed was a union with the rcu_head with the assumption that when it was on the srcu list the head would contain a pointer which would make "is_freed" true. But that was a wrong assumption as the rcu head is a single link list where the last element is NULL. Instead, split the nr_entries integer so that "is_freed" is one bit and the nr_entries is the next 31 bits. As there shouldn't be more than 10 (currently there's at most 5 to 7 depending on the config), this should not be a problem. Link: https://lkml.kernel.org/r/20231101172649.049758712@goodmis.org Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Fixes: 63940449555e7 ("eventfs: Implement eventfs lookup, read, open functions") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-28Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-1/+1
Pull misc filesystem fixes from Al Viro: "Assorted fixes all over the place: literally nothing in common, could have been three separate pull requests. All are simple regression fixes, but not for anything from this cycle" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed sparc32: fix a braino in fault handling in csum_and_copy_..._user()
2023-10-28ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lockAl Viro1-1/+1
Use of dget() after we'd dropped ->d_lock is too late - dentry might be gone by that point. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-10-24Merge tag 'pull-nfsd-fix' of ↵Linus Torvalds1-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull nfsd fix from Al Viro: "Catch from lock_rename() audit; nfsd_rename() checked that both directories belonged to the same filesystem, but only after having done lock_rename(). Trivial fix, tested and acked by nfs folks" * tag 'pull-nfsd-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: nfsd: lock_rename() needs both directories to live on the same fs
2023-10-23Merge tag 'for-6.6-rc7-tag' of ↵Linus Torvalds5-15/+33
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more fix for a problem with snapshot of a newly created subvolume that can lead to inconsistent data under some circumstances. Kernel 6.5 added a performance optimization to skip transaction commit for subvolume creation but this could end up with newer data on disk but not linked to other structures. The fix itself is an added condition, the rest of the patch is a parameter added to several functions" * tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix unwritten extent buffer after snapshotting a new subvolume
2023-10-23btrfs: fix unwritten extent buffer after snapshotting a new subvolumeFilipe Manana5-15/+33
When creating a snapshot of a subvolume that was created in the current transaction, we can end up not persisting a dirty extent buffer that is referenced by the snapshot, resulting in IO errors due to checksum failures when trying to read the extent buffer later from disk. A sequence of steps that leads to this is the following: 1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical address 36007936, for the leaf/root of a new subvolume that has an ID of 291. We mark the extent buffer as dirty, and at this point the subvolume tree has a single node/leaf which is also its root (level 0); 2) We no longer commit the transaction used to create the subvolume at create_subvol(). We used to, but that was recently removed in commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create"); 3) The transaction used to create the subvolume has an ID of 33, so the extent buffer 36007936 has a generation of 33; 4) Several updates happen to subvolume 291 during transaction 33, several files created and its tree height changes from 0 to 1, so we end up with a new root at level 1 and the extent buffer 36007936 is now a leaf of that new root node, which is extent buffer 36048896. The commit root remains as 36007936, since we are still at transaction 33; 5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and we end up at transaction.c:create_pending_snapshot(), in the critical section of a transaction commit. There we COW the root of subvolume 291, which is extent buffer 36048896. The COW operation returns extent buffer 36048896, since there's no need to COW because the extent buffer was created in this transaction and it was not written yet. The we call btrfs_copy_root() against the root node 36048896. During this operation we allocate a new extent buffer to turn into the root node of the snapshot, copy the contents of the root node 36048896 into this snapshot root extent buffer, set the owner to 292 (the ID of the snapshot), etc, and then we call btrfs_inc_ref(). This will create a delayed reference for each leaf pointed by the root node with a reference root of 292 - this includes a reference for the leaf 36007936. After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state. Then we call btrfs_insert_dir_item(), to create the directory entry in in the tree of subvolume 291 that points to the snapshot. This ends up needing to modify leaf 36007936 to insert the respective directory items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state, we need to COW the leaf. We end up at btrfs_force_cow_block() and then at update_ref_for_cow(). At update_ref_for_cow() we call btrfs_block_can_be_shared() which returns false, despite the fact the leaf 36007936 is shared - the subvolume's root and the snapshot's root point to that leaf. The reason that it incorrectly returns false is because the commit root of the subvolume is extent buffer 36007936 - it was the initial root of the subvolume when we created it. So btrfs_block_can_be_shared() which has the following logic: int btrfs_block_can_be_shared(struct btrfs_root *root, struct extent_buffer *buf) { if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) && buf != root->node && buf != root->commit_root && (btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item) || btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))) return 1; return 0; } Returns false (0) since 'buf' (extent buffer 36007936) matches the root's commit root. As a result, at update_ref_for_cow(), we don't check for the number of references for extent buffer 36007936, we just assume it's not shared and therefore that it has only 1 reference, so we set the local variable 'refs' to 1. Later on, in the final if-else statement at update_ref_for_cow(): static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, struct extent_buffer *cow, int *last_ref) { (...) if (refs > 1) { (...) } else { (...) btrfs_clear_buffer_dirty(trans, buf); *last_ref = 1; } } So we mark the extent buffer 36007936 as not dirty, and as a result we don't write it to disk later in the transaction commit, despite the fact that the snapshot's root points to it. Attempting to access the leaf or dumping the tree for example shows that the extent buffer was not written: $ btrfs inspect-internal dump-tree -t 292 /dev/sdb btrfs-progs v6.2.2 file tree key (292 ROOT_ITEM 33) node 36110336 level 1 items 2 free space 119 generation 33 owner 292 node 36110336 flags 0x1(WRITTEN) backref revision 1 checksum stored a8103e3e checksum calced a8103e3e fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07 key (256 INODE_ITEM 0) block 36007936 gen 33 key (257 EXTENT_DATA 0) block 36052992 gen 33 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 total bytes 107374182400 bytes used 38572032 uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 The respective on disk region is full of zeroes as the device was trimmed at mkfs time. Obviously 'btrfs check' also detects and complains about this: $ btrfs check /dev/sdb Opening filesystem to check... Checking filesystem on /dev/sdb UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 generation: 33 (33) [1/7] checking root items [2/7] checking extents checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 owner ref check failed [36007936 4096] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 The following tree block(s) is corrupted in tree 292: tree block bytenr: 36110336, level: 1, node key: (256, 1, 0) root 292 root dir 256 not found ERROR: errors found in fs roots found 38572032 bytes used, error(s) found total csum bytes: 16048 total tree bytes: 1265664 total fs tree bytes: 1118208 total extent tree bytes: 65536 btree space waste bytes: 562598 file data blocks allocated: 65978368 referenced 36569088 Fix this by updating btrfs_block_can_be_shared() to consider that an extent buffer may be shared if it matches the commit root and if its generation matches the current transaction's generation. This can be reproduced with the following script: $ cat test.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi # Use a filesystem with a 64K node size so that we have the same node # size on every machine regardless of its page size (on x86_64 default # node size is 16K due to the 4K page size, while on PPC it's 64K by # default). This way we can make sure we are able to create a btree for # the subvolume with a height of 2. mkfs.btrfs -f -n 64K $DEV mount $DEV $MNT btrfs subvolume create $MNT/subvol # Create a few empty files on the subvolume, this bumps its btree # height to 2 (root node at level 1 and 2 leaves). for ((i = 1; i <= 300; i++)); do echo -n > $MNT/subvol/file_$i done btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap umount $DEV btrfs check $DEV Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment): $ ./test.sh Create subvolume '/mnt/sdi/subvol' Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap' Opening filesystem to check... Checking filesystem on /dev/sdi UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1 [1/7] checking root items [2/7] checking extents parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure owner ref check failed [30539776 65536] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0) Wrong generation of child node/leaf, wanted: 5, have: 7 root 257 root dir 256 not found ERROR: errors found in fs roots found 917504 bytes used, error(s) found total csum bytes: 0 total tree bytes: 851968 total fs tree bytes: 393216 total extent tree bytes: 65536 btree space waste bytes: 736550 file data blocks allocated: 0 referenced 0 A test case for fstests will follow soon. Fixes: 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create") CC: stable@vger.kernel.org # 6.5+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-21Merge tag 'iomap-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-3/+7
Pull iomap fix from Darrick Wong: - Fix a bug where a writev consisting of a bunch of sub-fsblock writes where the last buffer address is invalid could lead to an infinite loop * tag 'iomap-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: fix short copy in iomap_write_iter()
2023-10-21Merge tag 'nfs-for-6.6-4' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds5-21/+38
Pull NFS client fixes from Anna Schumaker: "Stable Fix: - Fix a pNFS hang in nfs4_evict_inode() Fixes: - Force update of suid/sgid bits after an NFS v4.2 ALLOCATE op - Fix a potential oops in nfs_inode_remove_request() - Check the validity of the layout pointer in ff_layout_mirror_prepare_stats() - Fix incorrectly marking the pNFS MDS with USE_PNFS_DS in some cases" * tag 'nfs-for-6.6-4' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS server pNFS/flexfiles: Check the layout validity in ff_layout_mirror_prepare_stats pNFS: Fix a hang in nfs4_evict_inode() NFS: Fix potential oops in nfs_inode_remove_request() nfs42: client needs to strip file mode's suid/sgid bit after ALLOCATE op
2023-10-21Merge tag 'fsnotify_for_v6.6-rc7' of ↵Linus Torvalds1-8/+17
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fanotify fix from Jan Kara: "Disable superblock / mount marks for filesystems that can encode file handles but not open them (currently only overlayfs). It is not clear the functionality is useful in any way so let's better disable it before someone comes up with some creative misuse" * tag 'fsnotify_for_v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: limit reporting of event with non-decodeable file handles
2023-10-19iomap: fix short copy in iomap_write_iter()Jan Stancek1-3/+7
Starting with commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace"), iomap_write_iter() can get into endless loop. This can be reproduced with LTP writev07 which uses partially valid iovecs: struct iovec wr_iovec[] = { { buffer, 64 }, { bad_addr, 64 }, { buffer + 64, 64 }, { buffer + 64 * 2, 64 }, }; commit bc1bb416bbb9 ("generic_perform_write()/iomap_write_actor(): saner logics for short copy") previously introduced the logic, which made short copy retry in next iteration with amount of "bytes" it managed to copy: if (unlikely(status == 0)) { /* * A short copy made iomap_write_end() reject the * thing entirely. Might be memory poisoning * halfway through, might be a race with munmap, * might be severe memory pressure. */ if (copied) bytes = copied; However, since 5d8edfb900d5 "bytes" is no longer carried into next iteration, because it is now always initialized at the beginning of the loop. And for iov_iter_count < PAGE_SIZE, "bytes" ends up with same value as previous iteration, making the loop retry same copy over and over, which leads to writev07 testcase hanging. Make next iteration retry with amount of bytes we managed to copy. Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") Signed-off-by: Jan Stancek <jstancek@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-19Merge tag 'v6.6-rc7.vfs.fixes' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fix from Christian Brauner: "An openat() call from io_uring triggering an audit call can apparently cause the refcount of struct filename to be incremented from multiple threads concurrently during async execution, triggering a refcount underflow and hitting a BUG_ON(). That bug has been lurking around since at least v5.16 apparently. Switch to an atomic counter to fix that. The underflow check is downgraded from a BUG_ON() to a WARN_ON_ONCE() but we could easily remove that check altogether tbh" * tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: audit,io_uring: io_uring openat triggers audit reference count underflow
2023-10-19Merge tag 'ntfs3_for_6.6' of ↵Linus Torvalds16-82/+197
https://github.com/Paragon-Software-Group/linux-ntfs3 Pull ntfs3 fixes from Konstantin Komarov: - memory leak - some logic errors, NULL dereferences - some code was refactored - more sanity checks * tag 'ntfs3_for_6.6' of https://github.com/Paragon-Software-Group/linux-ntfs3: fs/ntfs3: Avoid possible memory leak fs/ntfs3: Fix directory element type detection fs/ntfs3: Fix possible null-pointer dereference in hdr_find_e() fs/ntfs3: Fix OOB read in ntfs_init_from_boot fs/ntfs3: fix panic about slab-out-of-bounds caused by ntfs_list_ea() fs/ntfs3: Fix NULL pointer dereference on error in attr_allocate_frame() fs/ntfs3: Fix possible NULL-ptr-deref in ni_readpage_cmpr() fs/ntfs3: Do not allow to change label if volume is read-only fs/ntfs3: Add more info into /proc/fs/ntfs3/<dev>/volinfo fs/ntfs3: Refactoring and comments fs/ntfs3: Fix alternative boot searching fs/ntfs3: Allow repeated call to ntfs3_put_sbi fs/ntfs3: Use inode_set_ctime_to_ts instead of inode_set_ctime fs/ntfs3: Fix shift-out-of-bounds in ntfs_fill_super fs/ntfs3: fix deadlock in mark_as_free_ex fs/ntfs3: Add more attributes checks in mi_enum_attr() fs/ntfs3: Use kvmalloc instead of kmalloc(... __GFP_NOWARN) fs/ntfs3: Write immediately updated ntfs state fs/ntfs3: Add ckeck in ni_update_parent()
2023-10-19Merge tag 'for-6.6-rc6-tag' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "Fix a bug in chunk size decision that could lead to suboptimal placement and filling patterns" * tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix stripe length calculation for non-zoned data chunk allocation
2023-10-19fanotify: limit reporting of event with non-decodeable file handlesAmir Goldstein1-8/+17
Commit a95aef69a740 ("fanotify: support reporting non-decodeable file handles") merged in v6.5-rc1, added the ability to use an fanotify group with FAN_REPORT_FID mode to watch filesystems that do not support nfs export, but do know how to encode non-decodeable file handles, with the newly introduced AT_HANDLE_FID flag. At the time that this commit was merged, there were no filesystems in-tree with those traits. Commit 16aac5ad1fa9 ("ovl: support encoding non-decodable file handles"), merged in v6.6-rc1, added this trait to overlayfs, thus allowing fanotify watching of overlayfs with FAN_REPORT_FID mode. In retrospect, allowing an fanotify filesystem/mount mark on such filesystem in FAN_REPORT_FID mode will result in getting events with file handles, without the ability to resolve the filesystem objects from those file handles (i.e. no open_by_handle_at() support). For v6.6, the safer option would be to allow this mode for inode marks only, where the caller has the opportunity to use name_to_handle_at() at the time of setting the mark. In the future we can revise this decision. Fixes: a95aef69a740 ("fanotify: support reporting non-decodeable file handles") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20231018100000.2453965-2-amir73il@gmail.com>
2023-10-18NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS serverOlga Kornievskaia1-2/+0
This patches fixes commit 51d674a5e488 "NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server", purpose of that commit was to mark EXCHANGE_ID to the DS with the appropriate flag. However, connection to MDS can return both EXCHGID4_FLAG_USE_PNFS_DS and EXCHGID4_FLAG_USE_PNFS_MDS set but previous patch would only remember the USE_PNFS_DS and for the 2nd EXCHANGE_ID send that to the MDS. Instead, just mark the pnfs path exclusively. Fixes: 51d674a5e488 ("NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-10-18pNFS/flexfiles: Check the layout validity in ff_layout_mirror_prepare_statsTrond Myklebust1-7/+10
Ensure that we check the layout pointer and validity after dereferencing it in ff_layout_mirror_prepare_stats. Fixes: 08e2e5bc6c9a ("pNFS/flexfiles: Clean up layoutstats") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-10-18pNFS: Fix a hang in nfs4_evict_inode()Trond Myklebust1-10/+23
We are not allowed to call pnfs_mark_matching_lsegs_return() without also holding a reference to the layout header, since doing so could lead to the reference count going to zero when we call pnfs_layout_remove_lseg(). This again can lead to a hang when we get to nfs4_evict_inode() and are unable to clear the layout pointer. pnfs_layout_return_unused_byserver() is guilty of this behaviour, and has been seen to trigger the refcount warning prior to a hang. Fixes: b6d49ecd1081 ("NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-10-17nfsd: lock_rename() needs both directories to live on the same fsAl Viro1-6/+6
... checking that after lock_rename() is too late. Incidentally, NFSv2 had no nfserr_xdev... Fixes: aa387d6ce153 "nfsd: fix EXDEV checking in rename" Cc: stable@vger.kernel.org # v3.9+ Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-10-15btrfs: fix stripe length calculation for non-zoned data chunk allocationZygo Blaxell1-1/+1
Commit f6fca3917b4d "btrfs: store chunk size in space-info struct" broke data chunk allocations on non-zoned multi-device filesystems when using default chunk_size. Commit 5da431b71d4b "btrfs: fix the max chunk size and stripe length calculation" partially fixed that, and this patch completes the fix for that case. After commit f6fca3917b4d and 5da431b71d4b, the sequence of events for a data chunk allocation on a non-zoned filesystem is: 1. btrfs_create_chunk calls init_alloc_chunk_ctl, which copies space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len unmodified. Before f6fca3917b4d, ctl->max_stripe_len value was 1 GiB for non-zoned data chunks and not configurable. 2. btrfs_create_chunk calls gather_device_info which consumes and produces more fields of chunk_ctl. 3. gather_device_info multiplies ctl->max_stripe_len by ctl->dev_stripes (which is 1 in all cases except dup) and calls find_free_dev_extent with that number as num_bytes. 4. find_free_dev_extent locates the first dev_extent hole on a device which is at least as large as num_bytes. With default max_chunk_size from f6fca3917b4d, it finds the first hole which is longer than 10 GiB, or the largest hole if that hole is shorter than 10 GiB. This is different from the pre-f6fca3917b4d behavior, where num_bytes is 1 GiB, and find_free_dev_extent may choose a different hole. 5. gather_device_info repeats step 4 with all devices to find the first or largest dev_extent hole that can be allocated on each device. 6. gather_device_info sorts the device list by the hole size on each device, using total unallocated space on each device to break ties, then returns to btrfs_create_chunk with the list. 7. btrfs_create_chunk calls decide_stripe_size_regular. 8. decide_stripe_size_regular finds the largest stripe_len that fits across the first nr_devs device dev_extent holes that were found by gather_device_info (and satisfies other constraints on stripe_len that are not relevant here). 9. decide_stripe_size_regular caps the length of the stripe it computed at 1 GiB. This cap appeared in 5da431b71d4b to correct one of the other regressions introduced in f6fca3917b4d. 10. btrfs_create_chunk creates a new chunk with the above computed size and number of devices. At step 4, gather_device_info() has found a location where stripe up to 10 GiB in length could be allocated on several devices, and selected which devices should have a dev_extent allocated on them, but at step 9, only 1 GiB of the space that was found on each device can be used. This mismatch causes new suboptimal chunk allocation cases that did not occur in pre-f6fca3917b4d kernels. Consider a filesystem using raid1 profile with 3 devices. After some balances, device 1 has 10x 1 GiB unallocated space, while devices 2 and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of space, but distributed across different numbers of dev_extent holes. For visualization, let's ignore all the chunks that were allocated before this point, and focus on the remaining holes: Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated) Device 2: [__________] (10 GiB contig unallocated) Device 3: [__________] (10 GiB contig unallocated) Before f6fca3917b4d, the allocator would fill these optimally by allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3 ([13]), or 2 and 3 ([23]): [after 0 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [__________] (10 GiB) Device 3: [__________] (10 GiB) [after 1 chunk allocation] Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] Device 2: [12] [_________] (9 GiB) Device 3: [__________] (10 GiB) [after 2 chunk allocations] Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB) Device 2: [12] [_________] (9 GiB) Device 3: [13] [_________] (9 GiB) [after 3 chunk allocations] Device 1: [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB) Device 2: [12] [12] [________] (8 GiB) Device 3: [13] [_________] (9 GiB) [...] [after 12 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB) [after 13 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB) [after 14 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB) [after 15 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full) This allocates all of the space with no waste. The sorting function used by gather_device_info considers free space holes above 1 GiB in length to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently long hole on each device, all the holes appear equal in the sort, and the comparison falls back to sorting devices by total free space. This keeps usable space on each device equal so they can all be filled completely. After f6fca3917b4d, the allocator prefers the devices with larger holes over the devices with more free space, so it makes bad allocation choices: [after 1 chunk allocation] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [_________] (9 GiB) Device 3: [23] [_________] (9 GiB) [after 2 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [________] (8 GiB) Device 3: [23] [23] [________] (8 GiB) [after 3 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [23] [_______] (7 GiB) Device 3: [23] [23] [23] [_______] (7 GiB) [...] [after 9 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) [after 10 chunk allocations] Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) [after 11 chunk allocations] Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [13] (full) No further allocations are possible, with 8 GiB wasted (4 GiB of data space). The sort in gather_device_info now considers free space in holes longer than 1 GiB to be distinct, so it will prefer devices 2 and 3 over device 1 until all but 1 GiB is allocated on devices 2 and 3. At that point, with only 1 GiB unallocated on every device, the largest hole length on each device is equal at 1 GiB, so the sort finally moves to ordering the devices with the most free space, but by this time it is too late to make use of the free space on device 1. Note that it's possible to contrive a case where the pre-f6fca3917b4d allocator fails the same way, but these cases generally have extensive dev_extent fragmentation as a precondition (e.g. many holes of 768M in length on one device, and few holes 1 GiB in length on the others). With the regression in f6fca3917b4d, bad chunk allocation can occur even under optimal conditions, when all dev_extent holes are exact multiples of stripe_len in length, as in the example above. Also note that post-f6fca3917b4d kernels do treat dev_extent holes larger than 10 GiB as equal, so the bad behavior won't show up on a freshly formatted filesystem; however, as the filesystem ages and fills up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem can show up as a failure to balance after adding or removing devices, or an unexpected shortfall in available space due to unequal allocation. To fix the regression and make data chunk allocation work again, set ctl->max_stripe_len back to the original SZ_1G, or space_info->chunk_size if that's smaller (the latter can happen if the user set space_info->chunk_size to less than 1 GiB via sysfs, or it's a 32 MiB system chunk with a hardcoded chunk_size and stripe_len). While researching the background of the earlier commits, I found that an identical fix was already proposed at: https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/ The previous review missed one detail: ctl->max_stripe_len is used before decide_stripe_size_regular() is called, when it is too late for the changes in that function to have any effect. ctl->max_stripe_len is not used directly by decide_stripe_size_regular(), but the parameter does heavily influence the per-device free space data presented to the function. Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct") CC: stable@vger.kernel.org # 6.1+ Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-15Merge tag 'ovl-fixes-6.6-rc6' of ↵Linus Torvalds2-69/+84
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs fixes from Amir Goldstein: - Various fixes for regressions due to conversion to new mount api in v6.5 - Disable a new mount option syntax (append lowerdir) that was added in v6.5 because we plan to add a different lowerdir append syntax in v6.7 * tag 'ovl-fixes-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: temporarily disable appending lowedirs ovl: fix regression in showing lowerdir mount option ovl: fix regression in parsing of mount options with escaped comma fs: factor out vfs_parse_monolithic_sep() helper
2023-10-15Merge tag '6.6-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds2-7/+11
Pull smb server fixes from Steve French: - Fix for possible double free in RPC read - Add additional check to clarify smb2_open path and quiet Coverity - Fix incorrect error rsp in a compounding path - Fix to properly fail open of file with pending delete on close * tag '6.6-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: fix potential double free on smb2_read_pipe() error path ksmbd: fix Null pointer dereferences in ksmbd_update_fstate() ksmbd: fix wrong error response status by using set_smb2_rsp_status() ksmbd: not allow to open file if delelete on close bit is set
2023-10-15Merge tag '6.6-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2-74/+69
Pull smb client fixes from Steve French: - fix caching race with open_cached_dir and laundromat cleanup of cached dirs (addresses a problem spotted with xfstest run with directory leases enabled) - reduce excessive resource usage of laundromat threads * tag '6.6-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: client: prevent new fids from being removed by laundromat smb: client: make laundromat a delayed worker
2023-10-14ovl: temporarily disable appending lowedirsAmir Goldstein1-49/+3
Kernel v6.5 converted overlayfs to new mount api. As an added bonus, it also added a feature to allow appending lowerdirs using lowerdir=:/lower2,lowerdir=::/data3 syntax. This new syntax has raised some concerns regarding escaping of colons. We decided to try and disable this syntax, which hasn't been in the wild for so long and introduce it again in 6.7 using explicit mount options lowerdir+=/lower2,datadir+=/data3. Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Link: https://lore.kernel.org/r/CAJfpegsr3A4YgF2YBevWa6n3=AcP7hNndG6EPMu3ncvV-AM71A@mail.gmail.com/ Fixes: b36a5780cb44 ("ovl: modify layer parameter parsing") Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-14Merge tag 'xfs-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds5-5/+16
Pull xfs fixes from Chandan Babu: - Fix calculation of offset of AG's last block and its length - Update incore AG block count when shrinking an AG - Process free extents to busy list in FIFO order - Make XFS report its i_version as the STATX_CHANGE_COOKIE * tag 'xfs-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: reinstate the old i_version counter as STATX_CHANGE_COOKIE xfs: Remove duplicate include xfs: correct calculation for agend and blockcount xfs: process free extents to busy list in FIFO order xfs: adjust the incore perag block_count when shrinking
2023-10-14ovl: fix regression in showing lowerdir mount optionAmir Goldstein1-15/+23
Before commit b36a5780cb44 ("ovl: modify layer parameter parsing"), spaces and commas in lowerdir mount option value used to be escaped using seq_show_option(). In current upstream, when lowerdir value has a space, it is not escaped in /proc/mounts, e.g.: none /mnt overlay rw,relatime,lowerdir=l l,upperdir=u,workdir=w 0 0 which results in broken output of the mount utility: none on /mnt type overlay (rw,relatime,lowerdir=l) Store the original lowerdir mount options before unescaping and show them using the same escaping used for seq_show_option() in addition to escaping the colon separator character. Fixes: b36a5780cb44 ("ovl: modify layer parameter parsing") Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-13Merge tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-clientLinus Torvalds3-5/+3
Pull ceph fixes from Ilya Dryomov: "Fixes for an overreaching WARN_ON, two error paths and a switch to kernel_connect() which recently grown protection against someone using BPF to rewrite the address. All but one marked for stable" * tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client: ceph: fix type promotion bug on 32bit systems libceph: use kernel_connect() ceph: remove unnecessary IS_ERR() check in ceph_fname_to_usr() ceph: fix incorrect revoked caps assert in ceph_fill_file_size()
2023-10-13audit,io_uring: io_uring openat triggers audit reference count underflowDan Clash1-4/+5
An io_uring openat operation can update an audit reference count from multiple threads resulting in the call trace below. A call to io_uring_submit() with a single openat op with a flag of IOSQE_ASYNC results in the following reference count updates. These first part of the system call performs two increments that do not race. do_syscall_64() __do_sys_io_uring_enter() io_submit_sqes() io_openat_prep() __io_openat_prep() getname() getname_flags() /* update 1 (increment) */ __audit_getname() /* update 2 (increment) */ The openat op is queued to an io_uring worker thread which starts the opportunity for a race. The system call exit performs one decrement. do_syscall_64() syscall_exit_to_user_mode() syscall_exit_to_user_mode_prepare() __audit_syscall_exit() audit_reset_context() putname() /* update 3 (decrement) */ The io_uring worker thread performs one increment and two decrements. These updates can race with the system call decrement. io_wqe_worker() io_worker_handle_work() io_wq_submit_work() io_issue_sqe() io_openat() io_openat2() do_filp_open() path_openat() __audit_inode() /* update 4 (increment) */ putname() /* update 5 (decrement) */ __audit_uring_exit() audit_reset_context() putname() /* update 6 (decrement) */ The fix is to change the refcnt member of struct audit_names from int to atomic_t. kernel BUG at fs/namei.c:262! Call Trace: ... ? putname+0x68/0x70 audit_reset_context.part.0.constprop.0+0xe1/0x300 __audit_uring_exit+0xda/0x1c0 io_issue_sqe+0x1f3/0x450 ? lock_timer_base+0x3b/0xd0 io_wq_submit_work+0x8d/0x2b0 ? __try_to_del_timer_sync+0x67/0xa0 io_worker_handle_work+0x17c/0x2b0 io_wqe_worker+0x10a/0x350 Cc: stable@vger.kernel.org Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/ Fixes: 5bd2182d58e9 ("audit,io_uring,io-wq: add some basic audit support to io_uring") Signed-off-by: Dan Clash <daclash@linux.microsoft.com> Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-13ksmbd: fix potential double free on smb2_read_pipe() error pathNamjae Jeon1-1/+1
Fix new smatch warnings: fs/smb/server/smb2pdu.c:6131 smb2_read_pipe() error: double free of 'rpc_resp' Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>