summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-12-01debugfs: fix error when writing negative value to atomic_t debugfs fileAkinobu Mita1-6/+22
The simple attribute files do not accept a negative value since the commit 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()"), so we have to use a 64-bit value to write a negative value for a debugfs file created by debugfs_create_atomic_t(). This restores the previous behaviour by introducing DEFINE_DEBUGFS_ATTRIBUTE_SIGNED for a signed value. Link: https://lkml.kernel.org/r/20220919172418.45257-4-akinobu.mita@gmail.com Fixes: 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reported-by: Zhao Gongyi <zhaogongyi@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-01libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed valueAkinobu Mita1-3/+19
Patch series "fix error when writing negative value to simple attribute files". The simple attribute files do not accept a negative value since the commit 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()"), but some attribute files want to accept a negative value. This patch (of 3): The simple attribute files do not accept a negative value since the commit 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()"), so we have to use a 64-bit value to write a negative value. This adds DEFINE_SIMPLE_ATTRIBUTE_SIGNED for a signed value. Link: https://lkml.kernel.org/r/20220919172418.45257-1-akinobu.mita@gmail.com Link: https://lkml.kernel.org/r/20220919172418.45257-2-akinobu.mita@gmail.com Fixes: 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reported-by: Zhao Gongyi <zhaogongyi@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19squashfs: fix null-ptr-deref in squashfs_fill_superBaokun Li1-1/+2
When squashfs_read_table() returns an error or `sb->s_magic != SQUASHFS_MAGIC`, enters the error branch and calls msblk->thread_ops->destroy(msblk) to destroy msblk. However, msblk->thread_ops has not been initialized. Therefore, the following problem is triggered: ================================================================== BUG: KASAN: null-ptr-deref in squashfs_fill_super+0xe7a/0x13b0 Read of size 8 at addr 0000000000000008 by task swapper/0/1 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3-next-20221031 #367 Call Trace: <TASK> dump_stack_lvl+0x73/0x9f print_report+0x743/0x759 kasan_report+0xc0/0x120 __asan_load8+0xd3/0x140 squashfs_fill_super+0xe7a/0x13b0 get_tree_bdev+0x27b/0x450 squashfs_get_tree+0x19/0x30 vfs_get_tree+0x49/0x150 path_mount+0xaae/0x1350 init_mount+0xad/0x100 do_mount_root+0xbc/0x1d0 mount_block_root+0x173/0x316 mount_root+0x223/0x236 prepare_namespace+0x1eb/0x237 kernel_init_freeable+0x528/0x576 kernel_init+0x29/0x250 ret_from_fork+0x1f/0x30 </TASK> ================================================================== To solve this issue, msblk->thread_ops is initialized immediately after msblk is assigned a value. Link: https://lkml.kernel.org/r/20221101073343.3961562-1-libaokun1@huawei.com Fixes: b0645770d3c7 ("squashfs: add the mount parameter theads=<single|multi|percpu>") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Xiaoming Ni <nixiaoming@huawei.com> Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19ocfs2: fix memory leak in ocfs2_stack_glue_init()Shang XiaoJing1-1/+7
ocfs2_table_header should be free in ocfs2_stack_glue_init() if ocfs2_sysfs_init() failed, otherwise kmemleak will report memleak. BUG: memory leak unreferenced object 0xffff88810eeb5800 (size 128): comm "modprobe", pid 4507, jiffies 4296182506 (age 55.888s) hex dump (first 32 bytes): c0 40 14 a0 ff ff ff ff 00 00 00 00 01 00 00 00 .@.............. 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<000000001e59e1cd>] __register_sysctl_table+0xca/0xef0 [<00000000c04f70f7>] 0xffffffffa0050037 [<000000001bd12912>] do_one_initcall+0xdb/0x480 [<0000000064f766c9>] do_init_module+0x1cf/0x680 [<000000002ba52db0>] load_module+0x6441/0x6f20 [<000000009772580d>] __do_sys_finit_module+0x12f/0x1c0 [<00000000380c1f22>] do_syscall_64+0x3f/0x90 [<000000004cf473bc>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Link: https://lkml.kernel.org/r/41651ca1-432a-db34-eb97-d35744559de1@linux.alibaba.com Fixes: 3878f110f71a ("ocfs2: Move the hb_ctl_path sysctl into the stack glue.") Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19squashfs: allows users to configure the number of decompression threadsXiaoming Ni4-10/+66
The maximum number of threads in the decompressor_multi.c file is fixed and cannot be adjusted according to user needs. Therefore, the mount parameter needs to be added to allow users to configure the number of threads as required. The upper limit is num_online_cpus() * 2. Link: https://lkml.kernel.org/r/20221019030930.130456-3-nixiaoming@huawei.com Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk> Cc: Jianguo Chen <chenjianguo3@huawei.com> Cc: Jubin Zhong <zhongjubin@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19squashfs: add the mount parameter theads=<single|multi|percpu>Xiaoming Ni9-32/+147
Patch series 'squashfs: Add the mount parameter "threads="'. Currently, Squashfs supports multiple decompressor parallel modes. However, this mode can be configured only during kernel building and does not support flexible selection during runtime. In the current patch set, the mount parameter "threads=" is added to allow users to select the parallel decompressor mode and configure the number of decompressors when mounting a file system. "threads=<single|multi|percpu|1|2|3|...>" The upper limit is num_online_cpus() * 2. This patch (of 2): Squashfs supports three decompression concurrency modes: Single-thread mode: concurrent reads are blocked and the memory overhead is small. Multi-thread mode/percpu mode: reduces concurrent read blocking but increases memory overhead. The corresponding schema must be fixed at compile time. During mounting, the concurrent decompression mode cannot be adjusted based on file read blocking. The mount parameter theads=<single|multi|percpu> is added to select the concurrent decompression mode of a single SquashFS file system image. Link: https://lkml.kernel.org/r/20221019030930.130456-1-nixiaoming@huawei.com Link: https://lkml.kernel.org/r/20221019030930.130456-2-nixiaoming@huawei.com Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk> Cc: Jianguo Chen <chenjianguo3@huawei.com> Cc: Jubin Zhong <zhongjubin@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19nilfs2: fix shift-out-of-bounds due to too large exponent of block sizeRyusuke Konishi1-4/+38
If field s_log_block_size of superblock data is corrupted and too large, init_nilfs() and load_nilfs() still can trigger a shift-out-of-bounds warning followed by a kernel panic (if panic_on_warn is set): shift exponent 38973 is too large for 32-bit type 'int' Call Trace: <TASK> dump_stack_lvl+0xcd/0x134 ubsan_epilogue+0xb/0x50 __ubsan_handle_shift_out_of_bounds.cold.12+0x17b/0x1f5 init_nilfs.cold.11+0x18/0x1d [nilfs2] nilfs_mount+0x9b5/0x12b0 [nilfs2] ... This fixes the issue by adding and using a new helper function for getting block size with sanity check. Link: https://lkml.kernel.org/r/20221027044306.42774-3-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19nilfs2: fix shift-out-of-bounds/overflow in nilfs_sb2_bad_offset()Ryusuke Konishi1-4/+27
Patch series "nilfs2: fix UBSAN shift-out-of-bounds warnings on mount time". The first patch fixes a bug reported by syzbot, and the second one fixes the remaining bug of the same kind. Although they are triggered by the same super block data anomaly, I divided it into the above two because the details of the issues and how to fix it are different. Both are required to eliminate the shift-out-of-bounds issues at mount time. This patch (of 2): If the block size exponent information written in an on-disk superblock is corrupted, nilfs_sb2_bad_offset helper function can trigger shift-out-of-bounds warning followed by a kernel panic (if panic_on_warn is set): shift exponent 38983 is too large for 64-bit type 'unsigned long long' Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 ubsan_epilogue lib/ubsan.c:151 [inline] __ubsan_handle_shift_out_of_bounds+0x33d/0x3b0 lib/ubsan.c:322 nilfs_sb2_bad_offset fs/nilfs2/the_nilfs.c:449 [inline] nilfs_load_super_block+0xdf5/0xe00 fs/nilfs2/the_nilfs.c:523 init_nilfs+0xb7/0x7d0 fs/nilfs2/the_nilfs.c:577 nilfs_fill_super+0xb1/0x5d0 fs/nilfs2/super.c:1047 nilfs_mount+0x613/0x9b0 fs/nilfs2/super.c:1317 ... In addition, since nilfs_sb2_bad_offset() performs multiplication without considering the upper bound, the computation may overflow if the disk layout parameters are not normal. This fixes these issues by inserting preliminary sanity checks for those parameters and by converting the comparison from one involving multiplication and left bit-shifting to one using division and right bit-shifting. Link: https://lkml.kernel.org/r/20221027044306.42774-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20221027044306.42774-2-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+e91619dd4c11c4960706@syzkaller.appspotmail.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19proc: give /proc/cmdline sizeAlexey Dobriyan1-1/+5
Most /proc files don't have length (in fstat sense). This leads to inefficiencies when reading such files with APIs commonly found in modern programming languages. They open file, then fstat descriptor, get st_size == 0 and either assume file is empty or start reading without knowing target size. cat(1) does OK because it uses large enough buffer by default. But naive programs copy-pasted from SO aren't: let mut f = std::fs::File::open("/proc/cmdline").unwrap(); let mut buf: Vec<u8> = Vec::new(); f.read_to_end(&mut buf).unwrap(); will result in openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3 statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address) statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0 lseek(3, 0, SEEK_CUR) = 0 read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32 read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32 read(3, "apper/fedora_localhost--live-roo"..., 64) = 64 read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116 read(3, "", 12) open/stat is OK, lseek looks silly but there are 3 unnecessary reads because Rust starts with 32 bytes per Vec<u8> and grows from there. In case of /proc/cmdline, the length is known precisely. Make variables readonly while I'm at it. P.S.: I tried to scp /proc/cpuinfo today and got empty file but this is separate story. Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19proc: report open files as size in stat() for /proc/pid/fdIvan Babrou1-0/+45
Many monitoring tools include open file count as a metric. Currently the only way to get this number is to enumerate the files in /proc/pid/fd. The problem with the current approach is that it does many things people generally don't care about when they need one number for a metric. In our tests for cadvisor, which reports open file counts per cgroup, we observed that reading the number of open files is slow. Out of 35.23% of CPU time spent in `proc_readfd_common`, we see 29.43% spent in `proc_fill_cache`, which is responsible for filling dentry info. Some of this extra time is spinlock contention, but it's a contention for the lock we don't want to take to begin with. We considered putting the number of open files in /proc/pid/status. Unfortunately, counting the number of fds involves iterating the open_files bitmap, which has a linear complexity in proportion with the number of open files (bitmap slots really, but it's close). We don't want to make /proc/pid/status any slower, so instead we put this info in /proc/pid/fd as a size member of the stat syscall result. Previously the reported number was zero, so there's very little risk of breaking anything, while still providing a somewhat logical way to count the open files with a fallback if it's zero. RFC for this patch included iterating open fds under RCU. Thanks to Frank Hofmann for the suggestion to use the bitmap instead. Previously: ``` $ sudo stat /proc/1/fd | head -n2 File: /proc/1/fd Size: 0 Blocks: 0 IO Block: 1024 directory ``` With this patch: ``` $ sudo stat /proc/1/fd | head -n2 File: /proc/1/fd Size: 65 Blocks: 0 IO Block: 1024 directory ``` Correctness check: ``` $ sudo ls /proc/1/fd | wc -l 65 ``` I added the docs for /proc/<pid>/fd while I'm at it. [ivan@cloudflare.com: use bitmap_weight() to count the bits] Link: https://lkml.kernel.org/r/20221018045844.37697-1-ivan@cloudflare.com [akpm@linux-foundation.org: include linux/bitmap.h for bitmap_weight()] [ivan@cloudflare.com: return errno from proc_fd_getattr() instead of setting negative size] Link: https://lkml.kernel.org/r/20221024173140.30673-1-ivan@cloudflare.com Link: https://lkml.kernel.org/r/20220922224027.59266-1-ivan@cloudflare.com Signed-off-by: Ivan Babrou <ivan@cloudflare.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name> Cc: David Hildenbrand <david@redhat.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19proc/vmcore: fix potential memory leak in vmcore_init()Jianglei Nie1-0/+1
Patch series "Some minor cleanup patches resent". The first three patches trivial clean up patches. And for the patch "kexec: replace crash_mem_range with range", I got a ibm-p9wr ppc64le system to test, it works well. This patch (of 4): elfcorehdr_alloc() allocates a memory chunk for elfcorehdr_addr with kzalloc(). If is_vmcore_usable() returns false, elfcorehdr_addr is a predefined value. If parse_crash_elf_headers() gets some error and returns a negetive value, the elfcorehdr_addr should be released with elfcorehdr_free(). Fix it by calling elfcorehdr_free() when parse_crash_elf_headers() fails. Link: https://lkml.kernel.org/r/20220929042936.22012-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20220929042936.22012-2-bhe@redhat.com Signed-off-by: Jianglei Nie <niejianglei2021@163.com> Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chen Lifu <chenlifu@huawei.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Li Chen <lchen@ambarella.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Russell King <linux@armlinux.org.uk> Cc: ye xingchen <ye.xingchen@zte.com.cn> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19ocfs2/dlm: use bitmap API instead of hand-writing itJoseph Qi4-27/+24
Use bitmap_zero/bitmap_copy/bitmap_qeual directly for bitmap operations. Link: https://lkml.kernel.org/r/20221007124846.186453-3-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19ocfs2: use bitmap API in fill_node_mapJoseph Qi6-20/+16
Pass bits directly into fill_node_map helper and use bitmap API directly to simplify code. Link: https://lkml.kernel.org/r/20221007124846.186453-2-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19ocfs2/cluster: use bitmap API instead of hand-writing itJoseph Qi2-11/+11
Use bitmap_zero/bitmap_copy/bitmap_equal directly for bitmap operations. Link: https://lkml.kernel.org/r/20221007124846.186453-1-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-19core_pattern: add CPU specifierOleksandr Natalenko1-0/+5
Statistically, in a large deployment regular segfaults may indicate a CPU issue. Currently, it is not possible to find out what CPU the segfault happened on. There are at least two attempts to improve segfault logging with this regard, but they do not help in case the logs rotate. Hence, lets make sure it is possible to permanently record a CPU the task ran on using a new core_pattern specifier. Link: https://lkml.kernel.org/r/20220903064330.20772-1-oleksandr@redhat.com Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com> Suggested-by: Renaud Métrich <rmetrich@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Grzegorz Halat <ghalat@redhat.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Joel Savitz <jsavitz@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Stephen Kitt <steve@sk2.org> Cc: Will Deacon <will@kernel.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-06Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds6-7/+21
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix a number of bugs, including some regressions, the most serious of which was one which would cause online resizes to fail with file systems with metadata checksums enabled. Also fix a warning caused by the newly added fortify string checker, plus some bugs that were found using fuzzed file systems" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix fortify warning in fs/ext4/fast_commit.c:1551 ext4: fix wrong return err in ext4_load_and_init_journal() ext4: fix warning in 'ext4_da_release_space' ext4: fix BUG_ON() when directory entry has invalid rec_len ext4: update the backup superblock's at the end of the online resize
2022-11-06Merge tag '6.1-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds6-62/+105
Pull cifs fixes from Steve French: "One symlink handling fix and two fixes foir multichannel issues with iterating channels, including for oplock breaks when leases are disabled" * tag '6.1-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix use-after-free on the link name cifs: avoid unnecessary iteration of tcp sessions cifs: always iterate smb sessions using primary channel
2022-11-06ext4: fix fortify warning in fs/ext4/fast_commit.c:1551Theodore Ts'o1-2/+3
With the new fortify string system, rework the memcpy to avoid this warning: memcpy: detected field-spanning write (size 60) of single field "&raw_inode->i_generation" at fs/ext4/fast_commit.c:1551 (size 4) Cc: stable@kernel.org Fixes: 54d9469bc515 ("fortify: Add run-time WARN for cross-field memcpy()") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06ext4: fix wrong return err in ext4_load_and_init_journal()Jason Yan1-1/+1
The return value is wrong in ext4_load_and_init_journal(). The local variable 'err' need to be initialized before goto out. The original code in __ext4_fill_super() is fine because it has two return values 'ret' and 'err' and 'ret' is initialized as -EINVAL. After we factor out ext4_load_and_init_journal(), this code is broken. So fix it by directly returning -EINVAL in the error handler path. Cc: stable@kernel.org Fixes: 9c1dd22d7422 ("ext4: factor out ext4_load_and_init_journal()") Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221025040206.3134773-1-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06ext4: fix warning in 'ext4_da_release_space'Ye Bin1-1/+2
Syzkaller report issue as follows: EXT4-fs (loop0): Free/Dirty block details EXT4-fs (loop0): free_blocks=0 EXT4-fs (loop0): dirty_blocks=0 EXT4-fs (loop0): Block reservation details EXT4-fs (loop0): i_reserved_data_blocks=0 EXT4-fs warning (device loop0): ext4_da_release_space:1527: ext4_da_release_space: ino 18, to_free 1 with only 0 reserved data blocks ------------[ cut here ]------------ WARNING: CPU: 0 PID: 92 at fs/ext4/inode.c:1528 ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1524 Modules linked in: CPU: 0 PID: 92 Comm: kworker/u4:4 Not tainted 6.0.0-syzkaller-09423-g493ffd6605b2 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022 Workqueue: writeback wb_workfn (flush-7:0) RIP: 0010:ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1528 RSP: 0018:ffffc900015f6c90 EFLAGS: 00010296 RAX: 42215896cd52ea00 RBX: 0000000000000000 RCX: 42215896cd52ea00 RDX: 0000000000000000 RSI: 0000000080000001 RDI: 0000000000000000 RBP: 1ffff1100e907d96 R08: ffffffff816aa79d R09: fffff520002bece5 R10: fffff520002bece5 R11: 1ffff920002bece4 R12: ffff888021fd2000 R13: ffff88807483ecb0 R14: 0000000000000001 R15: ffff88807483e740 FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005555569ba628 CR3: 000000000c88e000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ext4_es_remove_extent+0x1ab/0x260 fs/ext4/extents_status.c:1461 mpage_release_unused_pages+0x24d/0xef0 fs/ext4/inode.c:1589 ext4_writepages+0x12eb/0x3be0 fs/ext4/inode.c:2852 do_writepages+0x3c3/0x680 mm/page-writeback.c:2469 __writeback_single_inode+0xd1/0x670 fs/fs-writeback.c:1587 writeback_sb_inodes+0xb3b/0x18f0 fs/fs-writeback.c:1870 wb_writeback+0x41f/0x7b0 fs/fs-writeback.c:2044 wb_do_writeback fs/fs-writeback.c:2187 [inline] wb_workfn+0x3cb/0xef0 fs/fs-writeback.c:2227 process_one_work+0x877/0xdb0 kernel/workqueue.c:2289 worker_thread+0xb14/0x1330 kernel/workqueue.c:2436 kthread+0x266/0x300 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> Above issue may happens as follows: ext4_da_write_begin ext4_create_inline_data ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS); ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA); __ext4_ioctl ext4_ext_migrate -> will lead to eh->eh_entries not zero, and set extent flag ext4_da_write_begin ext4_da_convert_inline_data_to_extent ext4_da_write_inline_data_begin ext4_da_map_blocks ext4_insert_delayed_block if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk)) if (!ext4_es_scan_clu(inode, &ext4_es_is_mapped, lblk)) ext4_clu_mapped(inode, EXT4_B2C(sbi, lblk)); -> will return 1 allocated = true; ext4_es_insert_delayed_block(inode, lblk, allocated); ext4_writepages mpage_map_and_submit_extent(handle, &mpd, &give_up_on_write); -> return -ENOSPC mpage_release_unused_pages(&mpd, give_up_on_write); -> give_up_on_write == 1 ext4_es_remove_extent ext4_da_release_space(inode, reserved); if (unlikely(to_free > ei->i_reserved_data_blocks)) -> to_free == 1 but ei->i_reserved_data_blocks == 0 -> then trigger warning as above To solve above issue, forbid inode do migrate which has inline data. Cc: stable@kernel.org Reported-by: syzbot+c740bb18df70ad00952e@syzkaller.appspotmail.com Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221018022701.683489-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06ext4: fix BUG_ON() when directory entry has invalid rec_lenLuís Henriques1-1/+9
The rec_len field in the directory entry has to be a multiple of 4. A corrupted filesystem image can be used to hit a BUG() in ext4_rec_len_to_disk(), called from make_indexed_dir(). ------------[ cut here ]------------ kernel BUG at fs/ext4/ext4.h:2413! ... RIP: 0010:make_indexed_dir+0x53f/0x5f0 ... Call Trace: <TASK> ? add_dirent_to_buf+0x1b2/0x200 ext4_add_entry+0x36e/0x480 ext4_add_nondir+0x2b/0xc0 ext4_create+0x163/0x200 path_openat+0x635/0xe90 do_filp_open+0xb4/0x160 ? __create_object.isra.0+0x1de/0x3b0 ? _raw_spin_unlock+0x12/0x30 do_sys_openat2+0x91/0x150 __x64_sys_open+0x6c/0xa0 do_syscall_64+0x3c/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 The fix simply adds a call to ext4_check_dir_entry() to validate the directory entry, returning -EFSCORRUPTED if the entry is invalid. CC: stable@kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=216540 Signed-off-by: Luís Henriques <lhenriques@suse.de> Link: https://lore.kernel.org/r/20221012131330.32456-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-05cifs: fix use-after-free on the link nameChenXiaoSong2-6/+25
xfstests generic/011 reported use-after-free bug as follows: BUG: KASAN: use-after-free in __d_alloc+0x269/0x859 Read of size 15 at addr ffff8880078933a0 by task dirstress/952 CPU: 1 PID: 952 Comm: dirstress Not tainted 6.1.0-rc3+ #77 Call Trace: __dump_stack+0x23/0x29 dump_stack_lvl+0x51/0x73 print_address_description+0x67/0x27f print_report+0x3e/0x5c kasan_report+0x7b/0xa8 kasan_check_range+0x1b2/0x1c1 memcpy+0x22/0x5d __d_alloc+0x269/0x859 d_alloc+0x45/0x20c d_alloc_parallel+0xb2/0x8b2 lookup_open+0x3b8/0x9f9 open_last_lookups+0x63d/0xc26 path_openat+0x11a/0x261 do_filp_open+0xcc/0x168 do_sys_openat2+0x13b/0x3f7 do_sys_open+0x10f/0x146 __se_sys_creat+0x27/0x2e __x64_sys_creat+0x55/0x6a do_syscall_64+0x40/0x96 entry_SYSCALL_64_after_hwframe+0x63/0xcd Allocated by task 952: kasan_save_stack+0x1f/0x42 kasan_set_track+0x21/0x2a kasan_save_alloc_info+0x17/0x1d __kasan_kmalloc+0x7e/0x87 __kmalloc_node_track_caller+0x59/0x155 kstrndup+0x60/0xe6 parse_mf_symlink+0x215/0x30b check_mf_symlink+0x260/0x36a cifs_get_inode_info+0x14e1/0x1690 cifs_revalidate_dentry_attr+0x70d/0x964 cifs_revalidate_dentry+0x36/0x62 cifs_d_revalidate+0x162/0x446 lookup_open+0x36f/0x9f9 open_last_lookups+0x63d/0xc26 path_openat+0x11a/0x261 do_filp_open+0xcc/0x168 do_sys_openat2+0x13b/0x3f7 do_sys_open+0x10f/0x146 __se_sys_creat+0x27/0x2e __x64_sys_creat+0x55/0x6a do_syscall_64+0x40/0x96 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 950: kasan_save_stack+0x1f/0x42 kasan_set_track+0x21/0x2a kasan_save_free_info+0x1c/0x34 ____kasan_slab_free+0x1c1/0x1d5 __kasan_slab_free+0xe/0x13 __kmem_cache_free+0x29a/0x387 kfree+0xd3/0x10e cifs_fattr_to_inode+0xb6a/0xc8c cifs_get_inode_info+0x3cb/0x1690 cifs_revalidate_dentry_attr+0x70d/0x964 cifs_revalidate_dentry+0x36/0x62 cifs_d_revalidate+0x162/0x446 lookup_open+0x36f/0x9f9 open_last_lookups+0x63d/0xc26 path_openat+0x11a/0x261 do_filp_open+0xcc/0x168 do_sys_openat2+0x13b/0x3f7 do_sys_open+0x10f/0x146 __se_sys_creat+0x27/0x2e __x64_sys_creat+0x55/0x6a do_syscall_64+0x40/0x96 entry_SYSCALL_64_after_hwframe+0x63/0xcd When opened a symlink, link name is from 'inode->i_link', but it may be reset to a new value when revalidate the dentry. If some processes get the link name on the race scenario, then UAF will happen on link name. Fix this by implementing 'get_link' interface to duplicate the link name. Fixes: 76894f3e2f71 ("cifs: improve symlink handling for smb2+") Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-11-05cifs: avoid unnecessary iteration of tcp sessionsShyam Prasad N3-51/+55
In a few places, we do unnecessary iterations of tcp sessions, even when the server struct is provided. The change avoids it and uses the server struct provided. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-11-05cifs: always iterate smb sessions using primary channelShyam Prasad N4-5/+25
smb sessions and tcons currently hang off primary channel only. Secondary channels have the lists as empty. Whenever there's a need to iterate sessions or tcons, we should use the list in the corresponding primary channel. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-11-05Merge tag 'xfs-6.1-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds29-386/+670
Pull xfs fixes from Darrick Wong: "Dave and I had thought that this would be a very quiet cycle, but we thought wrong. At first there were the usual trickle of minor bugfixes, but then Zorro pulled -rc1 and noticed complaints about the stronger memcpy checks w.r.t. flex arrays. Analyzing how to fix that revealed a bunch of validation gaps in validating ondisk log items during recovery, and then a customer hit an infinite loop in the refcounting code on a corrupt filesystem. So. This largeish batch of fixes addresses all those problems, I hope. Summary: - Fix a UAF bug during log recovery - Fix memory leaks when mount fails - Detect corrupt bestfree information in a directory block - Fix incorrect return value type for the dax page fault handlers - Fix fortify complaints about memcpy of xfs log item objects - Strengthen inadequate validation of recovered log items - Fix incorrectly declared flex array in EFI log item structs - Log corrupt log items for debugging purposes - Fix infinite loop problems in the refcount code if the refcount btree node block keys are corrupt - Fix infinite loop problems in the refcount code if the refcount btree records suffer MSB bitflips - Add more sanity checking to continued defer ops to prevent overflows from one AG to the next or off EOFS" * tag 'xfs-6.1-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits) xfs: rename XFS_REFC_COW_START to _COWFLAG xfs: fix uninitialized list head in struct xfs_refcount_recovery xfs: fix agblocks check in the cow leftover recovery function xfs: check record domain when accessing refcount records xfs: remove XFS_FIND_RCEXT_SHARED and _COW xfs: refactor domain and refcount checking xfs: report refcount domain in tracepoints xfs: track cow/shared record domains explicitly in xfs_refcount_irec xfs: refactor refcount record usage in xchk_refcountbt_rec xfs: dump corrupt recovered log intent items to dmesg consistently xfs: move _irec structs to xfs_types.h xfs: actually abort log recovery on corrupt intent-done log items xfs: check deferred refcount op continuation parameters xfs: refactor all the EFI/EFD log item sizeof logic xfs: create a predicate to verify per-AG extents xfs: fix memcpy fortify errors in EFI log format copying xfs: make sure aglen never goes negative in xfs_refcount_adjust_extents xfs: fix memcpy fortify errors in RUI log format copying xfs: fix memcpy fortify errors in CUI log format copying xfs: fix memcpy fortify errors in BUI log format copying ...
2022-11-03Merge tag 'fuse-fixes-6.1-rc4' of ↵Linus Torvalds2-1/+13
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Fix two rarely triggered but long-standing issues" * tag 'fuse-fixes-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: add file_modified() to fallocate fuse: fix readdir cache race
2022-11-03Merge tag 'for-6.1-rc3-tag' of ↵Linus Torvalds5-49/+91
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A batch of error handling fixes for resource leaks, fixes for nowait mode in combination with direct and buffered IO: - direct IO + dsync + nowait could miss a sync of the file after write, add handling for this combination - buffered IO + nowait should not fail with ENOSPC, only blocking IO could determine that - error handling fixes: - fix inode reserve space leak due to nowait buffered write - check the correct variable after allocation (direct IO submit) - fix inode list leak during backref walking - fix ulist freeing in self tests" * tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix inode reserve space leak due to nowait buffered write btrfs: fix nowait buffered write returning -ENOSPC btrfs: remove pointless and double ulist frees in error paths of qgroup tests btrfs: fix ulist leaks in error paths of qgroup self tests btrfs: fix inode list leak during backref walking at find_parent_nodes() btrfs: fix inode list leak during backref walking at resolve_indirect_refs() btrfs: fix lost file sync on direct IO write with nowait and dsync iocb btrfs: fix a memory allocation failure test in btrfs_submit_direct
2022-11-02Merge tag 'nfs-for-6.1-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds18-75/+80
Pull NFS client bugfixes from Anna Schumaker: - Fix some coccicheck warnings - Avoid memcpy() run-time warning - Fix up various state reclaim / RECLAIM_COMPLETE errors - Fix a null pointer dereference in sysfs - Fix LOCK races - Fix gss_unwrap_resp_integ() crasher - Fix zero length clones - Fix memleak when allocate slot fails * tag 'nfs-for-6.1-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: nfs4: Fix kmemleak when allocate slot failed NFSv4.2: Fixup CLONE dest file size for zero-length count SUNRPC: Fix crasher in gss_unwrap_resp_integ() NFSv4: Retry LOCK on OLD_STATEID during delegation return SUNRPC: Fix null-ptr-deref when xps sysfs alloc failed NFSv4.1: We must always send RECLAIM_COMPLETE after a reboot NFSv4.1: Handle RECLAIM_COMPLETE trunking errors NFSv4: Fix a potential state reclaim deadlock NFS: Avoid memcpy() run-time warning for struct sockaddr overflows nfs: Remove redundant null checks before kfree
2022-11-02btrfs: fix inode reserve space leak due to nowait buffered writeFilipe Manana1-1/+3
During a nowait buffered write, if we fail to balance dirty pages we exit btrfs_buffered_write() without releasing the delalloc space reserved for an extent, resulting in leaking space from the inode's block reserve. So fix that by releasing the delalloc space for the extent when balancing dirty pages fails. Reported-by: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/all/202210111304.d369bc32-yujie.liu@intel.com Fixes: 965f47aeb5de ("btrfs: make btrfs_buffered_write nowait compatible") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: fix nowait buffered write returning -ENOSPCFilipe Manana1-0/+3
If we are doing a buffered write in NOWAIT context and we can't reserve metadata space due to -ENOSPC, then we should return -EAGAIN so that we retry the write in a context allowed to block and do metadata reservation with flushing, which might succeed this time due to the allowed flushing. Returning -ENOSPC while in NOWAIT context simply makes some writes fail with -ENOSPC when they would likely succeed after switching from NOWAIT context to blocking context. That is unexpected behaviour and even fio complains about it with a warning like this: fio: io_u error on file /mnt/sdi/task_0.0.0: No space left on device: write offset=1535705088, buflen=65536 fio: pid=592630, err=28/file:io_u.c:1846, func=io_u error, error=No space left on device The fio's job config is this: [global] bs=64K ioengine=io_uring iodepth=1 size=2236962133 nr_files=1 filesize=2236962133 direct=0 runtime=10 fallocate=posix io_size=2236962133 group_reporting time_based [task_0] rw=randwrite directory=/mnt/sdi numjobs=4 So fix this by returning -EAGAIN if we are in NOWAIT context and the metadata reservation failed with -ENOSPC. Fixes: 304e45acdb8f ("btrfs: plumb NOWAIT through the write path") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: remove pointless and double ulist frees in error paths of qgroup testsFilipe Manana1-12/+4
Several places in the qgroup self tests follow the pattern of freeing the ulist pointer they passed to btrfs_find_all_roots() if the call to that function returned an error. That is pointless because that function always frees the ulist in case it returns an error. Also In some places like at test_multiple_refs(), after a call to btrfs_qgroup_account_extent() we also leave "old_roots" and "new_roots" pointing to ulists that were freed, because btrfs_qgroup_account_extent() has freed those ulists, and if after that the next call to btrfs_find_all_roots() fails, we call ulist_free() on the "old_roots" ulist again, resulting in a double free. So remove those calls to reduce the code size and avoid double ulist free in case of an error. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: fix ulist leaks in error paths of qgroup self testsFilipe Manana1-5/+15
In the test_no_shared_qgroup() and test_multiple_refs() qgroup self tests, if we fail to add the tree ref, remove the extent item or remove the extent ref, we are returning from the test function without freeing the "old_roots" ulist that was allocated by the previous calls to btrfs_find_all_roots(). Fix that by calling ulist_free() before returning. Fixes: 442244c96332 ("btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: fix inode list leak during backref walking at find_parent_nodes()Filipe Manana1-1/+17
During backref walking, at find_parent_nodes(), if we are dealing with a data extent and we get an error while resolving the indirect backrefs, at resolve_indirect_refs(), or in the while loop that iterates over the refs in the direct refs rbtree, we end up leaking the inode lists attached to the direct refs we have in the direct refs rbtree that were not yet added to the refs ulist passed as argument to find_parent_nodes(). Since they were not yet added to the refs ulist and prelim_release() does not free the lists, on error the caller can only free the lists attached to the refs that were added to the refs ulist, all the remaining refs get their inode lists never freed, therefore leaking their memory. Fix this by having prelim_release() always free any attached inode list to each ref found in the rbtree, and have find_parent_nodes() set the ref's inode list to NULL once it transfers ownership of the inode list to a ref added to the refs ulist passed to find_parent_nodes(). Fixes: 86d5f9944252 ("btrfs: convert prelimary reference tracking to use rbtrees") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: fix inode list leak during backref walking at resolve_indirect_refs()Filipe Manana1-19/+17
During backref walking, at resolve_indirect_refs(), if we get an error we jump to the 'out' label and call ulist_free() on the 'parents' ulist, which frees all the elements in the ulist - however that does not free any inode lists that may be attached to elements, through the 'aux' field of a ulist node, so we end up leaking lists if we have any attached to the unodes. Fix this by calling free_leaf_list() instead of ulist_free() when we exit from resolve_indirect_refs(). The static function free_leaf_list() is moved up for this to be possible and it's slightly simplified by removing unnecessary code. Fixes: 3301958b7c1d ("Btrfs: add inodes before dropping the extent lock in find_all_leafs") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02Merge tag 'nfsd-6.1-3' of ↵Linus Torvalds1-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - Fix a loop that occurs when using multiple net namespaces * tag 'nfsd-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: fix net-namespace logic in __nfsd_file_cache_purge
2022-11-02nfsd: fix net-namespace logic in __nfsd_file_cache_purgeJeff Layton1-3/+2
If the namespace doesn't match the one in "net", then we'll continue, but that doesn't cause another rhashtable_walk_next call, so it will loop infinitely. Fixes: ce502f81ba88 ("NFSD: Convert the filecache to use rhashtable") Reported-by: Petr Vorel <pvorel@suse.cz> Link: https://lore.kernel.org/ltp/Y1%2FP8gDAcWC%2F+VR3@pevik/ Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-10-31Merge tag 'for-6.1-rc3-tag' of ↵Linus Torvalds10-40/+73
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes and regression fixes: - fix a corner case when handling tree-mod-log chagnes in reallocated notes - fix crash on raid0 filesystems created with <5.4 mkfs.btrfs that could lead to division by zero - add missing super block checksum verification after thawing filesystem - handle one more case in send when dealing with orphan files - fix parameter type mismatch for generation when reading dentry - improved error handling in raid56 code - better struct bio packing after recent cleanups" * tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: don't use btrfs_chunk::sub_stripes from disk btrfs: fix type of parameter generation in btrfs_get_dentry btrfs: send: fix send failure of a subcase of orphan inodes btrfs: make thaw time super block check to also verify checksum btrfs: fix tree mod log mishandling of reallocated nodes btrfs: reorder btrfs_bio for better packing btrfs: raid56: avoid double freeing for rbio if full_stripe_write() failed btrfs: raid56: properly handle the error when unable to find the missing stripe
2022-10-31xfs: rename XFS_REFC_COW_START to _COWFLAGDarrick J. Wong3-6/+6
We've been (ab)using XFS_REFC_COW_START as both an integer quantity and a bit flag, even though it's *only* a bit flag. Rename the variable to reflect its nature and update the cast target since we're not supposed to be comparing it to xfs_agblock_t now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: fix uninitialized list head in struct xfs_refcount_recoveryDarrick J. Wong1-4/+6
We're supposed to initialize the list head of an object before adding it to another list. Fix that, and stop using the kmem_{alloc,free} calls from the Irix days. Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: fix agblocks check in the cow leftover recovery functionDarrick J. Wong1-1/+3
As we've seen, refcount records use the upper bit of the rc_startblock field to ensure that all the refcount records are at the right side of the refcount btree. This works because an AG is never allowed to have more than (1U << 31) blocks in it. If we ever encounter a filesystem claiming to have that many blocks, we absolutely do not want reflink touching it at all. However, this test at the start of xfs_refcount_recover_cow_leftovers is slightly incorrect -- it /should/ be checking that agblocks isn't larger than the XFS_MAX_CRC_AG_BLOCKS constant, and it should check that the constant is never large enough to conflict with that CoW flag. Note that the V5 superblock verifier has not historically rejected filesystems where agblocks >= XFS_MAX_CRC_AG_BLOCKS, which is why this ended up in the COW recovery routine. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: check record domain when accessing refcount recordsDarrick J. Wong2-14/+43
Now that we've separated the startblock and CoW/shared extent domain in the incore refcount record structure, check the domain whenever we retrieve a record to ensure that it's still in the domain that we want. Depending on the circumstances, a change in domain either means we're done processing or that we've found a corruption and need to fail out. The refcount check in xchk_xref_is_cow_staging is redundant since _get_rec has done that for a long time now, so we can get rid of it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: remove XFS_FIND_RCEXT_SHARED and _COWDarrick J. Wong1-31/+17
Now that we have an explicit enum for shared and CoW staging extents, we can get rid of the old FIND_RCEXT flags. Omit a couple of conversions that disappear in the next patches. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: refactor domain and refcount checkingDarrick J. Wong3-10/+17
Create a helper function to ensure that CoW staging extent records have a single refcount and that shared extent records have more than 1 refcount. We'll put this to more use in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: report refcount domain in tracepointsDarrick J. Wong2-9/+43
Now that we've broken out the startblock and shared/cow domain in the incore refcount extent record structure, update the tracepoints to report the domain. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: track cow/shared record domains explicitly in xfs_refcount_irecDarrick J. Wong5-67/+151
Just prior to committing the reflink code into upstream, the xfs maintainer at the time requested that I find a way to shard the refcount records into two domains -- one for records tracking shared extents, and a second for tracking CoW staging extents. The idea here was to minimize mount time CoW reclamation by pushing all the CoW records to the right edge of the keyspace, and it was accomplished by setting the upper bit in rc_startblock. We don't allow AGs to have more than 2^31 blocks, so the bit was free. Unfortunately, this was a very late addition to the codebase, so most of the refcount record processing code still treats rc_startblock as a u32 and pays no attention to whether or not the upper bit (the cow flag) is set. This is a weakness is theoretically exploitable, since we're not fully validating the incoming metadata records. Fuzzing demonstrates practical exploits of this weakness. If the cow flag of a node block key record is corrupted, a lookup operation can go to the wrong record block and start returning records from the wrong cow/shared domain. This causes the math to go all wrong (since cow domain is still implicit in the upper bit of rc_startblock) and we can crash the kernel by tricking xfs into jumping into a nonexistent AG and tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL. To fix this, start tracking the domain as an explicit part of struct xfs_refcount_irec, adjust all refcount functions to check the domain of a returned record, and alter the function definitions to accept them where necessary. Found by fuzzing keys[2].cowflag = add in xfs/464. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: refactor refcount record usage in xchk_refcountbt_recDarrick J. Wong1-30/+24
Consolidate the open-coded xfs_refcount_irec fields into an actual struct and use the existing _btrec_to_irec to decode the ondisk record. This will reduce code churn in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: move _irec structs to xfs_types.hDarrick J. Wong2-20/+20
Structure definitions for incore objects do not belong in the ondisk format header. Move them to the incore types header where they belong. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: check deferred refcount op continuation parametersDarrick J. Wong1-2/+36
If we're in the middle of a deferred refcount operation and decide to roll the transaction to avoid overflowing the transaction space, we need to check the new agbno/aglen parameters that we're about to record in the new intent. Specifically, we need to check that the new extent is completely within the filesystem, and that continuation does not put us into a different AG. If the keys of a node block are wrong, the lookup to resume an xfs_refcount_adjust_extents operation can put us into the wrong record block. If this happens, we might not find that we run out of aglen at an exact record boundary, which will cause the loop control to do the wrong thing. The previous patch should take care of that problem, but let's add this extra sanity check to stop corruption problems sooner than later. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: create a predicate to verify per-AG extentsDarrick J. Wong7-26/+24
Create a predicate function to verify that a given agbno/blockcount pair fit entirely within a single allocation group and don't suffer mathematical overflows. Refactor the existng open-coded logic; we're going to add more calls to this function in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: make sure aglen never goes negative in xfs_refcount_adjust_extentsDarrick J. Wong1-3/+17
Prior to calling xfs_refcount_adjust_extents, we trimmed agbno/aglen such that the end of the range would not be in the middle of a refcount record. If this is no longer the case, something is seriously wrong with the btree. Bail out with a corruption error. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>