summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-05-29ext2: use offset_in_page() instead of open-coding it as subtractionAl Viro1-8/+6
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Tested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2023-05-29ext2_rename(): set_link and delete_entry may failAl Viro2-25/+16
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Tested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2023-05-16ext2: Add direct-io trace pointsRitesh Harjani (IBM)4-2/+113
This patch adds the trace point to ext2 direct-io apis in fs/ext2/file.c Here is how the output looks like a.out-467865 [006] 6758.170968: ext2_dio_write_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT|WRITE aio 1 ret 0 a.out-467865 [006] 6758.171061: ext2_dio_write_end: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 0 flags DIRECT|WRITE aio 1 ret -529 kworker/3:153-444162 [003] 6758.171252: ext2_dio_write_endio: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT|WRITE aio 1 ret 0 a.out-468222 [001] 6761.628924: ext2_dio_read_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 1 ret 0 a.out-468222 [001] 6761.629063: ext2_dio_read_end: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 0 flags DIRECT aio 1 ret -529 a.out-468428 [005] 6763.937454: ext2_dio_write_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0 a.out-468428 [005] 6763.937829: ext2_dio_write_endio: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0 a.out-468428 [005] 6763.937847: ext2_dio_write_end: dev 7:12 ino 0xe isize 0x1000 pos 0x1000 len 0 flags DIRECT aio 0 ret 4096 a.out-468609 [000] 6765.702878: ext2_dio_read_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0 a.out-468609 [000] 6765.703243: ext2_dio_read_end: dev 7:12 ino 0xe isize 0x1000 pos 0x1000 len 0 flags DIRECT aio 0 ret 4096 Reported-and-tested-by: Disha Goel <disgoel@linux.ibm.com> [Need to add CFLAGS_trace for fixing unable to find trace file problem] Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <b8b0897fa2b273a448d7b4ba7317357ac73c08bc.1682069716.git.ritesh.list@gmail.com>
2023-05-16ext2: Move direct-io to use iomapRitesh Harjani (IBM)3-19/+150
This patch converts ext2 direct-io path to iomap interface. - This also takes care of DIO_SKIP_HOLES part in which we return -ENOTBLK from ext2_iomap_begin(), in case if the write is done on a hole. - This fallbacks to buffered-io in case of DIO_SKIP_HOLES or in case of a partial write or if any error is detected in ext2_iomap_end(). We try to return -ENOTBLK in such cases. - For any unaligned or extending DIO writes, we pass IOMAP_DIO_FORCE_WAIT flag to ensure synchronous writes. - For extending writes we set IOMAP_F_DIRTY in ext2_iomap_begin because otherwise with dsync writes on devices that support FUA, generic_write_sync won't be called and we might miss inode metadata updates. - Since ext2 already now uses _nolock vartiant of sync write. Hence there is no inode lock problem with iomap in this patch. - ext2_iomap_ops are now being shared by DIO, DAX & fiemap path Tested-by: Disha Goel <disgoel@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <610b672a52f2a7ff6dc550fd14d0f995806232a5.1682069716.git.ritesh.list@gmail.com>
2023-05-16ext2: Use generic_buffers_fsync() implementationRitesh Harjani (IBM)1-1/+2
Next patch converts ext2 to use iomap interface for DIO. iomap layer can call generic_write_sync() -> ext2_fsync() from iomap_dio_complete while still holding the inode_lock(). Now writeback from other paths doesn't need inode_lock(). It seems there is also no need of an inode_lock() for sync_mapping_buffers(). It uses it's own mapping->private_lock for it's buffer list handling. Hence this patch is in preparation to move ext2 to iomap. This uses generic_buffers_fsync() which does not take any inode_lock() in ext2_fsync(). Tested-by: Disha Goel <disgoel@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <76d206a464574ff91db25bc9e43479b51ca7e307.1682069716.git.ritesh.list@gmail.com>
2023-05-16ext4: Use generic_buffers_fsync_noflush() implementationRitesh Harjani (IBM)1-17/+16
ext4 when got converted to iomap for dio, it copied __generic_file_fsync implementation to avoid taking inode_lock in order to avoid any deadlock (since iomap takes an inode_lock while calling generic_write_sync()). The previous patch already added generic_buffers_fsync*() which does not take any inode_lock(). Hence kill the redundant code and use generic_buffers_fsync_noflush() function instead. Tested-by: Disha Goel <disgoel@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <b43d4bb4403061ed86510c9587673e30a461ba14.1682069716.git.ritesh.list@gmail.com>
2023-05-16fs/buffer.c: Add generic_buffers_fsync*() implementationRitesh Harjani (IBM)1-0/+70
Some of the higher layers like iomap takes inode_lock() when calling generic_write_sync(). Also writeback already happens from other paths without inode lock, so it's difficult to say that we really need sync_mapping_buffers() to take any inode locking here. Having said that, let's add generic_buffers_fsync/_noflush() implementation in buffer.c with no inode_lock/unlock() for now so that filesystems like ext2 and ext4's nojournal mode can use it. Ext4 when got converted to iomap for direct-io already copied it's own variant of __generic_file_fsync() without lock. This patch adds generic_buffers_fsync() & generic_buffers_fsync_noflush() implementations for use in filesystems like ext2 & ext4 respectively. Later we can review other filesystems as well to see if we can make generic_buffers_fsync/_noflush() which does not take any inode_lock() as the default path. Tested-by: Disha Goel <disgoel@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <d573408ac8408627d23a3d2d166e748c172c4c9e.1682069716.git.ritesh.list@gmail.com>
2023-05-16ext2/dax: Fix ext2_setsize when len is page alignedRitesh Harjani (IBM)1-3/+2
PAGE_ALIGN(x) macro gives the next highest value which is multiple of pagesize. But if x is already page aligned then it simply returns x. So, if x passed is 0 in dax_zero_range() function, that means the length gets passed as 0 to ->iomap_begin(). In ext2 it then calls ext2_get_blocks -> max_blocks as 0 and hits bug_on here in ext2_get_blocks(). BUG_ON(maxblocks == 0); Instead we should be calling dax_truncate_page() here which takes care of it. i.e. it only calls dax_zero_range if the offset is not page/block aligned. This can be easily triggered with following on fsdax mounted pmem device. dd if=/dev/zero of=file count=1 bs=512 truncate -s 0 file [79.525838] EXT2-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk [79.529376] ext2 filesystem being mounted at /mnt1/test supports timestamps until 2038 (0x7fffffff) [93.793207] ------------[ cut here ]------------ [93.795102] kernel BUG at fs/ext2/inode.c:637! [93.796904] invalid opcode: 0000 [#1] PREEMPT SMP PTI [93.798659] CPU: 0 PID: 1192 Comm: truncate Not tainted 6.3.0-rc2-xfstests-00056-g131086faa369 #139 [93.806459] RIP: 0010:ext2_get_blocks.constprop.0+0x524/0x610 <...> [93.835298] Call Trace: [93.836253] <TASK> [93.837103] ? lock_acquire+0xf8/0x110 [93.838479] ? d_lookup+0x69/0xd0 [93.839779] ext2_iomap_begin+0xa7/0x1c0 [93.841154] iomap_iter+0xc7/0x150 [93.842425] dax_zero_range+0x6e/0xa0 [93.843813] ext2_setsize+0x176/0x1b0 [93.845164] ext2_setattr+0x151/0x200 [93.846467] notify_change+0x341/0x4e0 [93.847805] ? lock_acquire+0xf8/0x110 [93.849143] ? do_truncate+0x74/0xe0 [93.850452] ? do_truncate+0x84/0xe0 [93.851739] do_truncate+0x84/0xe0 [93.852974] do_sys_ftruncate+0x2b4/0x2f0 [93.854404] do_syscall_64+0x3f/0x90 [93.855789] entry_SYSCALL_64_after_hwframe+0x72/0xdc CC: stable@vger.kernel.org Fixes: 2aa3048e03d3 ("iomap: switch iomap_zero_range to use iomap_iter") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <046a58317f29d9603d1068b2bbae47c2332c17ae.1682069716.git.ritesh.list@gmail.com>
2023-05-14Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds13-104/+269
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Some ext4 bug fixes (mostly to address Syzbot reports)" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: bail out of ext4_xattr_ibody_get() fails for any reason ext4: add bounds checking in get_max_inline_xattr_value_size() ext4: add indication of ro vs r/w mounts in the mount message ext4: fix deadlock when converting an inline directory in nojournal mode ext4: improve error recovery code paths in __ext4_remount() ext4: improve error handling from ext4_dirhash() ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled ext4: check iomap type only if ext4_iomap_begin() does not fail ext4: avoid a potential slab-out-of-bounds in ext4_group_desc_csum ext4: fix data races when using cached status extents ext4: avoid deadlock in fs reclaim with page writeback ext4: fix invalid free tracking in ext4_xattr_move_to_block() ext4: remove a BUG_ON in ext4_mb_release_group_pa() ext4: allow ext4_get_group_info() to fail ext4: fix lockdep warning when enabling MMP ext4: fix WARNING in mb_find_extent
2023-05-14ext4: bail out of ext4_xattr_ibody_get() fails for any reasonTheodore Ts'o1-1/+1
In ext4_update_inline_data(), if ext4_xattr_ibody_get() fails for any reason, it's best if we just fail as opposed to stumbling on, especially if the failure is EFSCORRUPTED. Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: add bounds checking in get_max_inline_xattr_value_size()Theodore Ts'o1-1/+11
Normally the extended attributes in the inode body would have been checked when the inode is first opened, but if someone is writing to the block device while the file system is mounted, it's possible for the inode table to get corrupted. Add bounds checking to avoid reading beyond the end of allocated memory if this happens. Reported-by: syzbot+1966db24521e5f6e23f7@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=1966db24521e5f6e23f7 Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: add indication of ro vs r/w mounts in the mount messageTheodore Ts'o1-4/+6
Whether the file system is mounted read-only or read/write is more important than the quota mode, which we are already printing. Add the ro vs r/w indication since this can be helpful in debugging problems from the console log. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: fix deadlock when converting an inline directory in nojournal modeTheodore Ts'o1-1/+2
In no journal mode, ext4_finish_convert_inline_dir() can self-deadlock by calling ext4_handle_dirty_dirblock() when it already has taken the directory lock. There is a similar self-deadlock in ext4_incvert_inline_data_nolock() for data files which we'll fix at the same time. A simple reproducer demonstrating the problem: mke2fs -Fq -t ext2 -O inline_data -b 4k /dev/vdc 64 mount -t ext4 -o dirsync /dev/vdc /vdc cd /vdc mkdir file0 cd file0 touch file0 touch file1 attr -s BurnSpaceInEA -V abcde . touch supercalifragilisticexpialidocious Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230507021608.1290720-1-tytso@mit.edu Reported-by: syzbot+91dccab7c64e2850a4e5@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=ba84cc80a9491d65416bc7877e1650c87530fe8a Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: improve error recovery code paths in __ext4_remount()Theodore Ts'o1-3/+10
If there are failures while changing the mount options in __ext4_remount(), we need to restore the old mount options. This commit fixes two problem. The first is there is a chance that we will free the old quota file names before a potential failure leading to a use-after-free. The second problem addressed in this commit is if there is a failed read/write to read-only transition, if the quota has already been suspended, we need to renable quota handling. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230506142419.984260-2-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: improve error handling from ext4_dirhash()Theodore Ts'o2-17/+42
The ext4_dirhash() will *almost* never fail, especially when the hash tree feature was first introduced. However, with the addition of support of encrypted, casefolded file names, that function can most certainly fail today. So make sure the callers of ext4_dirhash() properly check for failures, and reflect the errors back up to their callers. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230506142419.984260-1-tytso@mit.edu Reported-by: syzbot+394aa8a792cb99dbc837@syzkaller.appspotmail.com Reported-by: syzbot+344aaa8697ebd232bfc8@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=db56459ea4ac4a676ae4b4678f633e55da005a9b Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabledTheodore Ts'o1-1/+5
When a file system currently mounted read/only is remounted read/write, if we clear the SB_RDONLY flag too early, before the quota is initialized, and there is another process/thread constantly attempting to create a directory, it's possible to trigger the WARN_ON_ONCE(dquot_initialize_needed(inode)); in ext4_xattr_block_set(), with the following stack trace: WARNING: CPU: 0 PID: 5338 at fs/ext4/xattr.c:2141 ext4_xattr_block_set+0x2ef2/0x3680 RIP: 0010:ext4_xattr_block_set+0x2ef2/0x3680 fs/ext4/xattr.c:2141 Call Trace: ext4_xattr_set_handle+0xcd4/0x15c0 fs/ext4/xattr.c:2458 ext4_initxattrs+0xa3/0x110 fs/ext4/xattr_security.c:44 security_inode_init_security+0x2df/0x3f0 security/security.c:1147 __ext4_new_inode+0x347e/0x43d0 fs/ext4/ialloc.c:1324 ext4_mkdir+0x425/0xce0 fs/ext4/namei.c:2992 vfs_mkdir+0x29d/0x450 fs/namei.c:4038 do_mkdirat+0x264/0x520 fs/namei.c:4061 __do_sys_mkdirat fs/namei.c:4076 [inline] __se_sys_mkdirat fs/namei.c:4074 [inline] __x64_sys_mkdirat+0x89/0xa0 fs/namei.c:4074 Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230506142419.984260-1-tytso@mit.edu Reported-by: syzbot+6385d7d3065524c5ca6d@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=6513f6cb5cd6b5fc9f37e3bb70d273b94be9c34c Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: check iomap type only if ext4_iomap_begin() does not failBaokun Li1-1/+1
When ext4_iomap_overwrite_begin() calls ext4_iomap_begin() map blocks may fail for some reason (e.g. memory allocation failure, bare disk write), and later because "iomap->type ! = IOMAP_MAPPED" triggers WARN_ON(). When ext4 iomap_begin() returns an error, it is normal that the type of iomap->type may not match the expectation. Therefore, we only determine if iomap->type is as expected when ext4_iomap_begin() is executed successfully. Cc: stable@kernel.org Reported-by: syzbot+08106c4b7d60702dbc14@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/00000000000015760b05f9b4eee9@google.com Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230505132429.714648-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: avoid a potential slab-out-of-bounds in ext4_group_desc_csumTudor Ambarus1-4/+2
When modifying the block device while it is mounted by the filesystem, syzbot reported the following: BUG: KASAN: slab-out-of-bounds in crc16+0x206/0x280 lib/crc16.c:58 Read of size 1 at addr ffff888075f5c0a8 by task syz-executor.2/15586 CPU: 1 PID: 15586 Comm: syz-executor.2 Not tainted 6.2.0-rc5-syzkaller-00205-gc96618275234 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:306 print_report+0x107/0x1f0 mm/kasan/report.c:417 kasan_report+0xcd/0x100 mm/kasan/report.c:517 crc16+0x206/0x280 lib/crc16.c:58 ext4_group_desc_csum+0x81b/0xb20 fs/ext4/super.c:3187 ext4_group_desc_csum_set+0x195/0x230 fs/ext4/super.c:3210 ext4_mb_clear_bb fs/ext4/mballoc.c:6027 [inline] ext4_free_blocks+0x191a/0x2810 fs/ext4/mballoc.c:6173 ext4_remove_blocks fs/ext4/extents.c:2527 [inline] ext4_ext_rm_leaf fs/ext4/extents.c:2710 [inline] ext4_ext_remove_space+0x24ef/0x46a0 fs/ext4/extents.c:2958 ext4_ext_truncate+0x177/0x220 fs/ext4/extents.c:4416 ext4_truncate+0xa6a/0xea0 fs/ext4/inode.c:4342 ext4_setattr+0x10c8/0x1930 fs/ext4/inode.c:5622 notify_change+0xe50/0x1100 fs/attr.c:482 do_truncate+0x200/0x2f0 fs/open.c:65 handle_truncate fs/namei.c:3216 [inline] do_open fs/namei.c:3561 [inline] path_openat+0x272b/0x2dd0 fs/namei.c:3714 do_filp_open+0x264/0x4f0 fs/namei.c:3741 do_sys_openat2+0x124/0x4e0 fs/open.c:1310 do_sys_open fs/open.c:1326 [inline] __do_sys_creat fs/open.c:1402 [inline] __se_sys_creat fs/open.c:1396 [inline] __x64_sys_creat+0x11f/0x160 fs/open.c:1396 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f72f8a8c0c9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f72f97e3168 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 00007f72f8bac050 RCX: 00007f72f8a8c0c9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000280 RBP: 00007f72f8ae7ae9 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd165348bf R14: 00007f72f97e3300 R15: 0000000000022000 Replace le16_to_cpu(sbi->s_es->s_desc_size) with sbi->s_desc_size It reduces ext4's compiled text size, and makes the code more efficient (we remove an extra indirect reference and a potential byte swap on big endian systems), and there is no downside. It also avoids the potential KASAN / syzkaller failure, as a bonus. Reported-by: syzbot+fc51227e7100c9294894@syzkaller.appspotmail.com Reported-by: syzbot+8785e41224a3afd04321@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=70d28d11ab14bd7938f3e088365252aa923cff42 Link: https://syzkaller.appspot.com/bug?id=b85721b38583ecc6b5e72ff524c67302abbc30f3 Link: https://lore.kernel.org/all/000000000000ece18705f3b20934@google.com/ Fixes: 717d50e4971b ("Ext4: Uninitialized Block Groups") Cc: stable@vger.kernel.org Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Link: https://lore.kernel.org/r/20230504121525.3275886-1-tudor.ambarus@linaro.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: fix data races when using cached status extentsJan Kara1-17/+13
When using cached extent stored in extent status tree in tree->cache_es another process holding ei->i_es_lock for reading can be racing with us setting new value of tree->cache_es. If the compiler would decide to refetch tree->cache_es at an unfortunate moment, it could result in a bogus in_range() check. Fix the possible race by using READ_ONCE() when using tree->cache_es only under ei->i_es_lock for reading. Cc: stable@kernel.org Reported-by: syzbot+4a03518df1e31b537066@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/000000000000d3b33905fa0fd4a6@google.com Suggested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230504125524.10802-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: avoid deadlock in fs reclaim with page writebackJan Kara3-13/+40
Ext4 has a filesystem wide lock protecting ext4_writepages() calls to avoid races with switching of journalled data flag or inode format. This lock can however cause a deadlock like: CPU0 CPU1 ext4_writepages() percpu_down_read(sbi->s_writepages_rwsem); ext4_change_inode_journal_flag() percpu_down_write(sbi->s_writepages_rwsem); - blocks, all readers block from now on ext4_do_writepages() ext4_init_io_end() kmem_cache_zalloc(io_end_cachep, GFP_KERNEL) fs_reclaim frees dentry... dentry_unlink_inode() iput() - last ref => iput_final() - inode dirty => write_inode_now()... ext4_writepages() tries to acquire sbi->s_writepages_rwsem and blocks forever Make sure we cannot recurse into filesystem reclaim from writeback code to avoid the deadlock. Reported-by: syzbot+6898da502aef574c5f8a@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/0000000000004c66b405fa108e27@google.com Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230504124723.20205-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: fix invalid free tracking in ext4_xattr_move_to_block()Theodore Ts'o1-2/+3
In ext4_xattr_move_to_block(), the value of the extended attribute which we need to move to an external block may be allocated by kvmalloc() if the value is stored in an external inode. So at the end of the function the code tried to check if this was the case by testing entry->e_value_inum. However, at this point, the pointer to the xattr entry is no longer valid, because it was removed from the original location where it had been stored. So we could end up calling kvfree() on a pointer which was not allocated by kvmalloc(); or we could also potentially leak memory by not freeing the buffer when it should be freed. Fix this by storing whether it should be freed in a separate variable. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430160426.581366-1-tytso@mit.edu Link: https://syzkaller.appspot.com/bug?id=5c2aee8256e30b55ccf57312c16d88417adbd5e1 Link: https://syzkaller.appspot.com/bug?id=41a6b5d4917c0412eb3b3c3c604965bed7d7420b Reported-by: syzbot+64b645917ce07d89bde5@syzkaller.appspotmail.com Reported-by: syzbot+0d042627c4f2ad332195@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: remove a BUG_ON in ext4_mb_release_group_pa()Theodore Ts'o1-1/+5
If a malicious fuzzer overwrites the ext4 superblock while it is mounted such that the s_first_data_block is set to a very large number, the calculation of the block group can underflow, and trigger a BUG_ON check. Change this to be an ext4_warning so that we don't crash the kernel. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430154311.579720-3-tytso@mit.edu Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: allow ext4_get_group_info() to failTheodore Ts'o5-29/+82
Previously, ext4_get_group_info() would treat an invalid group number as BUG(), since in theory it should never happen. However, if a malicious attaker (or fuzzer) modifies the superblock via the block device while it is the file system is mounted, it is possible for s_first_data_block to get set to a very large number. In that case, when calculating the block group of some block number (such as the starting block of a preallocation region), could result in an underflow and very large block group number. Then the BUG_ON check in ext4_get_group_info() would fire, resutling in a denial of service attack that can be triggered by root or someone with write access to the block device. For a quality of implementation perspective, it's best that even if the system administrator does something that they shouldn't, that it will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info() will call ext4_error and return NULL. We also add fallback code in all of the callers of ext4_get_group_info() that it might NULL. Also, since ext4_get_group_info() was already borderline to be an inline function, un-inline it. The results in a next reduction of the compiled text size of ext4 by roughly 2k. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2023-05-13Merge tag 'for-6.4-rc1-tag' of ↵Linus Torvalds11-33/+102
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull more btrfs fixes from David Sterba: - fix incorrect number of bitmap entries for space cache if loading is interrupted by some error - fix backref walking, this breaks a mode of LOGICAL_INO_V2 ioctl that is used in deduplication tools - zoned mode fixes: - properly finish zone reserved for relocation - correctly calculate super block zone end on ZNS - properly initialize new extent buffer for redirty - make mount option clear_cache work with block-group-tree, to rebuild free-space-tree instead of temporarily disabling it that would lead to a forced read-only mount - fix alignment check for offset when printing extent item * tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: make clear_cache mount option to rebuild FST without disabling it btrfs: zero the buffer before marking it dirty in btrfs_redirty_list_add btrfs: zoned: fix full zone super block reading on ZNS btrfs: zoned: zone finish data relocation BG with last IO btrfs: fix backref walking not returning all inode refs btrfs: fix space cache inconsistency after error loading it from disk btrfs: print-tree: parent bytenr must be aligned to sector size
2023-05-13Merge tag '6.4-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds4-2/+28
Pull cifs client fixes from Steve French: - fix for copy_file_range bug for very large files that are multiples of rsize - do not ignore "isolated transport" flag if set on share - set rasize default better - three fixes related to shutdown and freezing (fixes 4 xfstests, and closes deferred handles faster in some places that were missed) * tag '6.4-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: release leases for deferred close handles when freezing smb3: fix problem remounting a share after shutdown SMB3: force unmount was failing to close deferred close files smb3: improve parallel reads of large files do not reuse connection if share marked as isolated cifs: fix pcchunk length type in smb2_copychunk_range
2023-05-13Merge tag 'vfs/v6.4-rc1/pipe' of ↵Linus Torvalds1-2/+4
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fix from Christian Brauner: "During the pipe nonblock rework the check for both O_NONBLOCK and IOCB_NOWAIT was dropped. Both checks need to be performed to ensure that files without O_NONBLOCK but IOCB_NOWAIT don't block when writing to or reading from a pipe. This just contains the fix adding the check for IOCB_NOWAIT back in" * tag 'vfs/v6.4-rc1/pipe' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: pipe: check for IOCB_NOWAIT alongside O_NONBLOCK
2023-05-12pipe: check for IOCB_NOWAIT alongside O_NONBLOCKJens Axboe1-2/+4
Pipe reads or writes need to enable nonblocking attempts, if either O_NONBLOCK is set on the file, or IOCB_NOWAIT is set in the iocb being passed in. The latter isn't currently true, ensure we check for both before waiting on data or space. Fixes: afed6271f5b0 ("pipe: set FMODE_NOWAIT on pipes") Signed-off-by: Jens Axboe <axboe@kernel.dk> Message-Id: <e5946d67-4e5e-b056-ba80-656bab12d9f6@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-12Merge tag 'xfs-6.4-rc1-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds14-63/+65
Pull xfs bug fixes from Dave Chinner: "Largely minor bug fixes and cleanups, th emost important of which are probably the fixes for regressions in the extent allocation code: - fixes for inode garbage collection shutdown racing with work queue updates - ensure inodegc workers run on the CPU they are supposed to - disable counter scrubbing until we can exclusively freeze the filesystem from the kernel - regression fixes for new allocation related bugs - a couple of minor cleanups" * tag 'xfs-6.4-rc1-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix xfs_inodegc_stop racing with mod_delayed_work xfs: disable reaping in fscounters scrub xfs: check that per-cpu inodegc workers actually run on that cpu xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately xfs: fix negative array access in xfs_getbmap xfs: don't allocate into the data fork for an unshare request xfs: flush dirty data and drain directios before scrubbing cow fork xfs: set bnobt/cntbt numrecs correctly when formatting new AGs xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof
2023-05-11cifs: release leases for deferred close handles when freezingSteve French1-0/+15
We should not be caching closed files when freeze is invoked on an fs (so we can release resources more gracefully). Fixes xfstests generic/068 generic/390 generic/491 Reviewed-by: David Howells <dhowells@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-11Merge tag 'fsnotify_for_v6.4-rc2' of ↵Linus Torvalds1-2/+9
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull inotify fix from Jan Kara: "A fix for possibly reporting invalid watch descriptor with inotify event" * tag 'fsnotify_for_v6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: inotify: Avoid reporting event with invalid wd
2023-05-11Merge tag 'gfs2-v6.3-fix' of ↵Linus Torvalds1-0/+8
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fix from Andreas Gruenbacher: - Fix a NULL pointer dereference when mounting corrupted filesystems * tag 'gfs2-v6.3-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Don't deref jdesc in evict
2023-05-10gfs2: Don't deref jdesc in evictBob Peterson1-0/+8
On corrupt gfs2 file systems the evict code can try to reference the journal descriptor structure, jdesc, after it has been freed and set to NULL. The sequence of events is: init_journal() ... fail_jindex: gfs2_jindex_free(sdp); <------frees journals, sets jdesc = NULL if (gfs2_holder_initialized(&ji_gh)) gfs2_glock_dq_uninit(&ji_gh); fail: iput(sdp->sd_jindex); <--references jdesc in evict_linked_inode evict() gfs2_evict_inode() evict_linked_inode() ret = gfs2_trans_begin(sdp, 0, sdp->sd_jdesc->jd_blocks); <------references the now freed/zeroed sd_jdesc pointer. The call to gfs2_trans_begin is done because the truncate_inode_pages call can cause gfs2 events that require a transaction, such as removing journaled data (jdata) blocks from the journal. This patch fixes the problem by adding a check for sdp->sd_jdesc to function gfs2_evict_inode. In theory, this should only happen to corrupt gfs2 file systems, when gfs2 detects the problem, reports it, then tries to evict all the system inodes it has read in up to that point. Reported-by: Yang Lan <lanyang0908@gmail.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-05-10btrfs: make clear_cache mount option to rebuild FST without disabling itQu Wenruo4-11/+70
Previously clear_cache mount option would simply disable free-space-tree feature temporarily then re-enable it to rebuild the whole free space tree. But this is problematic for block-group-tree feature, as we have an artificial dependency on free-space-tree feature. If we go the existing method, after clearing the free-space-tree feature, we would flip the filesystem to read-only mode, as we detect a super block write with block-group-tree but no free-space-tree feature. This patch would change the behavior by properly rebuilding the free space tree without disabling this feature, thus allowing clear_cache mount option to work with block group tree. Now we can mount a filesystem with block-group-tree feature and clear_mount option: $ mkfs.btrfs -O block-group-tree /dev/test/scratch1 -f $ sudo mount /dev/test/scratch1 /mnt/btrfs -o clear_cache $ sudo dmesg -t | head -n 5 BTRFS info (device dm-1): force clearing of disk cache BTRFS info (device dm-1): using free space tree BTRFS info (device dm-1): auto enabling async discard BTRFS info (device dm-1): rebuilding free space tree BTRFS info (device dm-1): checking UUID tree CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10btrfs: zero the buffer before marking it dirty in btrfs_redirty_list_addChristoph Hellwig1-2/+2
btrfs_redirty_list_add zeroes the buffer data and sets the EXTENT_BUFFER_NO_CHECK to make sure writeback is fine with a bogus header. But it does that after already marking the buffer dirty, which means that writeback could already be looking at the buffer. Switch the order of operations around so that the buffer is only marked dirty when we're ready to write it. Fixes: d3575156f662 ("btrfs: zoned: redirty released extent buffers") CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10btrfs: zoned: fix full zone super block reading on ZNSNaohiro Aota1-4/+3
When both of the superblock zones are full, we need to check which superblock is newer. The calculation of last superblock position is wrong as it does not consider zone_capacity and uses the length. Fixes: 9658b72ef300 ("btrfs: zoned: locate superblock position using zone capacity") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10btrfs: zoned: zone finish data relocation BG with last IONaohiro Aota1-0/+3
For data block groups, we zone finish a zone (or, just deactivate it) when seeing the last IO in btrfs_finish_ordered_io(). That is only called for IOs using ZONE_APPEND, but we use a regular WRITE command for data relocation IOs. Detect it and call btrfs_zone_finish_endio() properly. Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-09btrfs: fix backref walking not returning all inode refsFilipe Manana3-10/+17
When using the logical to ino ioctl v2, if the flag to ignore offsets of file extent items (BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET) is given, the backref walking code ends up not returning references for all file offsets of an inode that point to the given logical bytenr. This happens since kernel 6.2, commit 6ce6ba534418 ("btrfs: use a single argument for extent offset in backref walking functions") because: 1) It mistakenly skipped the search for file extent items in a leaf that point to the target extent if that flag is given. Instead it should only skip the filtering done by check_extent_in_eb() - that is, it should not avoid the calls to that function (or find_extent_in_eb(), which uses it). 2) It was also not building a list of inode extent elements (struct extent_inode_elem) if we have multiple inode references for an extent when the ignore offset flag is given to the logical to ino ioctl - it would leave a single element, only the last one that was found. These stem from the confusing old interface for backref walking functions where we had an extent item offset argument that was a pointer to a u64 and another boolean argument that indicated if the offset should be ignored, but the pointer could be NULL. That NULL case is used by relocation, qgroup extent accounting and fiemap, simply to avoid building the inode extent list for each reference, as it's not necessary for those use cases and therefore avoids memory allocations and some computations. Fix this by adding a boolean argument to the backref walk context structure to indicate that the inode extent list should not be built, make relocation set that argument to true and fix the backref walking logic to skip the calls to check_extent_in_eb() and find_extent_in_eb() only if this new argument is true, instead of 'ignore_extent_item_pos' being true. A test case for fstests will be added soon, to provide cover not only for these cases but to the logical to ino ioctl in general as well, as currently we do not have a test case for it. Reported-by: Vladimir Panteleev <git@vladimir.panteleev.md> Link: https://lore.kernel.org/linux-btrfs/CAHhfkvwo=nmzrJSqZ2qMfF-rZB-ab6ahHnCD_sq9h4o8v+M7QQ@mail.gmail.com/ Fixes: 6ce6ba534418 ("btrfs: use a single argument for extent offset in backref walking functions") CC: stable@vger.kernel.org # 6.2+ Tested-by: Vladimir Panteleev <git@vladimir.panteleev.md> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-09btrfs: fix space cache inconsistency after error loading it from diskFilipe Manana1-3/+4
When loading a free space cache from disk, at __load_free_space_cache(), if we fail to insert a bitmap entry, we still increment the number of total bitmaps in the btrfs_free_space_ctl structure, which is incorrect since we failed to add the bitmap entry. On error we then empty the cache by calling __btrfs_remove_free_space_cache(), which will result in getting the total bitmaps counter set to 1. A failure to load a free space cache is not critical, so if a failure happens we just rebuild the cache by scanning the extent tree, which happens at block-group.c:caching_thread(). Yet the failure will result in having the total bitmaps of the btrfs_free_space_ctl always bigger by 1 then the number of bitmap entries we have. So fix this by having the total bitmaps counter be incremented only if we successfully added the bitmap entry. Fixes: a67509c30079 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache") Reviewed-by: Anand Jain <anand.jain@oracle.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-09btrfs: print-tree: parent bytenr must be aligned to sector sizeAnastasia Belova1-3/+3
Check nodesize to sectorsize in alignment check in print_extent_item. The comment states that and this is correct, similar check is done elsewhere in the functions. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: ea57788eb76d ("btrfs: require only sector size alignment for parent eb bytenr") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anastasia Belova <abelova@astralinux.ru> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-09nfs: fix another case of NULL/IS_ERR confusion wrt folio pointersLinus Torvalds1-1/+1
Dan has been improving on the smatch error pointer checks, and pointed at another case where the __filemap_get_folio() conversion to error pointers had been overlooked. This time because it was hidden behind the filemap_grab_folio() helper function that is a wrapper around it. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Anna Schumaker <anna@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-09Merge tag 'for-6.4-rc1-tag' of ↵Linus Torvalds7-9/+55
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix backward leaf iteration which could possibly return the same key - fix assertion when device add and balance race for exclusive operation - fix regression when freeing device, state tree would leak after device replace - fix attempt to clear space cache v1 when block-group-tree is enabled - fix potential i_size corruption when encoded write races with send v2 and enabled no-holes (the race is hard to hit though, the window is a few instructions wide) - fix wrong bitmap API use when checking empty zones, parameters were swapped but not causing a bug due to other code - prevent potential qgroup leak if subvolume create does not commit transaction (which is pending in the development queue) - error handling and reporting: - abort transaction when sibling keys check fails for leaves - print extent buffers when sibling keys check fails * tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: don't free qgroup space unless specified btrfs: fix encoded write i_size corruption with no-holes btrfs: zoned: fix wrong use of bitops API in btrfs_ensure_empty_zones btrfs: properly reject clear_cache and v1 cache for block-group-tree btrfs: print extent buffers when sibling keys check fails btrfs: abort transaction when sibling keys check fails for leaves btrfs: fix leak of source device allocation state after device replace btrfs: fix assertion of exclop condition when starting balance btrfs: fix btrfs_prev_leaf() to not return the same key twice
2023-05-09smb3: fix problem remounting a share after shutdownSteve French1-0/+7
xfstests generic/392 showed a problem where even after a shutdown call was made on a mount, we would still attempt to use the (now inaccessible) superblock if another mount was attempted for the same share. Reported-by: David Howells <dhowells@redhat.com> Reviewed-by: David Howells <dhowells@redhat.com> Cc: <stable@vger.kernel.org> Fixes: 087f757b0129 ("cifs: add shutdown support") Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-09SMB3: force unmount was failing to close deferred close filesSteve French1-0/+1
In investigating a failure with xfstest generic/392 it was noticed that mounts were reusing a superblock that should already have been freed. This turned out to be related to deferred close files keeping a reference count until the closetimeo expired. Currently the only way an fs knows that mount is beginning is when force unmount is called, but when this, ie umount_begin(), is called all deferred close files on the share (tree connection) should be closed immediately (unless shared by another mount) to avoid using excess resources on the server and to avoid reusing a superblock which should already be freed. In umount_begin, close all deferred close handles for that share if this is the last mount using that share on this client (ie send the SMB3 close request over the wire for those that have been already closed by the app but that we have kept a handle lease open for and have not sent closes to the server for yet). Reported-by: David Howells <dhowells@redhat.com> Acked-by: Bharath SM <bharathsm@microsoft.com> Cc: <stable@vger.kernel.org> Fixes: 78c09634f7dc ("Cifs: Fix kernel oops caused by deferred close for files.") Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-09smb3: improve parallel reads of large filesSteve French1-1/+1
rasize (ra_pages) should be set higher than read size by default to allow parallel reads when reading large files in order to improve performance (otherwise there is much dead time on the network when doing readahead of large files). Default rasize to twice readsize. Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-08do not reuse connection if share marked as isolatedSteve French1-0/+3
"SHAREFLAG_ISOLATED_TRANSPORT" indicates that we should not reuse the socket for this share (for future mounts). Mark the socket as server->nosharesock if share flags returned include SHAREFLAG_ISOLATED_TRANSPORT. See MS-SMB2 MS-SMB2 2.2.10 and 3.2.5.5 Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-08cifs: fix pcchunk length type in smb2_copychunk_rangePawel Witek1-1/+1
Change type of pcchunk->Length from u32 to u64 to match smb2_copychunk_range arguments type. Fixes the problem where performing server-side copy with CIFS_IOC_COPYCHUNK_FILE ioctl resulted in incomplete copy of large files while returning -EINVAL. Fixes: 9bf0c9cd4314 ("CIFS: Fix SMB2/SMB3 Copy offload support (refcopy) for large files") Cc: <stable@vger.kernel.org> Signed-off-by: Pawel Witek <pawel.ireneusz.witek@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-08ext4: fix lockdep warning when enabling MMPJan Kara1-9/+21
When we enable MMP in ext4_multi_mount_protect() during mount or remount, we end up calling sb_start_write() from write_mmp_block(). This triggers lockdep warning because freeze protection ranks above s_umount semaphore we are holding during mount / remount. The problem is harmless because we are guaranteed the filesystem is not frozen during mount / remount but still let's fix the warning by not grabbing freeze protection from ext4_multi_mount_protect(). Cc: stable@kernel.org Reported-by: syzbot+6b7df7d5506b32467149@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=ab7e5b6f400b7778d46f01841422e5718fb81843 Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230411121019.21940-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-08ext4: fix WARNING in mb_find_extentYe Bin1-0/+25
Syzbot found the following issue: EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, O_DIRECT and fast_commit support! EXT4-fs (loop0): orphan cleanup on readonly fs ------------[ cut here ]------------ WARNING: CPU: 1 PID: 5067 at fs/ext4/mballoc.c:1869 mb_find_extent+0x8a1/0xe30 Modules linked in: CPU: 1 PID: 5067 Comm: syz-executor307 Not tainted 6.2.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:mb_find_extent+0x8a1/0xe30 fs/ext4/mballoc.c:1869 RSP: 0018:ffffc90003c9e098 EFLAGS: 00010293 RAX: ffffffff82405731 RBX: 0000000000000041 RCX: ffff8880783457c0 RDX: 0000000000000000 RSI: 0000000000000041 RDI: 0000000000000040 RBP: 0000000000000040 R08: ffffffff82405723 R09: ffffed10053c9402 R10: ffffed10053c9402 R11: 1ffff110053c9401 R12: 0000000000000000 R13: ffffc90003c9e538 R14: dffffc0000000000 R15: ffffc90003c9e2cc FS: 0000555556665300(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056312f6796f8 CR3: 0000000022437000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ext4_mb_complex_scan_group+0x353/0x1100 fs/ext4/mballoc.c:2307 ext4_mb_regular_allocator+0x1533/0x3860 fs/ext4/mballoc.c:2735 ext4_mb_new_blocks+0xddf/0x3db0 fs/ext4/mballoc.c:5605 ext4_ext_map_blocks+0x1868/0x6880 fs/ext4/extents.c:4286 ext4_map_blocks+0xa49/0x1cc0 fs/ext4/inode.c:651 ext4_getblk+0x1b9/0x770 fs/ext4/inode.c:864 ext4_bread+0x2a/0x170 fs/ext4/inode.c:920 ext4_quota_write+0x225/0x570 fs/ext4/super.c:7105 write_blk fs/quota/quota_tree.c:64 [inline] get_free_dqblk+0x34a/0x6d0 fs/quota/quota_tree.c:130 do_insert_tree+0x26b/0x1aa0 fs/quota/quota_tree.c:340 do_insert_tree+0x722/0x1aa0 fs/quota/quota_tree.c:375 do_insert_tree+0x722/0x1aa0 fs/quota/quota_tree.c:375 do_insert_tree+0x722/0x1aa0 fs/quota/quota_tree.c:375 dq_insert_tree fs/quota/quota_tree.c:401 [inline] qtree_write_dquot+0x3b6/0x530 fs/quota/quota_tree.c:420 v2_write_dquot+0x11b/0x190 fs/quota/quota_v2.c:358 dquot_acquire+0x348/0x670 fs/quota/dquot.c:444 ext4_acquire_dquot+0x2dc/0x400 fs/ext4/super.c:6740 dqget+0x999/0xdc0 fs/quota/dquot.c:914 __dquot_initialize+0x3d0/0xcf0 fs/quota/dquot.c:1492 ext4_process_orphan+0x57/0x2d0 fs/ext4/orphan.c:329 ext4_orphan_cleanup+0xb60/0x1340 fs/ext4/orphan.c:474 __ext4_fill_super fs/ext4/super.c:5516 [inline] ext4_fill_super+0x81cd/0x8700 fs/ext4/super.c:5644 get_tree_bdev+0x400/0x620 fs/super.c:1282 vfs_get_tree+0x88/0x270 fs/super.c:1489 do_new_mount+0x289/0xad0 fs/namespace.c:3145 do_mount fs/namespace.c:3488 [inline] __do_sys_mount fs/namespace.c:3697 [inline] __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Add some debug information: mb_find_extent: mb_find_extent block=41, order=0 needed=64 next=0 ex=0/41/1@3735929054 64 64 7 block_bitmap: ff 3f 0c 00 fc 01 00 00 d2 3d 00 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Acctually, blocks per group is 64, but block bitmap indicate at least has 128 blocks. Now, ext4_validate_block_bitmap() didn't check invalid block's bitmap if set. To resolve above issue, add check like fsck "Padding at end of block bitmap is not set". Cc: stable@kernel.org Reported-by: syzbot+68223fe9f6c95ad43bed@syzkaller.appspotmail.com Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116020015.1506120-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-07Merge tag '6.4-rc-smb3-client-fixes-part2' of ↵Linus Torvalds15-223/+376
git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "smb3 client fixes, mostly DFS or reconnect related: - Two DFS connection sharing fixes - DFS refresh fix - Reconnect fix - Two potential use after free fixes - Also print prefix patch in mount debug msg - Two small cleanup fixes" * tag '6.4-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: cifs: Remove unneeded semicolon cifs: fix sharing of DFS connections cifs: avoid potential races when handling multiple dfs tcons cifs: protect access of TCP_Server_Info::{origin,leaf}_fullpath cifs: fix potential race when tree connecting ipc cifs: fix potential use-after-free bugs in TCP_Server_Info::hostname cifs: print smb3_fs_context::source when mounting cifs: protect session status check in smb2_reconnect() SMB3.1.1: correct definition for app_instance_id create contexts
2023-05-06Merge tag 'mm-hotfixes-stable-2023-05-06-10-45' of ↵Linus Torvalds3-8/+20
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "Five hotfixes. Three are cc:stable, two pertain to merge window changes" * tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: afs: fix the afs_dir_get_folio return value nilfs2: do not write dirty data after degenerating to read-only mm: do not reclaim private data from pinned page nilfs2: fix infinite loop in nilfs_mdt_get_block() mm/mmap/vma_merge: always check invariants