summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-10-26nfsd: Fix a memory leak in an error handling pathChristophe JAILLET1-1/+3
[ Upstream commit fd1ef88049de09bc70d60b549992524cfc0e66ff ] If this memdup_user() call fails, the memory allocated in a previous call a few lines above should be freed. Otherwise it leaks. Fixes: 6ee95d1c8991 ("nfsd: add support for upcall version 2") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-26userfaultfd: open userfaultfds with O_RDONLYOndrej Mosnacek1-2/+2
[ Upstream commit abec3d015fdfb7c63105c7e1c956188bf381aa55 ] Since userfaultfd doesn't implement a write operation, it is more appropriate to open it read-only. When userfaultfds are opened read-write like it is now, and such fd is passed from one process to another, SELinux will check both read and write permissions for the target process, even though it can't actually do any write operation on the fd later. Inspired by the following bug report, which has hit the SELinux scenario described above: https://bugzilla.redhat.com/show_bug.cgi?id=1974559 Reported-by: Robert O'Callahan <roc@ocallahan.org> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-26smb3: must initialize two ACL struct fields to zeroSteve French1-1/+2
commit f09bd695af3b8ab46fc24e5d6954a24104c38387 upstream. Coverity spotted that we were not initalizing Stbz1 and Stbz2 to zero in create_sd_buf. Addresses-Coverity: 1513848 ("Uninitialized scalar variable") Cc: <stable@vger.kernel.org> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26ext4: update 'state->fc_regions_size' after successful memory allocationYe Bin1-4/+5
commit 27cd49780381c6ccbf248798e5e8fd076200ffba upstream. To avoid to 'state->fc_regions_size' mismatch with 'state->fc_regions' when fail to reallocate 'fc_reqions',only update 'state->fc_regions_size' after 'state->fc_regions' is allocated successfully. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-4-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26ext4: fix potential memory leak in ext4_fc_record_regions()Ye Bin1-6/+8
commit 7069d105c1f15c442b68af43f7fde784f3126739 upstream. As krealloc may return NULL, in this case 'state->fc_regions' may not be freed by krealloc, but 'state->fc_regions' already set NULL. Then will lead to 'state->fc_regions' memory leak. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26ext4: fix potential memory leak in ext4_fc_record_modified_inode()Ye Bin1-3/+5
commit 9305721a309fa1bd7c194e0d4a2335bf3b29dca4 upstream. As krealloc may return NULL, in this case 'state->fc_modified_inodes' may not be freed by krealloc, but 'state->fc_modified_inodes' already set NULL. Then will lead to 'state->fc_modified_inodes' memory leak. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26ext4: fix miss release buffer head in ext4_fc_write_inodeYe Bin1-6/+9
commit ccbf8eeb39f2ff00b54726a2b20b35d788c4ecb5 upstream. In 'ext4_fc_write_inode' function first call 'ext4_get_inode_loc' get 'iloc', after use it miss release 'iloc.bh'. So just release 'iloc.bh' before 'ext4_fc_write_inode' return. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220914100859.1415196-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26ext4: place buffer head allocation before handle startJinke Han1-0/+7
commit d1052d236eddf6aa851434db1897b942e8db9921 upstream. In our product environment, we encounter some jbd hung waiting handles to stop while several writters were doing memory reclaim for buffer head allocation in delay alloc write path. Ext4 do buffer head allocation with holding transaction handle which may be blocked too long if the reclaim works not so smooth. According to our bcc trace, the reclaim time in buffer head allocation can reach 258s and the jbd transaction commit also take almost the same time meanwhile. Except for these extreme cases, we often see several seconds delays for cgroup memory reclaim on our servers. This is more likely to happen considering docker environment. One thing to note, the allocation of buffer heads is as often as page allocation or more often when blocksize less than page size. Just like page cache allocation, we should also place the buffer head allocation before startting the handle. Cc: stable@kernel.org Signed-off-by: Jinke Han <hanjinke.666@bytedance.com> Link: https://lore.kernel.org/r/20220903012429.22555-1-hanjinke.666@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26ext4: ext4_read_bh_lock() should submit IO if the buffer isn't uptodateZhang Yi1-11/+5
commit 0b73284c564d3ae4feef4bc920292f004acf4980 upstream. Recently we notice that ext4 filesystem would occasionally fail to read metadata from disk and report error message, but the disk and block layer looks fine. After analyse, we lockon commit 88dbcbb3a484 ("blkdev: avoid migration stalls for blkdev pages"). It provide a migration method for the bdev, we could move page that has buffers without extra users now, but it lock the buffers on the page, which breaks the fragile metadata read operation on ext4 filesystem, ext4_read_bh_lock() was copied from ll_rw_block(), it depends on the assumption of that locked buffer means it is under IO. So it just trylock the buffer and skip submit IO if it lock failed, after wait_on_buffer() we conclude IO error because the buffer is not uptodate. This issue could be easily reproduced by add some delay just after buffer_migrate_lock_buffers() in __buffer_migrate_folio() and do fsstress on ext4 filesystem. EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #73193: comm fsstress: reading directory lblock 0 EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #75334: comm fsstress: reading directory lblock 0 Fix it by removing the trylock logic in ext4_read_bh_lock(), just lock the buffer and submit IO if it's not uptodate, and also leave over readahead helper. Cc: stable@kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220831074629.3755110-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26ext4: don't increase iversion counter for ea_inodesLukas Czerner1-1/+6
commit 50f094a5580e6297bf10a807d16f0ee23fa576cf upstream. ea_inodes are using i_version for storing part of the reference count so we really need to leave it alone. The problem can be reproduced by xfstest ext4/026 when iversion is enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL inodes in ext4_mark_iloc_dirty(). Cc: stable@kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20220824160349.39664-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26ext4: fix check for block being out of directory sizeJan Kara1-1/+1
commit 61a1d87a324ad5e3ed27c6699dfc93218fcf3201 upstream. The check in __ext4_read_dirblock() for block being outside of directory size was wrong because it compared block number against directory size in bytes. Fix it. Fixes: 65f8ea4cd57d ("ext4: check if directory block is within i_size") CVE: CVE-2022-1184 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220822114832.1482-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26ext4: make ext4_lazyinit_thread freezableLalith Rajendran1-0/+1
commit 3b575495ab8dbb4dbe85b4ac7f991693c3668ff5 upstream. ext4_lazyinit_thread is not set freezable. Hence when the thread calls try_to_freeze it doesn't freeze during suspend and continues to send requests to the storage during suspend, resulting in suspend failures. Cc: stable@kernel.org Signed-off-by: Lalith Rajendran <lalithkraj@google.com> Link: https://lore.kernel.org/r/20220818214049.1519544-1-lalithkraj@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26ext4: fix null-ptr-deref in ext4_write_infoBaokun Li1-1/+1
commit f9c1f248607d5546075d3f731e7607d5571f2b60 upstream. I caught a null-ptr-deref bug as follows: ================================================================== KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f] CPU: 1 PID: 1589 Comm: umount Not tainted 5.10.0-02219-dirty #339 RIP: 0010:ext4_write_info+0x53/0x1b0 [...] Call Trace: dquot_writeback_dquots+0x341/0x9a0 ext4_sync_fs+0x19e/0x800 __sync_filesystem+0x83/0x100 sync_filesystem+0x89/0xf0 generic_shutdown_super+0x79/0x3e0 kill_block_super+0xa1/0x110 deactivate_locked_super+0xac/0x130 deactivate_super+0xb6/0xd0 cleanup_mnt+0x289/0x400 __cleanup_mnt+0x16/0x20 task_work_run+0x11c/0x1c0 exit_to_user_mode_prepare+0x203/0x210 syscall_exit_to_user_mode+0x5b/0x3a0 do_syscall_64+0x59/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ================================================================== Above issue may happen as follows: ------------------------------------- exit_to_user_mode_prepare task_work_run __cleanup_mnt cleanup_mnt deactivate_super deactivate_locked_super kill_block_super generic_shutdown_super shrink_dcache_for_umount dentry = sb->s_root sb->s_root = NULL <--- Here set NULL sync_filesystem __sync_filesystem sb->s_op->sync_fs > ext4_sync_fs dquot_writeback_dquots sb->dq_op->write_info > ext4_write_info ext4_journal_start(d_inode(sb->s_root), EXT4_HT_QUOTA, 2) d_inode(sb->s_root) s_root->d_inode <--- Null pointer dereference To solve this problem, we use ext4_journal_start_sb directly to avoid s_root being used. Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220805123947.565152-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26ext4: avoid crash when inline data creation follows DIO writeJan Kara1-0/+6
commit 4bb26f2885ac6930984ee451b952c5a6042f2c0e upstream. When inode is created and written to using direct IO, there is nothing to clear the EXT4_STATE_MAY_INLINE_DATA flag. Thus when inode gets truncated later to say 1 byte and written using normal write, we will try to store the data as inline data. This confuses the code later because the inode now has both normal block and inline data allocated and the confusion manifests for example as: kernel BUG at fs/ext4/inode.c:2721! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 359 Comm: repro Not tainted 5.19.0-rc8-00001-g31ba1e3b8305-dirty #15 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 RIP: 0010:ext4_writepages+0x363d/0x3660 RSP: 0018:ffffc90000ccf260 EFLAGS: 00010293 RAX: ffffffff81e1abcd RBX: 0000008000000000 RCX: ffff88810842a180 RDX: 0000000000000000 RSI: 0000008000000000 RDI: 0000000000000000 RBP: ffffc90000ccf650 R08: ffffffff81e17d58 R09: ffffed10222c680b R10: dfffe910222c680c R11: 1ffff110222c680a R12: ffff888111634128 R13: ffffc90000ccf880 R14: 0000008410000000 R15: 0000000000000001 FS: 00007f72635d2640(0000) GS:ffff88811b000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000565243379180 CR3: 000000010aa74000 CR4: 0000000000150eb0 Call Trace: <TASK> do_writepages+0x397/0x640 filemap_fdatawrite_wbc+0x151/0x1b0 file_write_and_wait_range+0x1c9/0x2b0 ext4_sync_file+0x19e/0xa00 vfs_fsync_range+0x17b/0x190 ext4_buffered_write_iter+0x488/0x530 ext4_file_write_iter+0x449/0x1b90 vfs_write+0xbcd/0xf40 ksys_write+0x198/0x2c0 __x64_sys_write+0x7b/0x90 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> Fix the problem by clearing EXT4_STATE_MAY_INLINE_DATA when we are doing direct IO write to a file. Cc: stable@kernel.org Reported-by: Tadeusz Struk <tadeusz.struk@linaro.org> Reported-by: syzbot+bd13648a53ed6933ca49@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=a1e89d09bbbcbd5c4cb45db230ee28c822953984 Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Tested-by: Tadeusz Struk<tadeusz.struk@linaro.org> Link: https://lore.kernel.org/r/20220727155753.13969-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26jbd2: add miss release buffer head in fc_do_one_pass()Ye Bin1-0/+1
commit dfff66f30f66b9524b661f311bbed8ff3d2ca49f upstream. In fc_do_one_pass() miss release buffer head after use which will lead to reference count leak. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220917093805.1782845-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26jbd2: fix potential use-after-free in jbd2_fc_wait_bufsYe Bin1-3/+3
commit 243d1a5d505d0b0460c9af0ad56ed4a56ef0bebd upstream. In 'jbd2_fc_wait_bufs' use 'bh' after put buffer head reference count which may lead to use-after-free. So judge buffer if uptodate before put buffer head reference count. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220914100812.1414768-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26jbd2: fix potential buffer head reference count leakYe Bin1-1/+7
commit e0d5fc7a6d80ac2406c7dfc6bb625201d0250a8a upstream. As in 'jbd2_fc_wait_bufs' if buffer isn't uptodate, will return -EIO without update 'journal->j_fc_off'. But 'jbd2_fc_release_bufs' will release buffer head from ‘j_fc_off - 1’ if 'bh' is NULL will terminal release which will lead to buffer head buffer head reference count leak. To solve above issue, update 'journal->j_fc_off' before return -EIO. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220914100812.1414768-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26jbd2: wake up journal waiters in FIFO order, not LIFOAndrew Perepechko2-4/+4
commit 34fc8768ec6089565d6d73bad26724083cecf7bd upstream. LIFO wakeup order is unfair and sometimes leads to a journal user not being able to get a journal handle for hundreds of transactions in a row. FIFO wakeup can make things more fair. Cc: stable@kernel.org Signed-off-by: Alexey Lyashkov <alexey.lyashkov@gmail.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220907165959.1137482-1-alexey.lyashkov@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26f2fs: fix to do sanity check on summary infoChao Yu2-4/+21
commit c6ad7fd16657ebd34a87a97d9588195aae87597d upstream. As Wenqing Liu reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216456 BUG: KASAN: use-after-free in recover_data+0x63ae/0x6ae0 [f2fs] Read of size 4 at addr ffff8881464dcd80 by task mount/1013 CPU: 3 PID: 1013 Comm: mount Tainted: G W 6.0.0-rc4 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 Call Trace: dump_stack_lvl+0x45/0x5e print_report.cold+0xf3/0x68d kasan_report+0xa8/0x130 recover_data+0x63ae/0x6ae0 [f2fs] f2fs_recover_fsync_data+0x120d/0x1fc0 [f2fs] f2fs_fill_super+0x4665/0x61e0 [f2fs] mount_bdev+0x2cf/0x3b0 legacy_get_tree+0xed/0x1d0 vfs_get_tree+0x81/0x2b0 path_mount+0x47e/0x19d0 do_mount+0xce/0xf0 __x64_sys_mount+0x12c/0x1a0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd The root cause is: in fuzzed image, SSA table is corrupted: ofs_in_node is larger than ADDRS_PER_PAGE(), result in out-of-range access on 4k-size page. - recover_data - do_recover_data - check_index_in_prev_nodes - f2fs_data_blkaddr This patch adds sanity check on summary info in recovery and GC flow in where the flows rely on them. After patch: [ 29.310883] F2FS-fs (loop0): Inconsistent ofs_in_node:65286 in summary, ino:0, nid:6, max:1018 Cc: stable@vger.kernel.org Reported-by: Wenqing Liu <wenqingliu0120@gmail.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26f2fs: fix to do sanity check on destination blkaddr during recoveryChao Yu3-1/+21
commit 0ef4ca04a3f9223ff8bc440041c524b2123e09a3 upstream. As Wenqing Liu reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216456 loop5: detected capacity change from 0 to 131072 F2FS-fs (loop5): recover_inode: ino = 6, name = hln, inline = 1 F2FS-fs (loop5): recover_data: ino = 6 (i_size: recover) err = 0 F2FS-fs (loop5): recover_inode: ino = 6, name = hln, inline = 1 F2FS-fs (loop5): recover_data: ino = 6 (i_size: recover) err = 0 F2FS-fs (loop5): recover_inode: ino = 6, name = hln, inline = 1 F2FS-fs (loop5): recover_data: ino = 6 (i_size: recover) err = 0 F2FS-fs (loop5): Bitmap was wrongly set, blk:5634 ------------[ cut here ]------------ WARNING: CPU: 3 PID: 1013 at fs/f2fs/segment.c:2198 RIP: 0010:update_sit_entry+0xa55/0x10b0 [f2fs] Call Trace: <TASK> f2fs_do_replace_block+0xa98/0x1890 [f2fs] f2fs_replace_block+0xeb/0x180 [f2fs] recover_data+0x1a69/0x6ae0 [f2fs] f2fs_recover_fsync_data+0x120d/0x1fc0 [f2fs] f2fs_fill_super+0x4665/0x61e0 [f2fs] mount_bdev+0x2cf/0x3b0 legacy_get_tree+0xed/0x1d0 vfs_get_tree+0x81/0x2b0 path_mount+0x47e/0x19d0 do_mount+0xce/0xf0 __x64_sys_mount+0x12c/0x1a0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd If we enable CONFIG_F2FS_CHECK_FS config, it will trigger a kernel panic instead of warning. The root cause is: in fuzzed image, SIT table is inconsistent with inode mapping table, result in triggering such warning during SIT table update. This patch introduces a new flag DATA_GENERIC_ENHANCE_UPDATE, w/ this flag, data block recovery flow can check destination blkaddr's validation in SIT table, and skip f2fs_replace_block() to avoid inconsistent status. Cc: stable@vger.kernel.org Reported-by: Wenqing Liu <wenqingliu0120@gmail.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26f2fs: increase the limit for reserve_rootJaegeuk Kim1-2/+2
commit da35fe96d12d15779f3cb74929b7ed03941cf983 upstream. This patch increases the threshold that limits the reserved root space from 0.2% to 12.5% by using simple shift operation. Typically Android sets 128MB, but if the storage capacity is 32GB, 0.2% which is around 64MB becomes too small. Let's relax it. Cc: stable@vger.kernel.org Reported-by: Aran Dalton <arda@allwinnertech.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26btrfs: fix race between quota enable and quota rescan ioctlFilipe Manana1-0/+15
commit 331cd9461412e103d07595a10289de90004ac890 upstream. When enabling quotas, at btrfs_quota_enable(), after committing the transaction, we change fs_info->quota_root to point to the quota root we created and set BTRFS_FS_QUOTA_ENABLED at fs_info->flags. Then we try to start the qgroup rescan worker, first by initializing it with a call to qgroup_rescan_init() - however if that fails we end up freeing the quota root but we leave fs_info->quota_root still pointing to it, this can later result in a use-after-free somewhere else. We have previously set the flags BTRFS_FS_QUOTA_ENABLED and BTRFS_QGROUP_STATUS_FLAG_ON, so we can only fail with -EINPROGRESS at btrfs_quota_enable(), which is possible if someone already called the quota rescan ioctl, and therefore started the rescan worker. So fix this by ignoring an -EINPROGRESS and asserting we can't get any other error. Reported-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/linux-btrfs/20220823015931.421355-1-yebin10@huawei.com/ CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26quota: Check next/prev free block number after reading from quota fileZhihao Cheng1-0/+38
commit 6c8ea8b8cd4722efd419f91ca46a2dc81b7d89a3 upstream. Following process: Init: v2_read_file_info: <3> dqi_free_blk 0 dqi_free_entry 5 dqi_blks 6 Step 1. chown bin f_a -> dquot_acquire -> v2_write_dquot: qtree_write_dquot do_insert_tree find_free_dqentry get_free_dqblk write_blk(info->dqi_blocks) // info->dqi_blocks = 6, failure. The content in physical block (corresponding to blk 6) is random. Step 2. chown root f_a -> dquot_transfer -> dqput_all -> dqput -> ext4_release_dquot -> v2_release_dquot -> qtree_delete_dquot: dquot_release remove_tree free_dqentry put_free_dqblk(6) info->dqi_free_blk = blk // info->dqi_free_blk = 6 Step 3. drop cache (buffer head for block 6 is released) Step 4. chown bin f_b -> dquot_acquire -> commit_dqblk -> v2_write_dquot: qtree_write_dquot do_insert_tree find_free_dqentry get_free_dqblk dh = (struct qt_disk_dqdbheader *)buf blk = info->dqi_free_blk // 6 ret = read_blk(info, blk, buf) // The content of buf is random info->dqi_free_blk = le32_to_cpu(dh->dqdh_next_free) // random blk Step 5. chown bin f_c -> notify_change -> ext4_setattr -> dquot_transfer: dquot = dqget -> acquire_dquot -> ext4_acquire_dquot -> dquot_acquire -> commit_dqblk -> v2_write_dquot -> dq_insert_tree: do_insert_tree find_free_dqentry get_free_dqblk blk = info->dqi_free_blk // If blk < 0 and blk is not an error code, it will be returned as dquot transfer_to[USRQUOTA] = dquot // A random negative value __dquot_transfer(transfer_to) dquot_add_inodes(transfer_to[cnt]) spin_lock(&dquot->dq_dqb_lock) // page fault , which will lead to kernel page fault: Quota error (device sda): qtree_write_dquot: Error -8000 occurred while creating quota BUG: unable to handle page fault for address: ffffffffffffe120 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page Oops: 0002 [#1] PREEMPT SMP CPU: 0 PID: 5974 Comm: chown Not tainted 6.0.0-rc1-00004 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:_raw_spin_lock+0x3a/0x90 Call Trace: dquot_add_inodes+0x28/0x270 __dquot_transfer+0x377/0x840 dquot_transfer+0xde/0x540 ext4_setattr+0x405/0x14d0 notify_change+0x68e/0x9f0 chown_common+0x300/0x430 __x64_sys_fchownat+0x29/0x40 In order to avoid accessing invalid quota memory address, this patch adds block number checking of next/prev free block read from quota file. Fetch a reproducer in [Link]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216372 Fixes: 1da177e4c3f4152 ("Linux-2.6.12-rc2") CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220923134555.2623931-2-chengzhihao1@huawei.com Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26fs: dlm: handle -EBUSY first in lock arg validationAlexander Aring1-8/+8
commit 44637ca41d551d409a481117b07fa209b330fca9 upstream. During lock arg validation, first check for -EBUSY cases, then for -EINVAL cases. The -EINVAL checks look at lkb state variables which are not stable when an lkb is busy and would cause an -EBUSY result, e.g. lkb->lkb_grmode. Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26fs: dlm: fix race between test_bit() and queue_work()Alexander Aring1-2/+4
commit eef6ec9bf390e836a6c4029f3620fe49528aa1fe upstream. This patch fixes a race by using ls_cb_mutex around the bit operations and conditional code blocks for LSFL_CB_DELAY. The function dlm_callback_stop() expects to stop all callbacks and flush all currently queued onces. The set_bit() is not enough because there can still be queue_work() after the workqueue was flushed. To avoid queue_work() after set_bit(), surround both by ls_cb_mutex. Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26cifs: Fix the error length of VALIDATE_NEGOTIATE_INFO messageZhang Xiaoxu1-2/+2
commit e98ecc6e94f4e6d21c06660b0f336df02836694f upstream. Commit d5c7076b772a ("smb3: add smb3.1.1 to default dialect list") extend the dialects from 3 to 4, but forget to decrease the extended length when specific the dialect, then the message length is larger than expected. This maybe leak some info through network because not initialize the message body. After apply this patch, the VALIDATE_NEGOTIATE_INFO message length is reduced from 28 bytes to 26 bytes. Fixes: d5c7076b772a ("smb3: add smb3.1.1 to default dialect list") Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Cc: <stable@vger.kernel.org> Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26cifs: destage dirty pages before re-reading them for cache=noneRonnie Sahlberg1-0/+9
commit bb44c31cdcac107344dd2fcc3bd0504a53575c51 upstream. This is the opposite case of kernel bugzilla 216301. If we mmap a file using cache=none and then proceed to update the mmapped area these updates are not reflected in a later pread() of that part of the file. To fix this we must first destage any dirty pages in the range before we allow the pread() to proceed. Cc: stable@vger.kernel.org Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-17Revert "fs: check FMODE_LSEEK to control internal pipe splicing"Sasha Levin1-4/+6
This reverts commit fd0a6e99b61e6c08fa5cf585d54fd956f70c73a6. Which was upstream commit 97ef77c52b789ec1411d360ed99dca1efe4b2c81. The commit is missing dependencies and breaks NFS tests, remove it for now. Reported-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-15fs: fix UAF/GPF bug in nilfs_mdt_destroyDongliang Mu1-4/+3
commit 2e488f13755ffbb60f307e991b27024716a33b29 upstream. In alloc_inode, inode_init_always() could return -ENOMEM if security_inode_alloc() fails, which causes inode->i_private uninitialized. Then nilfs_is_metadata_file_inode() returns true and nilfs_free_inode() wrongly calls nilfs_mdt_destroy(), which frees the uninitialized inode->i_private and leads to crashes(e.g., UAF/GPF). Fix this by moving security_inode_alloc just prior to this_cpu_inc(nr_inodes) Link: https://lkml.kernel.org/r/CAFcO6XOcf1Jj2SeGt=jJV59wmhESeSKpfR0omdFRq+J9nD1vfQ@mail.gmail.com Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: Jiacheng Xu <stitch@zju.edu.cn> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-15ceph: don't truncate file in atomic_openHu Weiwen1-3/+7
commit 7cb9994754f8a36ae9e5ec4597c5c4c2d6c03832 upstream. Clear O_TRUNC from the flags sent in the MDS create request. `atomic_open' is called before permission check. We should not do any modification to the file here. The caller will do the truncation afterward. Fixes: 124e68e74099 ("ceph: file operations") Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> [Xiubo: fixed a trivial conflict for 5.10 backport] Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-15nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failureRyusuke Konishi1-4/+10
commit 723ac751208f6d6540191689cfbf6c77135a7a1b upstream. If creation or finalization of a checkpoint fails due to anomalies in the checkpoint metadata on disk, a kernel warning is generated. This patch replaces the WARN_ONs by nilfs_error, so that a kernel, booted with panic_on_warn, does not panic. A nilfs_error is appropriate here to handle the abnormal filesystem condition. This also replaces the detected error codes with an I/O error so that neither of the internal error codes is returned to callers. Link: https://lkml.kernel.org/r/20220929123330.19658-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+fbb3e0b24e8dae5a16ee@syzkaller.appspotmail.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-15nilfs2: fix leak of nilfs_root in case of writer thread creation failureRyusuke Konishi1-4/+3
commit d0d51a97063db4704a5ef6bc978dddab1636a306 upstream. If nilfs_attach_log_writer() failed to create a log writer thread, it frees a data structure of the log writer without any cleanup. After commit e912a5b66837 ("nilfs2: use root object to get ifile"), this causes a leak of struct nilfs_root, which started to leak an ifile metadata inode and a kobject on that struct. In addition, if the kernel is booted with panic_on_warn, the above ifile metadata inode leak will cause the following panic when the nilfs2 kernel module is removed: kmem_cache_destroy nilfs2_inode_cache: Slab cache still has objects when called from nilfs_destroy_cachep+0x16/0x3a [nilfs2] WARNING: CPU: 8 PID: 1464 at mm/slab_common.c:494 kmem_cache_destroy+0x138/0x140 ... RIP: 0010:kmem_cache_destroy+0x138/0x140 Code: 00 20 00 00 e8 a9 55 d8 ff e9 76 ff ff ff 48 8b 53 60 48 c7 c6 20 70 65 86 48 c7 c7 d8 69 9c 86 48 8b 4c 24 28 e8 ef 71 c7 00 <0f> 0b e9 53 ff ff ff c3 48 81 ff ff 0f 00 00 77 03 31 c0 c3 53 48 ... Call Trace: <TASK> ? nilfs_palloc_freev.cold.24+0x58/0x58 [nilfs2] nilfs_destroy_cachep+0x16/0x3a [nilfs2] exit_nilfs_fs+0xa/0x1b [nilfs2] __x64_sys_delete_module+0x1d9/0x3a0 ? __sanitizer_cov_trace_pc+0x1a/0x50 ? syscall_trace_enter.isra.19+0x119/0x190 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ... </TASK> Kernel panic - not syncing: panic_on_warn set ... This patch fixes these issues by calling nilfs_detach_log_writer() cleanup function if spawning the log writer thread fails. Link: https://lkml.kernel.org/r/20221007085226.57667-1-konishi.ryusuke@gmail.com Fixes: e912a5b66837 ("nilfs2: use root object to get ifile") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+7381dc4ad60658ca4c05@syzkaller.appspotmail.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-15nilfs2: fix use-after-free bug of struct nilfs_rootRyusuke Konishi1-1/+16
commit d325dc6eb763c10f591c239550b8c7e5466a5d09 upstream. If the beginning of the inode bitmap area is corrupted on disk, an inode with the same inode number as the root inode can be allocated and fail soon after. In this case, the subsequent call to nilfs_clear_inode() on that bogus root inode will wrongly decrement the reference counter of struct nilfs_root, and this will erroneously free struct nilfs_root, causing kernel oopses. This fixes the problem by changing nilfs_new_inode() to skip reserved inode numbers while repairing the inode bitmap. Link: https://lkml.kernel.org/r/20221003150519.39789-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+b8c672b0e22615c80fe0@syzkaller.appspotmail.com Reported-by: Khalid Masum <khalid.masum.92@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-15nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level()Ryusuke Konishi1-0/+2
commit 21a87d88c2253350e115029f14fe2a10a7e6c856 upstream. If the i_mode field in inode of metadata files is corrupted on disk, it can cause the initialization of bmap structure, which should have been called from nilfs_read_inode_common(), not to be called. This causes a lockdep warning followed by a NULL pointer dereference at nilfs_bmap_lookup_at_level(). This patch fixes these issues by adding a missing sanitiy check for the i_mode field of metadata file's inode. Link: https://lkml.kernel.org/r/20221002030804.29978-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+2b32eb36c1a825b7a74c@syzkaller.appspotmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-05ntfs: fix BUG_ON in ntfs_lookup_inode_by_name()ChenXiaoSong1-1/+2
commit 1b513f613731e2afc05550e8070d79fac80c661e upstream. Syzkaller reported BUG_ON as follows: ------------[ cut here ]------------ kernel BUG at fs/ntfs/dir.c:86! invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 3 PID: 758 Comm: a.out Not tainted 5.19.0-next-20220808 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:ntfs_lookup_inode_by_name+0xd11/0x2d10 Code: ff e9 b9 01 00 00 e8 1e fe d6 fe 48 8b 7d 98 49 8d 5d 07 e8 91 85 29 ff 48 c7 45 98 00 00 00 00 e9 5a fb ff ff e8 ff fd d6 fe <0f> 0b e8 f8 fd d6 fe 0f 0b e8 f1 fd d6 fe 48 8b b5 50 ff ff ff 4c RSP: 0018:ffff888079607978 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000008000 RCX: 0000000000000000 RDX: ffff88807cf10000 RSI: ffffffff82a4a081 RDI: 0000000000000003 RBP: ffff888079607a70 R08: 0000000000000001 R09: ffff88807a6d01d7 R10: ffffed100f4da03a R11: 0000000000000000 R12: ffff88800f0fb110 R13: ffff88800f0ee000 R14: ffff88800f0fb000 R15: 0000000000000001 FS: 00007f33b63c7540(0000) GS:ffff888108580000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f33b635c090 CR3: 000000000f39e005 CR4: 0000000000770ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> load_system_files+0x1f7f/0x3620 ntfs_fill_super+0xa01/0x1be0 mount_bdev+0x36a/0x440 ntfs_mount+0x3a/0x50 legacy_get_tree+0xfb/0x210 vfs_get_tree+0x8f/0x2f0 do_new_mount+0x30a/0x760 path_mount+0x4de/0x1880 __x64_sys_mount+0x2b3/0x340 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f33b62ff9ea Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd0c471aa8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f33b62ff9ea RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffd0c471be0 RBP: 00007ffd0c471c60 R08: 00007ffd0c471ae0 R09: 00007ffd0c471c24 R10: 0000000000000000 R11: 0000000000000202 R12: 000055bac5afc160 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- Fix this by adding sanity check on extended system files' directory inode to ensure that it is directory, just like ntfs_extend_init() when mounting ntfs3. Link: https://lkml.kernel.org/r/20220809064730.2316892-1-chenxiaosong2@huawei.com Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-05btrfs: fix hang during unmount when stopping a space reclaim workerFilipe Manana1-0/+25
[ Upstream commit a362bb864b8db4861977d00bd2c3222503ccc34b ] Often when running generic/562 from fstests we can hang during unmount, resulting in a trace like this: Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00 Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds. Sep 07 11:55:32 debian9 kernel: Not tainted 6.0.0-rc2-btrfs-next-122 #1 Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 07 11:55:32 debian9 kernel: task:umount state:D stack: 0 pid:49438 ppid: 25683 flags:0x00004000 Sep 07 11:55:32 debian9 kernel: Call Trace: Sep 07 11:55:32 debian9 kernel: <TASK> Sep 07 11:55:32 debian9 kernel: __schedule+0x3c8/0xec0 Sep 07 11:55:32 debian9 kernel: ? rcu_read_lock_sched_held+0x12/0x70 Sep 07 11:55:32 debian9 kernel: schedule+0x5d/0xf0 Sep 07 11:55:32 debian9 kernel: schedule_timeout+0xf1/0x130 Sep 07 11:55:32 debian9 kernel: ? lock_release+0x224/0x4a0 Sep 07 11:55:32 debian9 kernel: ? lock_acquired+0x1a0/0x420 Sep 07 11:55:32 debian9 kernel: ? trace_hardirqs_on+0x2c/0xd0 Sep 07 11:55:32 debian9 kernel: __wait_for_common+0xac/0x200 Sep 07 11:55:32 debian9 kernel: ? usleep_range_state+0xb0/0xb0 Sep 07 11:55:32 debian9 kernel: __flush_work+0x26d/0x530 Sep 07 11:55:32 debian9 kernel: ? flush_workqueue_prep_pwqs+0x140/0x140 Sep 07 11:55:32 debian9 kernel: ? trace_clock_local+0xc/0x30 Sep 07 11:55:32 debian9 kernel: __cancel_work_timer+0x11f/0x1b0 Sep 07 11:55:32 debian9 kernel: ? close_ctree+0x12b/0x5b3 [btrfs] Sep 07 11:55:32 debian9 kernel: ? __trace_bputs+0x10b/0x170 Sep 07 11:55:32 debian9 kernel: close_ctree+0x152/0x5b3 [btrfs] Sep 07 11:55:32 debian9 kernel: ? evict_inodes+0x166/0x1c0 Sep 07 11:55:32 debian9 kernel: generic_shutdown_super+0x71/0x120 Sep 07 11:55:32 debian9 kernel: kill_anon_super+0x14/0x30 Sep 07 11:55:32 debian9 kernel: btrfs_kill_super+0x12/0x20 [btrfs] Sep 07 11:55:32 debian9 kernel: deactivate_locked_super+0x2e/0xa0 Sep 07 11:55:32 debian9 kernel: cleanup_mnt+0x100/0x160 Sep 07 11:55:32 debian9 kernel: task_work_run+0x59/0xa0 Sep 07 11:55:32 debian9 kernel: exit_to_user_mode_prepare+0x1a6/0x1b0 Sep 07 11:55:32 debian9 kernel: syscall_exit_to_user_mode+0x16/0x40 Sep 07 11:55:32 debian9 kernel: do_syscall_64+0x48/0x90 Sep 07 11:55:32 debian9 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7 Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7 Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0 Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570 Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000 Sep 07 11:55:32 debian9 kernel: </TASK> What happens is the following: 1) The cleaner kthread tries to start a transaction to delete an unused block group, but the metadata reservation can not be satisfied right away, so a reservation ticket is created and it starts the async metadata reclaim task (fs_info->async_reclaim_work); 2) Writeback for all the filler inodes with an i_size of 2K starts (generic/562 creates a lot of 2K files with the goal of filling metadata space). We try to create an inline extent for them, but we fail when trying to insert the inline extent with -ENOSPC (at cow_file_range_inline()) - since this is not critical, we fallback to non-inline mode (back to cow_file_range()), reserve extents, create extent maps and create the ordered extents; 3) An unmount starts, enters close_ctree(); 4) The async reclaim task is flushing stuff, entering the flush states one by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current delayed iputs. After running the delayed iputs and before calling btrfs_wait_on_delayed_iputs(), one or more ordered extents complete, and btrfs_add_delayed_iput() is called for each one through btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results in bumping fs_info->nr_delayed_iputs from 0 to some positive value. So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting for fs_info->nr_delayed_iputs to become 0; 5) The current transaction is committed by the transaction kthread, we then start unpinning extents and end up calling btrfs_try_granting_tickets() through unpin_extent_range(), since we released some space. This results in satisfying the ticket created by the cleaner kthread at step 1, waking up the cleaner kthread; 6) At close_ctree() we ask the cleaner kthread to park; 7) The cleaner kthread starts the transaction, deletes the unused block group, and then calls kthread_should_park(), which returns true, so it parks. And at this point we have the delayed iputs added by the completion of the ordered extents still pending; 8) Then later at close_ctree(), when we call: cancel_work_sync(&fs_info->async_reclaim_work); We hang forever, since the cleaner was parked and no one else can run delayed iputs after that, while the reclaim task is waiting for the remaining delayed iputs to be completed. Fix this by waiting for all ordered extents to complete and running the delayed iputs before attempting to stop the async reclaim tasks. Note that we can not wait for ordered extents with btrfs_wait_ordered_roots() (or other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE flag to be set on an ordered extent, but the delayed iput is added after that, when doing the final btrfs_put_ordered_extent(). So instead wait for the work queues used for executing ordered extent completion to be empty, which works because we do the final put on an ordered extent at btrfs_finish_ordered_io() (while we are in the unmount context). Fixes: d6fd0ae25c6495 ("Btrfs: fix missing delayed iputs on unmount") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-28ext4: make directory inode spreading reflect flexbg sizeJan Kara1-1/+1
commit 613c5a85898d1cd44e68f28d65eccf64a8ace9cf upstream. Currently the Orlov inode allocator searches for free inodes for a directory only in flex block groups with at most inodes_per_group/16 more directory inodes than average per flex block group. However with growing size of flex block group this becomes unnecessarily strict. Scale allowed difference from average directory count per flex block group with flex block group size as we do with other metrics. Tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Cc: stable@kernel.org Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220908092136.11770-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28ext4: limit the number of retries after discarding preallocations blocksTheodore Ts'o1-1/+3
commit 80fa46d6b9e7b1527bfd2197d75431fd9c382161 upstream. This patch avoids threads live-locking for hours when a large number threads are competing over the last few free extents as they blocks getting added and removed from preallocation pools. From our bug reporter: A reliable way for triggering this has multiple writers continuously write() to files when the filesystem is full, while small amounts of space are freed (e.g. by truncating a large file -1MiB at a time). In the local filesystem, this can be done by simply not checking the return code of write (0) and/or the error (ENOSPACE) that is set. Over NFS with an async mount, even clients with proper error checking will behave this way since the linux NFS client implementation will not propagate the server errors [the write syscalls immediately return success] until the file handle is closed. This leads to a situation where NFS clients send a continuous stream of WRITE rpcs which result in ERRNOSPACE -- but since the client isn't seeing this, the stream of writes continues at maximum network speed. When some space does appear, multiple writers will all attempt to claim it for their current write. For NFS, we may see dozens to hundreds of threads that do this. The real-world scenario of this is database backup tooling (in particular, github.com/mdkent/percona-xtrabackup) which may write large files (>1TiB) to NFS for safe keeping. Some temporary files are written, rewound, and read back -- all before closing the file handle (the temp file is actually unlinked, to trigger automatic deletion on close/crash.) An application like this operating on an async NFS mount will not see an error code until TiB have been written/read. The lockup was observed when running this database backup on large filesystems (64 TiB in this case) with a high number of block groups and no free space. Fragmentation is generally not a factor in this filesystem (~thousands of large files, mostly contiguous except for the parts written while the filesystem is at capacity.) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0Luís Henriques1-0/+4
commit 29a5b8a137ac8eb410cc823653a29ac0e7b7e1b0 upstream. When walking through an inode extents, the ext4_ext_binsearch_idx() function assumes that the extent header has been previously validated. However, there are no checks that verify that the number of entries (eh->eh_entries) is non-zero when depth is > 0. And this will lead to problems because the EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this: [ 135.245946] ------------[ cut here ]------------ [ 135.247579] kernel BUG at fs/ext4/extents.c:2258! [ 135.249045] invalid opcode: 0000 [#1] PREEMPT SMP [ 135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4 [ 135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014 [ 135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0 [ 135.256475] Code: [ 135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246 [ 135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023 [ 135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c [ 135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c [ 135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024 [ 135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000 [ 135.272394] FS: 00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 [ 135.274510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0 [ 135.277952] Call Trace: [ 135.278635] <TASK> [ 135.279247] ? preempt_count_add+0x6d/0xa0 [ 135.280358] ? percpu_counter_add_batch+0x55/0xb0 [ 135.281612] ? _raw_read_unlock+0x18/0x30 [ 135.282704] ext4_map_blocks+0x294/0x5a0 [ 135.283745] ? xa_load+0x6f/0xa0 [ 135.284562] ext4_mpage_readpages+0x3d6/0x770 [ 135.285646] read_pages+0x67/0x1d0 [ 135.286492] ? folio_add_lru+0x51/0x80 [ 135.287441] page_cache_ra_unbounded+0x124/0x170 [ 135.288510] filemap_get_pages+0x23d/0x5a0 [ 135.289457] ? path_openat+0xa72/0xdd0 [ 135.290332] filemap_read+0xbf/0x300 [ 135.291158] ? _raw_spin_lock_irqsave+0x17/0x40 [ 135.292192] new_sync_read+0x103/0x170 [ 135.293014] vfs_read+0x15d/0x180 [ 135.293745] ksys_read+0xa1/0xe0 [ 135.294461] do_syscall_64+0x3c/0x80 [ 135.295284] entry_SYSCALL_64_after_hwframe+0x46/0xb0 This patch simply adds an extra check in __ext4_ext_check(), verifying that eh_entries is not 0 when eh_depth is > 0. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941 Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283 Cc: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28cifs: always initialize struct msghdr smb_msg completelyStefan Metzmacher2-13/+4
[ Upstream commit bedc8f76b3539ac4f952114b316bcc2251e808ce ] So far we were just lucky because the uninitialized members of struct msghdr are not used by default on a SOCK_STREAM tcp socket. But as new things like msg_ubuf and sg_from_iter where added recently, we should play on the safe side and avoid potention problems in future. Signed-off-by: Stefan Metzmacher <metze@samba.org> Cc: stable@vger.kernel.org Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-28cifs: use discard iterator to discard unneeded network data more efficientlyDavid Howells3-3/+22
[ Upstream commit cf0604a686b11175d8beae60281c4ccc95aaa5c2 ] The iterator, ITER_DISCARD, that can only be used in READ mode and just discards any data copied to it, was added to allow a network filesystem to discard any unwanted data sent by a server. Convert cifs_discard_from_socket() to use this. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Stable-dep-of: bedc8f76b353 ("cifs: always initialize struct msghdr smb_msg completely") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-28xfs: validate inode fork size against fork formatDave Chinner1-9/+26
commit 1eb70f54c445fcbb25817841e774adb3d912f3e8 upstream. [backport for 5.10.y] xfs_repair catches fork size/format mismatches, but the in-kernel verifier doesn't, leading to null pointer failures when attempting to perform operations on the fork. This can occur in the xfs_dir_is_empty() where the in-memory fork format does not match the size and so the fork data pointer is accessed incorrectly. Note: this causes new failures in xfs/348 which is testing mode vs ftype mismatches. We now detect a regular file that has been changed to a directory or symlink mode as being corrupt because the data fork is for a symlink or directory should be in local form when there are only 3 bytes of data in the data fork. Hence the inode verify for the regular file now fires w/ -EFSCORRUPTED because the inode fork format does not match the format the corrupted mode says it should be in. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28xfs: reorder iunlink remove operation in xfs_ifreeDave Chinner1-10/+12
commit 9a5280b312e2e7898b6397b2ca3cfd03f67d7be1 upstream. [backport for 5.10.y] The O_TMPFILE creation implementation creates a specific order of operations for inode allocation/freeing and unlinked list modification. Currently both are serialised by the AGI, so the order doesn't strictly matter as long as the are both in the same transaction. However, if we want to move the unlinked list insertions largely out from under the AGI lock, then we have to be concerned about the order in which we do unlinked list modification operations. O_TMPFILE creation tells us this order is inode allocation/free, then unlinked list modification. Change xfs_ifree() to use this same ordering on unlinked list removal. This way we always guarantee that when we enter the iunlinked list removal code from this path, we already have the AGI locked and we don't have to worry about lock nesting AGI reads inside unlink list locks because it's already locked and attached to the transaction. We can do this safely as the inode freeing and unlinked list removal are done in the same transaction and hence are atomic operations with respect to log recovery. Reported-by: Frank Hofmann <fhofmann@cloudflare.com> Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28xfs: fix up non-directory creation in SGID directoriesChristoph Hellwig1-7/+7
commit 01ea173e103edd5ec41acec65b9261b87e123fc2 upstream. XFS always inherits the SGID bit if it is set on the parent inode, while the generic inode_init_owner does not do this in a few cases where it can create a possible security problem, see commit 0fa3ecd87848 ("Fix up non-directory creation in SGID directories") for details. Switch XFS to use the generic helper for the normal path to fix this, just keeping the simple field inheritance open coded for the case of the non-sgid case with the bsdgrpid mount option. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-23afs: Return -EAGAIN, not -EREMOTEIO, when a file already lockedDavid Howells1-0/+1
[ Upstream commit 0066f1b0e27556381402db3ff31f85d2a2265858 ] When trying to get a file lock on an AFS file, the server may return UAEAGAIN to indicate that the lock is already held. This is currently translated by the default path to -EREMOTEIO. Translate it instead to -EAGAIN so that we know we can retry it. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey E Altman <jaltman@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/166075761334.3533338.2591992675160918098.stgit@warthog.procyon.org.uk/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-23cifs: don't send down the destination address to sendmsg for a SOCK_STREAMStefan Metzmacher1-2/+2
commit 17d3df38dc5f4cec9b0ac6eb79c1859b6e2693a4 upstream. This is ignored anyway by the tcp layer. Signed-off-by: Stefan Metzmacher <metze@samba.org> Cc: stable@vger.kernel.org Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-23cifs: revalidate mapping when doing direct writesRonnie Sahlberg1-0/+3
commit 7500a99281dfed2d4a84771c933bcb9e17af279b upstream. Kernel bugzilla: 216301 When doing direct writes we need to also invalidate the mapping in case we have a cached copy of the affected page(s) in memory or else subsequent reads of the data might return the old/stale content before we wrote an update to the server. Cc: stable@vger.kernel.org Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-23NFSv4: Turn off open-by-filehandle and NFS re-export for NFSv4.0Trond Myklebust1-9/+18
[ Upstream commit 2a9d683b48c8a87e61a4215792d44c90bcbbb536 ] The NFSv4.0 protocol only supports open() by name. It cannot therefore be used with open_by_handle() and friends, nor can it be re-exported by knfsd. Reported-by: Chuck Lever III <chuck.lever@oracle.com> Fixes: 20fa19027286 ("nfs: add export operations") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-20tracefs: Only clobber mode/uid/gid on remount if askedBrian Norris1-8/+23
[ Upstream commit 47311db8e8f33011d90dee76b39c8886120cdda4 ] Users may have explicitly configured their tracefs permissions; we shouldn't overwrite those just because a second mount appeared. Only clobber if the options were provided at mount time. Note: the previous behavior was especially surprising in the presence of automounted /sys/kernel/debug/tracing/. Existing behavior: ## Pre-existing status: tracefs is 0755. # stat -c '%A' /sys/kernel/tracing/ drwxr-xr-x ## (Re)trigger the automount. # umount /sys/kernel/debug/tracing # stat -c '%A' /sys/kernel/debug/tracing/. drwx------ ## Unexpected: the automount changed mode for other mount instances. # stat -c '%A' /sys/kernel/tracing/ drwx------ New behavior (after this change): ## Pre-existing status: tracefs is 0755. # stat -c '%A' /sys/kernel/tracing/ drwxr-xr-x ## (Re)trigger the automount. # umount /sys/kernel/debug/tracing # stat -c '%A' /sys/kernel/debug/tracing/. drwxr-xr-x ## Expected: the automount does not change other mount instances. # stat -c '%A' /sys/kernel/tracing/ drwxr-xr-x Link: https://lkml.kernel.org/r/20220826174353.2.Iab6e5ea57963d6deca5311b27fb7226790d44406@changeid Cc: stable@vger.kernel.org Fixes: 4282d60689d4f ("tracefs: Add new tracefs file system") Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-15afs: Use the operation issue time instead of the reply time for callbacksDavid Howells5-12/+5
[ Upstream commit 7903192c4b4a82d792cb0dc5e2779a2efe60d45b ] rxrpc and kafs between them try to use the receive timestamp on the first data packet (ie. the one with sequence number 1) as a base from which to calculate the time at which callback promise and lock expiration occurs. However, we don't know how long it took for the server to send us the reply from it having completed the basic part of the operation - it might then, for instance, have to send a bunch of a callback breaks, depending on the particular operation. Fix this by using the time at which the operation is issued on the client as a base instead. That should never be longer than the server's idea of the expiry time. Fixes: 781070551c26 ("afs: Fix calculation of callback expiry time") Fixes: 2070a3e44962 ("rxrpc: Allow the reply time to be obtained on a client call") Suggested-by: Jeffrey E Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>