summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-08-15jbd2: add fast_commit space checkZhang Yi1-4/+12
If JBD2_FEATURE_INCOMPAT_FAST_COMMIT bit is set, it means the journal have fast commit records need to recover, so the fast commit size should not be too large, and the leftover normal journal size should never less than JBD2_MIN_JOURNAL_BLOCKS. If it happens, the journal->j_last is likely to be wrong and will probably lead to incorrect journal recovery. So add a check into the journal_check_superblock(), and drop the pointless check when initializing the fastcommit parameters. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: cleanup load_superblock()Zhang Yi1-50/+35
Rename load_superblock() to journal_load_superblock(), move getting and reading superblock from journal_init_common() and journal_get_superblock() to this function, and also rename journal_get_superblock() to journal_check_superblock(), make it a pure check helper to check superblock validity from disk. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: open code jbd2_verify_csum_type() helperZhang Yi1-13/+5
jbd2_verify_csum_type() helper check checksum type in the superblock for v2 or v3 checksum feature, it always return true if these features are not enabled, and it has only one user, so open code it is more clear. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: checking valid features early in journal_get_superblock()Zhang Yi1-15/+15
journal_get_superblock() is used to check validity of the jounal supberblock, so move the features checks from jbd2_journal_load() to journal_get_superblock(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: don't load superblock in jbd2_journal_check_used_features()Zhang Yi1-5/+0
Since load_superblock() has been moved to journal_init_common(), the in-memory superblock structure is initialized and contains valid data once the file system has a journal_t object, so it's safe to access it, let's drop the call to journal_get_superblock() from jbd2_journal_check_used_features() and also drop the setting/clearing of the veirfy bit of the superblock buffer. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: move load_superblock() into journal_init_common()Zhang Yi1-11/+5
Move the call to load_superblock() from jbd2_journal_load() and jbd2_journal_wipe() early into journal_init_common(), the journal superblock gets read and the in-memory journal_t structure gets initialised after calling jbd2_journal_init_{dev,inode}, it's safe to do following initialization according to it. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: move load_superblock() dependent functionsZhang Yi1-169/+168
Move load_superblock() declaration and the functions it calls before journal_init_common(). This is a preparation for moving a call to load_superblock() from jbd2_journal_load() and jbd2_journal_wipe() to journal_init_common(). No functional changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-05ext4: don't use CR_BEST_AVAIL_LEN for non-regular filesRitesh Harjani1-1/+12
Using CR_BEST_AVAIL_LEN only make sense for regular files, as for non-regular files we never normalize the allocation request length i.e. goal len is same as original length (ac_g_ex.fe_len == ac_o_ex.fe_len). Hence there is no scope of trimming the goal length to make it satisfy original request len. Thus this patch avoids using CR_BEST_AVAIL_LEN criteria for non-regular files request. Cc: stable@kernel.org Fixes: 33122aa930f1 ("ext4: Add allocation criteria 1.5 (CR1_5)") Reported-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Tested-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/2a694c748ff8b8c4b416995a24f06f07b55047a8.1689516047.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-05ext4: fix memory leaks in ext4_fname_{setup_filename,prepare_lookup}Luís Henriques1-0/+4
If the filename casefolding fails, we'll be leaking memory from the fscrypt_name struct, namely from the 'crypto_buf.name' member. Make sure we free it in the error path on both ext4_fname_setup_filename() and ext4_fname_prepare_lookup() functions. Cc: stable@kernel.org Fixes: 1ae98e295fa2 ("ext4: optimize match for casefolded encrypted dirs") Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230803091713.13239-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-04jbd2: remove unused function '__cp_buffer_busy'Yang Li1-12/+0
The code calling function '__cp_buffer_busy' has been removed, so the function should also be removed. silence the warning: fs/jbd2/checkpoint.c:48:20: warning: unused function '__cp_buffer_busy' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5518 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230714025528.564988-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-04jbd2: check 'jh->b_transaction' before removing it from checkpointZhihao Cheng1-0/+2
Following process will corrupt ext4 image: Step 1: jbd2_journal_commit_transaction __jbd2_journal_insert_checkpoint(jh, commit_transaction) // Put jh into trans1->t_checkpoint_list journal->j_checkpoint_transactions = commit_transaction // Put trans1 into journal->j_checkpoint_transactions Step 2: do_get_write_access test_clear_buffer_dirty(bh) // clear buffer dirty,set jbd dirty __jbd2_journal_file_buffer(jh, transaction) // jh belongs to trans2 Step 3: drop_cache journal_shrink_one_cp_list jbd2_journal_try_remove_checkpoint if (!trylock_buffer(bh)) // lock bh, true if (buffer_dirty(bh)) // buffer is not dirty __jbd2_journal_remove_checkpoint(jh) // remove jh from trans1->t_checkpoint_list Step 4: jbd2_log_do_checkpoint trans1 = journal->j_checkpoint_transactions // jh is not in trans1->t_checkpoint_list jbd2_cleanup_journal_tail(journal) // trans1 is done Step 5: Power cut, trans2 is not committed, jh is lost in next mounting. Fix it by checking 'jh->b_transaction' before remove it from checkpoint. Cc: stable@kernel.org Fixes: 46f881b5b175 ("jbd2: fix a race when checking checkpoint buffer busy") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230714025528.564988-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-04jbd2: fix checkpoint cleanup performance regressionZhang Yi1-6/+14
journal_clean_one_cp_list() has been merged into journal_shrink_one_cp_list(), but do chekpoint buffer cleanup from the committing process is just a best effort, it should stop scan once it meet a busy buffer, or else it will cause a lot of invalid buffer scan and checks. We catch a performance regression when doing fs_mark tests below. Test cmd: ./fs_mark -d scratch -s 1024 -n 10000 -t 1 -D 100 -N 100 Before merging checkpoint buffer cleanup: FSUse% Count Size Files/sec App Overhead 95 10000 1024 8304.9 49033 After merging checkpoint buffer cleanup: FSUse% Count Size Files/sec App Overhead 95 10000 1024 7649.0 50012 FSUse% Count Size Files/sec App Overhead 95 10000 1024 2107.1 50871 After merging checkpoint buffer cleanup, the total loop count in journal_shrink_one_cp_list() could be up to 6,261,600+ (50,000+ ~ 100,000+ in general), most of them are invalid. This patch fix it through passing 'shrink_type' into journal_shrink_one_cp_list() and add a new 'SHRINK_BUSY_STOP' to indicate it should stop once meet a busy buffer. After fix, the loop count descending back to 10,000+. After this fix: FSUse% Count Size Files/sec App Overhead 95 10000 1024 8558.4 49109 Cc: stable@kernel.org Fixes: b98dba273a0e ("jbd2: remove journal_clean_one_cp_list()") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230714025528.564988-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: correct some stale comment of criteriaKemeng Shi1-4/+4
We named criteria with CR_XXX, correct stale comment to criteria with raw number. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-11-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: return found group directly in ext4_mb_choose_next_group_best_availKemeng Shi1-10/+8
Return good group when it's found in loop to remove futher check if good group is found after loop. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-10-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: return found group directly in ext4_mb_choose_next_group_goal_fastKemeng Shi1-8/+6
Return good group when it's found in loop to remove futher check if good group is found after loop. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: remove unused ext4_{set}/{clear}_bit_atomicKemeng Shi1-2/+0
Remove ext4_set_bit_atomic and ext4_clear_bit_atomic which are defined but not used. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-8-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: replace the traditional ternary conditional operator with with max()/min()Kemeng Shi1-4/+2
Replace the traditional ternary conditional operator with with max()/min() Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: remove unnecessary return for void functionKemeng Shi1-9/+1
The return at end of void function is unnecessary, just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: use is_power_of_2 helper in ext4_mb_regular_allocatorKemeng Shi1-4/+1
Use intuitive is_power_of_2 helper in ext4_mb_regular_allocator. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: return found group directly in ext4_mb_choose_next_group_p2_alignedKemeng Shi1-14/+7
Return good group when it's found in loop to remove unnecessary NULL initialization of grp and futher check if good group is found after loop. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: avoid potential data overflow in next_linear_groupKemeng Shi1-2/+3
ngroups is ext4_group_t (unsigned int) while next_linear_group treat it in int. If ngroups is bigger than max number described by int, it will be treat as a negative number. Then "return group + 1 >= ngroups ? 0 : group + 1;" may keep returning 0. Switch int to ext4_group_t in next_linear_group to fix the overflow. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: correct grp validation in ext4_mb_good_groupKemeng Shi1-1/+1
Group corruption check will access memory of grp and will trigger kernel crash if grp is NULL. So do NULL check before corruption check. Fixes: 5354b2af3406 ("ext4: allow ext4_get_group_info() to fail") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: replace CR_FAST macro with inline function for readabilityOjaswin Mujoo2-7/+13
Replace CR_FAST with ext4_mb_cr_expensive() inline function for better readability. This function returns true if the criteria is one of the expensive/slower ones where lots of disk IO/prefetching is acceptable. No functional changes are intended in this patch. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230630085927.140137-1-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30ext4: replace read-only check for shutdown check in mmp codeJan Kara1-1/+1
The multi-mount protection kthread checks for read-only filesystem and aborts in that case. The remount code actually handles stopping of the kthread on remount so the only purpose of the check is in case of emergency remount read-only. Replace the check for read-only filesystem with a check for shutdown filesystem as running MMP on such is risky anyway and it makes ordering of things during remount simpler. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-11-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30ext4: drop read-only check from ext4_force_commit()Jan Kara1-7/+1
JBD2 code will quickly return without doing anything when there's nothing to commit so there's no point in the read-only check in ext4_force_commit(). Just drop it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-10-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30ext4: drop read-only check in ext4_write_inode()Jan Kara1-2/+1
We should not have dirty inodes on read-only filesystem. Also silently bailing without writing anything would be a problem when we enable quotas during remount while the filesystem is read-only. So drop the read-only check. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-9-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30ext4: drop read-only check in ext4_init_inode_table()Jan Kara1-6/+0
We better should not be initializing inode tables on read-only filesystem. The following transaction start will warn us and make the function bail anyway so drop the pointless check. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-8-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30ext4: warn on read-only filesystem in ext4_journal_check_start()Jan Kara1-1/+2
Now that filesystem abort marks the filesystem as shutdown, we shouldn't be ever hitting the sb_rdonly() check in ext4_journal_check_start(). Since this is a suitable place for catching all sorts of programming errors, convert the check to WARN_ON instead of dropping it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-7-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30ext4: avoid starting transaction on read-only fs in ext4_quota_off()Jan Kara1-0/+7
When the filesystem gets first remounted read-only and then unmounted, ext4_quota_off() will try to start a transaction (and fail) on read-only filesystem to cleanup inode flags for legacy quota files. Just bail before trying to start a transaction instead since that is going to issue a warning. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-6-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30ext4: drop EXT4_MF_FS_ABORTED flagJan Kara5-14/+10
EXT4_MF_FS_ABORTED flag has practically the same intent as EXT4_FLAGS_SHUTDOWN flag. The shutdown flag is checked in many more places than the aborted flag which is mostly the historical artifact where we were relying on SB_RDONLY checks instead of the aborted flag checks. There are only three places - ext4_sync_file(), __ext4_remount(), and mballoc debug code - which check aborted flag and not shutdown flag and this is arguably a bug. Avoid these inconsistencies by removing EXT4_MF_FS_ABORTED flag and using EXT4_FLAGS_SHUTDOWN everywhere. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30ext4: make 'abort' mount option handling standardJan Kara2-14/+3
'abort' mount option is the only mount option that has special handling and sets a bit in sbi->s_mount_flags. There is not strong reason for that so just simplify the code and make 'abort' set a bit in sbi->s_mount_opt2 as any other mount option. This simplifies the code and will allow us to drop EXT4_MF_FS_ABORTED completely in the following patch. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30ext4: make ext4_forced_shutdown() take struct super_blockJan Kara12-39/+38
Currently ext4_forced_shutdown() takes struct ext4_sb_info but most callers need to get it from struct super_block anyway. So just pass in struct super_block to save all callers from some boilerplate code. No functional changes. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30ext4: use sb_rdonly() helper for checking read-only flagJan Kara1-2/+2
sb_rdonly() helper instead of directly checking sb->s_flags. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30ext4: remove pointless sb_rdonly() checks from freezing codeJan Kara1-7/+2
ext4_freeze() and ext4_unfreeze() checks for sb_rdonly(). However this check is pointless as VFS already checks for read-only filesystem before calling filesystem specific methods. Remove the pointless checks. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-30fs: jbd2: fix an incorrect warn logGuoqing Cai1-8/+10
In jbd2_journal_load(), when journal_reset fails, it prints an incorrect warn log. Fix this by changing the goto statement to return statement. Also, return actual error code from jbd2_journal_recover() and journal_reset(). Signed-off-by: Guoqing Cai <u202112087@hust.edu.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230413095740.2222066-1-u202112087@hust.edu.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: avoid overlapping preallocations due to overflowBaokun Li1-11/+10
Let's say we want to allocate 2 blocks starting from 4294966386, after predicting the file size, start is aligned to 4294965248, len is changed to 2048, then end = start + size = 0x100000000. Since end is of type ext4_lblk_t, i.e. uint, end is truncated to 0. This causes (pa->pa_lstart >= end) to always hold when checking if the current extent to be allocated crosses already preallocated blocks, so the resulting ac_g_ex may cross already preallocated blocks. Hence we convert the end type to loff_t and use pa_logical_end() to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: fix BUG in ext4_mb_new_inode_pa() due to overflowBaokun Li1-17/+14
When we calculate the end position of ext4_free_extent, this position may be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not the first case of adjusting the best extent, that is, new_bex_end > 0, the following BUG_ON will be triggered: ========================================================= kernel BUG at fs/ext4/mballoc.c:5116! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279 RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430 Call Trace: <TASK> ext4_mb_use_best_found+0x203/0x2f0 ext4_mb_try_best_found+0x163/0x240 ext4_mb_regular_allocator+0x158/0x1550 ext4_mb_new_blocks+0x86a/0xe10 ext4_ext_map_blocks+0xb0c/0x13a0 ext4_map_blocks+0x2cd/0x8f0 ext4_iomap_begin+0x27b/0x400 iomap_iter+0x222/0x3d0 __iomap_dio_rw+0x243/0xcb0 iomap_dio_rw+0x16/0x80 ========================================================= A simple reproducer demonstrating the problem: mkfs.ext4 -F /dev/sda -b 4096 100M mount /dev/sda /tmp/test fallocate -l1M /tmp/test/tmp fallocate -l10M /tmp/test/file fallocate -i -o 1M -l16777203M /tmp/test/file fsstress -d /tmp/test -l 0 -n 100000 -p 8 & sleep 10 && killall -9 fsstress rm -f /tmp/test/tmp xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192" We simply refactor the logic for adjusting the best extent by adding a temporary ext4_free_extent ex and use extent_logical_end() to avoid overflow, which also simplifies the code. Cc: stable@kernel.org # 6.4 Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: add two helper functions extent_logical_end() and pa_logical_end()Baokun Li2-6/+17
When we use lstart + len to calculate the end of free extent or prealloc space, it may exceed the maximum value of 4294967295(0xffffffff) supported by ext4_lblk_t and cause overflow, which may lead to various problems. Therefore, we add two helper functions, extent_logical_end() and pa_logical_end(), to limit the type of end to loff_t, and also convert lstart to loff_t for calculation to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-23Merge tag 'ext4_for_linus-6.5-rc3' of ↵Linus Torvalds5-249/+257
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Bug and regression fixes for 6.5-rc3 for ext4's mballoc and jbd2's checkpoint code" * tag 'ext4_for_linus-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix rbtree traversal bug in ext4_mb_use_preallocated ext4: fix off by one issue in ext4_mb_choose_next_group_best_avail() ext4: correct inline offset when handling xattrs in inode body jbd2: remove __journal_try_to_free_buffer() jbd2: fix a race when checking checkpoint buffer busy jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint jbd2: remove journal_clean_one_cp_list() jbd2: remove t_checkpoint_io_list jbd2: recheck chechpointing non-dirty buffer
2023-07-23Merge tag '6.5-rc2-smb3-client-fixes-ver2' of ↵Linus Torvalds2-6/+15
git://git.samba.org/sfrench/cifs-2.6 Pull smb client fix from Steve French: "Add minor debugging improvement. The change improves ability to read a network trace to debug problems on encrypted connections which are very common (e.g. using wireshark or tcpdump). That works today with tools like 'smbinfo keys /mnt/file' but requires passing in a filename on the mount (see e.g. [1]), but it often makes more sense to just pass in the mount point path (ie a directory not a filename). So this fix was needed to debug some types of problems (an obvious example is on an encrypted connection failing operations on an empty share or with no files in the root of the directory) - so you can simply pass in the 'smbinfo keys <mntpoint>' and get the information that wireshark needs" Link: https://wiki.samba.org/index.php/Wireshark_Decryption [1] * tag '6.5-rc2-smb3-client-fixes-ver2' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal module version number for cifs.ko cifs: allow dumping keys for directories too
2023-07-23ext4: fix rbtree traversal bug in ext4_mb_use_preallocatedOjaswin Mujoo1-27/+131
During allocations, while looking for preallocations(PA) in the per inode rbtree, we can't do a direct traversal of the tree because ext4_mb_discard_group_preallocation() can paralelly mark the pa deleted and that can cause direct traversal to skip some entries. This was leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy our request and ultimately tried to create a new PA that would overlap with the missed one. To makes sure we handle that case while still keeping the performance of the rbtree, we make use of the fact that the only pa that could possibly overlap the original goal start is the one that satisfies the below conditions: 1. It must have it's logical start immediately to the left of (ie less than) original logical start. 2. It must not be deleted To find this pa we use the following traversal method: 1. Descend into the rbtree normally to find the immediate neighboring PA. Here we keep descending irrespective of if the PA is deleted or if it overlaps with our request etc. The goal is to find an immediately adjacent PA. 2. If the found PA is on right of original goal, use rb_prev() to find the left adjacent PA. 3. Check if this PA is deleted and keep moving left with rb_prev() until a non deleted PA is found. 4. This is the PA we are looking for. Now we can check if it can satisfy the original request and proceed accordingly. This approach also takes care of having deleted PAs in the tree. (While we are at it, also fix a possible overflow bug in calculating the end of a PA) [1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/ Cc: stable@kernel.org # 6.4 Fixes: 3872778664e3 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Ritesh Harjani (IBM) ritesh.list@gmail.com Tested-by: Ritesh Harjani (IBM) ritesh.list@gmail.com Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.1690045963.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-23ext4: fix off by one issue in ext4_mb_choose_next_group_best_avail()Ojaswin Mujoo1-5/+9
In ext4_mb_choose_next_group_best_avail(), we want the start order to be 1 less than goal length and the min_order to be, at max, 1 more than the original length. This commit fixes an off by one issue that arose due to the fact that 1 << fls(n) > (n). After all the processing: order = 1 order below goal len min_order = maximum of the three:- - order - trim_order - 1 order below B2C(s_stripe) - 1 order above original len Cc: stable@kernel.org Fixes: 33122aa930 ("ext4: Add allocation criteria 1.5 (CR1_5)") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230609103403.112807-1-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-23ext4: correct inline offset when handling xattrs in inode bodyEric Whitney1-0/+14
When run on a file system where the inline_data feature has been enabled, xfstests generic/269, generic/270, and generic/476 cause ext4 to emit error messages indicating that inline directory entries are corrupted. This occurs because the inline offset used to locate inline directory entries in the inode body is not updated when an xattr in that shared region is deleted and the region is shifted in memory to recover the space it occupied. If the deleted xattr precedes the system.data attribute, which points to the inline directory entries, that attribute will be moved further up in the region. The inline offset continues to point to whatever is located in system.data's former location, with unfortunate effects when used to access directory entries or (presumably) inline data in the inode body. Cc: stable@kernel.org Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20230522181520.1570360-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-22cifs: update internal module version number for cifs.koSteve French1-2/+2
From 2.43 to 2.44 Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-22cifs: allow dumping keys for directories tooShyam Prasad N1-4/+13
Dumping the enc/dec keys is a session wide operation. And it should not matter if the ioctl was run on a regular file or a directory. Currently, we obtain the tcon pointer from the cifs file handle. But since there's no dir open call in cifs, this is not populated for dirs. This change allows dumping of session keys using ioctl even for directories. To do this, we'll now get the tcon pointer from the superblock, and not from the file handle. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-20Merge tag 'iomap-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-2/+2
Pull iomap fix from Darrick Wong: "Fix partial write regression. It turns out that fstests doesn't have any test coverage for short writes, but LTP does. Fortunately, this was caught right after -rc1 was tagged. Summary: - Fix a bug wherein a failed write could clobber short write status" * tag 'iomap-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: micro optimize the ki_pos assignment in iomap_file_buffered_write iomap: fix a regression for partial write errors
2023-07-20Merge tag 'xfs-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds3-13/+71
Pull xfs fixes from Darrick Wong: "Flexarray declaration conversions. This probably should've been done with the merge window open, but I was not aware that the UBSAN knob would be getting turned up for 6.5, and the fstests failures due to the kernel warnings are getting in the way of testing. Summary: - Convert all the array[1] declarations into the accepted flex array[] declarations so that UBSAN and friends will not get confused" * tag 'xfs-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: convert flex-array declarations in xfs attr shortform objects xfs: convert flex-array declarations in xfs attr leaf blocks xfs: convert flex-array declarations in struct xfs_attrlist*
2023-07-20Merge tag 'for-6.5-rc2-tag' of ↵Linus Torvalds6-46/+79
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Stable fixes: - fix race between balance and cancel/pause - various iput() fixes - fix use-after-free of new block group that became unused - fix warning when putting transaction with qgroups enabled after abort - fix crash in subpage mode when page could be released between map and map read - when scrubbing raid56 verify the P/Q stripes unconditionally - fix minor memory leak in zoned mode when a block group with an unexpected superblock is found Regression fixes: - fix ordered extent split error handling when submitting direct IO - user irq-safe locking when adding delayed iputs" * tag 'for-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix warning when putting transaction with qgroups enabled after abort btrfs: fix ordered extent split error handling in btrfs_dio_submit_io btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand btrfs: raid56: always verify the P/Q contents for scrub btrfs: use irq safe locking when running and adding delayed iputs btrfs: fix iput() on error pointer after error during orphan cleanup btrfs: fix double iput() on inode after an error during orphan cleanup btrfs: zoned: fix memory leak after finding block group with super blocks btrfs: fix use-after-free of new block group that became unused btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block btrfs: fix race between balance and cancel/pause
2023-07-19Merge tag 'fuse-update-6.5' of ↵Linus Torvalds3-13/+20
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Small but important fixes and a trivial cleanup" * tag 'fuse-update-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: ioctl: translate ENOSYS in outarg fuse: revalidate: don't invalidate if interrupted fuse: Apply flags2 only when userspace set the FUSE_INIT_EXT fuse: remove duplicate check for nodeid fuse: add feature flag for expire-only
2023-07-18btrfs: fix warning when putting transaction with qgroups enabled after abortFilipe Manana1-0/+1
If we have a transaction abort with qgroups enabled we get a warning triggered when doing the final put on the transaction, like this: [552.6789] ------------[ cut here ]------------ [552.6815] WARNING: CPU: 4 PID: 81745 at fs/btrfs/transaction.c:144 btrfs_put_transaction+0x123/0x130 [btrfs] [552.6817] Modules linked in: btrfs blake2b_generic xor (...) [552.6819] CPU: 4 PID: 81745 Comm: btrfs-transacti Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [552.6819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [552.6819] RIP: 0010:btrfs_put_transaction+0x123/0x130 [btrfs] [552.6821] Code: bd a0 01 00 (...) [552.6821] RSP: 0018:ffffa168c0527e28 EFLAGS: 00010286 [552.6821] RAX: ffff936042caed00 RBX: ffff93604a3eb448 RCX: 0000000000000000 [552.6821] RDX: ffff93606421b028 RSI: ffffffff92ff0878 RDI: ffff93606421b010 [552.6821] RBP: ffff93606421b000 R08: 0000000000000000 R09: ffffa168c0d07c20 [552.6821] R10: 0000000000000000 R11: ffff93608dc52950 R12: ffffa168c0527e70 [552.6821] R13: ffff93606421b000 R14: ffff93604a3eb420 R15: ffff93606421b028 [552.6821] FS: 0000000000000000(0000) GS:ffff93675fb00000(0000) knlGS:0000000000000000 [552.6821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [552.6821] CR2: 0000558ad262b000 CR3: 000000014feda005 CR4: 0000000000370ee0 [552.6822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [552.6822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [552.6822] Call Trace: [552.6822] <TASK> [552.6822] ? __warn+0x80/0x130 [552.6822] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6824] ? report_bug+0x1f4/0x200 [552.6824] ? handle_bug+0x42/0x70 [552.6824] ? exc_invalid_op+0x14/0x70 [552.6824] ? asm_exc_invalid_op+0x16/0x20 [552.6824] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6826] btrfs_cleanup_transaction+0xe7/0x5e0 [btrfs] [552.6828] ? _raw_spin_unlock_irqrestore+0x23/0x40 [552.6828] ? try_to_wake_up+0x94/0x5e0 [552.6828] ? __pfx_process_timeout+0x10/0x10 [552.6828] transaction_kthread+0x103/0x1d0 [btrfs] [552.6830] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] [552.6832] kthread+0xee/0x120 [552.6832] ? __pfx_kthread+0x10/0x10 [552.6832] ret_from_fork+0x29/0x50 [552.6832] </TASK> [552.6832] ---[ end trace 0000000000000000 ]--- This corresponds to this line of code: void btrfs_put_transaction(struct btrfs_transaction *transaction) { (...) WARN_ON(!RB_EMPTY_ROOT( &transaction->delayed_refs.dirty_extent_root)); (...) } The warning happens because btrfs_qgroup_destroy_extent_records(), called in the transaction abort path, we free all entries from the rbtree "dirty_extent_root" with rbtree_postorder_for_each_entry_safe(), but we don't actually empty the rbtree - it's still pointing to nodes that were freed. So set the rbtree's root node to NULL to avoid this warning (assign RB_ROOT). Fixes: 81f7eb00ff5b ("btrfs: destroy qgroup extent records on transaction abort") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>