summaryrefslogtreecommitdiff
path: root/fs/ext4/balloc.c
AgeCommit message (Collapse)AuthorFilesLines
2023-08-27ext4: add correct group descriptors and reserved GDT blocks to system zoneWang Jianjian1-4/+11
When setup_system_zone, flex_bg is not initialized so it is always 1. Use a new helper function, ext4_num_base_meta_blocks() which does not depend on sbi->s_log_groups_per_flex being initialized. [ Squashed two patches in the Link URL's below together into a single commit, which is simpler to review/understand. Also fix checkpatch warnings. --TYT ] Cc: stable@kernel.org Signed-off-by: Wang Jianjian <wangjianjian0@foxmail.com> Link: https://lore.kernel.org/r/tencent_21AF0D446A9916ED5C51492CC6C9A0A77B05@qq.com Link: https://lore.kernel.org/r/tencent_D744D1450CC169AEA77FCF0A64719909ED05@qq.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15ext4: drop the call to ext4_error() from ext4_get_group_info()Fabio M. De Francesco1-11/+9
A recent patch added a call to ext4_error() which is problematic since some callers of the ext4_get_group_info() function may be holding a spinlock, whereas ext4_error() must never be called in atomic context. This triggered a report from Syzbot: "BUG: sleeping function called from invalid context in ext4_update_super" (see the link below). Therefore, drop the call to ext4_error() from ext4_get_group_info(). In the meantime use eight characters tabs instead of nine characters ones. Reported-by: syzbot+4acc7d910e617b360859@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000070575805fdc6cdb2@google.com/ Fixes: 5354b2af3406 ("ext4: allow ext4_get_group_info() to fail") Suggested-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Link: https://lore.kernel.org/r/20230614100446.14337-1-fmdefrancesco@gmail.com
2023-06-15Revert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"Kemeng Shi1-1/+4
This reverts commit ad3f09be6cfe332be8ff46c78e6ec0f8839107aa. The reverted commit was intended to simpfy the code to get group descriptor block number in non-meta block group by assuming s_gdb_count is block number used for all non-meta block group descriptors. However s_gdb_count is block number used for all meta *and* non-meta group descriptors. So s_gdb_group will be > actual group descriptor block number used for all non-meta block group which should be "total non-meta block group" / "group descriptors per block", e.g. s_first_meta_bg. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230613225025.3859522-1-shikemeng@huaweicloud.com Fixes: ad3f09be6cfe ("ext4: remove unnecessary check in ext4_bg_num_gdb_nometa") Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: allow ext4_get_group_info() to failTheodore Ts'o1-1/+17
Previously, ext4_get_group_info() would treat an invalid group number as BUG(), since in theory it should never happen. However, if a malicious attaker (or fuzzer) modifies the superblock via the block device while it is the file system is mounted, it is possible for s_first_data_block to get set to a very large number. In that case, when calculating the block group of some block number (such as the starting block of a preallocation region), could result in an underflow and very large block group number. Then the BUG_ON check in ext4_get_group_info() would fire, resutling in a denial of service attack that can be triggered by root or someone with write access to the block device. For a quality of implementation perspective, it's best that even if the system administrator does something that they shouldn't, that it will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info() will call ext4_error and return NULL. We also add fallback code in all of the callers of ext4_get_group_info() that it might NULL. Also, since ext4_get_group_info() was already borderline to be an inline function, un-inline it. The results in a next reduction of the compiled text size of ext4 by roughly 2k. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2023-05-08ext4: fix WARNING in mb_find_extentYe Bin1-0/+25
Syzbot found the following issue: EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, O_DIRECT and fast_commit support! EXT4-fs (loop0): orphan cleanup on readonly fs ------------[ cut here ]------------ WARNING: CPU: 1 PID: 5067 at fs/ext4/mballoc.c:1869 mb_find_extent+0x8a1/0xe30 Modules linked in: CPU: 1 PID: 5067 Comm: syz-executor307 Not tainted 6.2.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:mb_find_extent+0x8a1/0xe30 fs/ext4/mballoc.c:1869 RSP: 0018:ffffc90003c9e098 EFLAGS: 00010293 RAX: ffffffff82405731 RBX: 0000000000000041 RCX: ffff8880783457c0 RDX: 0000000000000000 RSI: 0000000000000041 RDI: 0000000000000040 RBP: 0000000000000040 R08: ffffffff82405723 R09: ffffed10053c9402 R10: ffffed10053c9402 R11: 1ffff110053c9401 R12: 0000000000000000 R13: ffffc90003c9e538 R14: dffffc0000000000 R15: ffffc90003c9e2cc FS: 0000555556665300(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056312f6796f8 CR3: 0000000022437000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ext4_mb_complex_scan_group+0x353/0x1100 fs/ext4/mballoc.c:2307 ext4_mb_regular_allocator+0x1533/0x3860 fs/ext4/mballoc.c:2735 ext4_mb_new_blocks+0xddf/0x3db0 fs/ext4/mballoc.c:5605 ext4_ext_map_blocks+0x1868/0x6880 fs/ext4/extents.c:4286 ext4_map_blocks+0xa49/0x1cc0 fs/ext4/inode.c:651 ext4_getblk+0x1b9/0x770 fs/ext4/inode.c:864 ext4_bread+0x2a/0x170 fs/ext4/inode.c:920 ext4_quota_write+0x225/0x570 fs/ext4/super.c:7105 write_blk fs/quota/quota_tree.c:64 [inline] get_free_dqblk+0x34a/0x6d0 fs/quota/quota_tree.c:130 do_insert_tree+0x26b/0x1aa0 fs/quota/quota_tree.c:340 do_insert_tree+0x722/0x1aa0 fs/quota/quota_tree.c:375 do_insert_tree+0x722/0x1aa0 fs/quota/quota_tree.c:375 do_insert_tree+0x722/0x1aa0 fs/quota/quota_tree.c:375 dq_insert_tree fs/quota/quota_tree.c:401 [inline] qtree_write_dquot+0x3b6/0x530 fs/quota/quota_tree.c:420 v2_write_dquot+0x11b/0x190 fs/quota/quota_v2.c:358 dquot_acquire+0x348/0x670 fs/quota/dquot.c:444 ext4_acquire_dquot+0x2dc/0x400 fs/ext4/super.c:6740 dqget+0x999/0xdc0 fs/quota/dquot.c:914 __dquot_initialize+0x3d0/0xcf0 fs/quota/dquot.c:1492 ext4_process_orphan+0x57/0x2d0 fs/ext4/orphan.c:329 ext4_orphan_cleanup+0xb60/0x1340 fs/ext4/orphan.c:474 __ext4_fill_super fs/ext4/super.c:5516 [inline] ext4_fill_super+0x81cd/0x8700 fs/ext4/super.c:5644 get_tree_bdev+0x400/0x620 fs/super.c:1282 vfs_get_tree+0x88/0x270 fs/super.c:1489 do_new_mount+0x289/0xad0 fs/namespace.c:3145 do_mount fs/namespace.c:3488 [inline] __do_sys_mount fs/namespace.c:3697 [inline] __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Add some debug information: mb_find_extent: mb_find_extent block=41, order=0 needed=64 next=0 ex=0/41/1@3735929054 64 64 7 block_bitmap: ff 3f 0c 00 fc 01 00 00 d2 3d 00 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Acctually, blocks per group is 64, but block bitmap indicate at least has 128 blocks. Now, ext4_validate_block_bitmap() didn't check invalid block's bitmap if set. To resolve above issue, add check like fsck "Padding at end of block bitmap is not set". Cc: stable@kernel.org Reported-by: syzbot+68223fe9f6c95ad43bed@syzkaller.appspotmail.com Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116020015.1506120-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-24ext4: remove unused group parameter in ext4_block_bitmap_csum_verifyKemeng Shi1-2/+1
Remove unused group parameter in ext4_block_bitmap_csum_verify. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230221203027.2359920-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-24ext4: improve inode table blocks counting in ext4_num_overhead_clustersKemeng Shi1-43/+47
As inode table blocks are contiguous, inode table blocks inside the block_group can be represented as range [itbl_cluster_start, itbl_cluster_last]. Then we can simply account inode table cluters and check cluster overlap with [itbl_cluster_start, itbl_cluster_last] instead of traverse each block of inode table. By the way, this patch fixes code style problem of comment for ext4_num_overhead_clusters. [ Merged fix-up patch which fixed potentially access to an uninitialzied stack variable. --TYT ] Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230221115919.1918161-8-shikemeng@huaweicloud.com Link: https://lore.kernel.org/r/202303171446.eLEhZzAu-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-24ext4: stop trying to verify just initialized bitmap in ↵Kemeng Shi1-1/+1
ext4_read_block_bitmap_nowait For case we initialize a bitmap bh, we will set bitmap bh verified. We can return immediately instead of goto verify to remove unnecessary work for trying to verify bitmap bh. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230221115919.1918161-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-24ext4: remove stale comment in ext4_init_block_bitmapKemeng Shi1-2/+0
Commit bdfb6ff4a255d ("ext4: mark group corrupt on group descriptor checksum") added flag to indicate corruption of group instead of marking all blocks used. Just remove stale comment. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230221115919.1918161-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-24ext4: remove unnecessary check in ext4_bg_num_gdb_nometaKemeng Shi1-4/+1
We only call ext4_bg_num_gdb_nometa if there is no meta_bg feature or group does not reach first_meta_bg, so group must not reside in meta group. Then group has a global gdt if superblock exists. Just remove confusing branch that meta_bg feature exists. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230221115919.1918161-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-24ext4: call ext4_bg_num_gdb_[no]meta directly in ext4_num_base_meta_clustersKemeng Shi1-2/+2
ext4_num_base_meta_clusters is already aware of meta_bg feature and test if block_group is inside real meta block groups before calling ext4_bg_num_gdb. Then ext4_bg_num_gdb will check if block group is inside a real meta block groups again to decide either ext4_bg_num_gdb_meta or ext4_bg_num_gdb_nometa is needed. Call ext4_bg_num_gdb_meta or ext4_bg_num_gdb_nometa directly after we check if block_group is inside a meta block groups in ext4_num_base_meta_clusters to remove redundant check of meta block groups in ext4_bg_num_gdb. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230221115919.1918161-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-24ext4: correct validation check of inode table in ext4_valid_block_bitmapKemeng Shi1-3/+3
1.Last valid cluster of inode table is EXT4_B2C(sbi, offset + sbi->s_itb_per_group - 1). We should make sure last valid cluster is < max_bit, i.e., EXT4_B2C(sbi, offset + sbi->s_itb_per_group - 1) is < max_bit rather than EXT4_B2C(sbi, offset + sbi->s_itb_per_group) is < max_bit. 2.Bit search length should be last valid cluster plus 1, i.e., EXT4_B2C(sbi, offset + sbi->s_itb_per_group - 1) + 1 rather than EXT4_B2C(sbi, offset + sbi->s_itb_per_group). Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230221115919.1918161-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-24ext4: properly handle error of ext4_init_block_bitmap in ↵Kemeng Shi1-5/+7
ext4_read_block_bitmap_nowait We mark buffer_head of bitmap successfully initialized even error occurs in ext4_init_block_bitmap. Although we will return error, we will get a invalid buffer_head of bitmap from next ext4_read_block_bitmap_nowait which is marked buffer_verified but not successfully initialized actually in previous ext4_read_block_bitmap_nowait. Fix this by only marking buffer_head successfully initialized if ext4_init_block_bitmap successes. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230221115919.1918161-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-03ext4: use ext4_debug() instead of jbd_debug()Jan Kara1-1/+1
We use jbd_debug() in some places in ext4. It seems a bit strange to use jbd2 debugging output function for ext4 code. Also these days ext4_debug() uses dynamic printk so each debug message can be enabled / disabled on its own so the time when it made some sense to have these combined (to allow easier common selecting of messages to report) has passed. Just convert all jbd_debug() uses in ext4 to ext4_debug(). Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220608112355.4397-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-16ext4: fix kernel doc warningsTheodore Ts'o1-0/+1
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-31ext4: flush background discard kwork when retry allocationWang Jianchao1-1/+7
The background discard kwork tries to mark blocks used and issue discard. This can make filesystem suffer from NOSPC error, xfstest generic/371 can fail due to it. Fix it by flushing discard kwork in ext4_should_retry_alloc. At the same time, give up discard at the moment. Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Link: https://lore.kernel.org/r/20210830075246.12516-6-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-10ext4: fix trailing whitespaceJack Qiu1-1/+1
Made suggested modifications from checkpatch in reference to ERROR: trailing whitespace Signed-off-by: Jack Qiu <jack.qiu@huawei.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20210409042035.15516-1-jack.qiu@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-06ext4: shrink race window in ext4_should_retry_alloc()Eric Whitney1-12/+26
When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it fails at significant rates on the two test scenarios that disable delayed allocation (ext3conv and data_journal) and force actual block allocation for the fallocate and pwrite functions in the test. The failure rate on 5.10 for both ext3conv and data_journal on one test system typically runs about 85%. On 5.11, the failure rate on ext3conv sometimes drops to as low as 1% while the rate on data_journal increases to nearly 100%. The observed failures are largely due to ext4_should_retry_alloc() cutting off block allocation retries when s_mb_free_pending (used to indicate that a transaction in progress will free blocks) is 0. However, free space is usually available when this occurs during runs of generic/371. It appears that a thread attempting to allocate blocks is just missing transaction commits in other threads that increase the free cluster count and reset s_mb_free_pending while the allocating thread isn't running. Explicitly testing for free space availability avoids this race. The current code uses a post-increment operator in the conditional expression that determines whether the retry limit has been exceeded. This means that the conditional expression uses the value of the retry counter before it's increased, resulting in an extra retry cycle. The current code actually retries twice before hitting its retry limit rather than once. Increasing the retry limit to 3 from the current actual maximum retry count of 2 in combination with the change described above reduces the observed failure rate to less that 0.1% on both ext3conv and data_journal with what should be limited impact on users sensitive to the overhead caused by retries. A per filesystem percpu counter exported via sysfs is added to allow users or developers to track the number of times the retry limit is exceeded without resorting to debugging methods. This should provide some insight into worst case retry behavior. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03ext4: use ASSERT() to replace J_ASSERT()Chunguang Xu1-1/+1
There are currently multiple forms of assertion, such as J_ASSERT(). J_ASEERT() is provided for the jbd module, which is a public module. Maybe we should use custom ASSERT() like other file systems, such as xfs, which would be better. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/1604764698-4269-1-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-22ext4: fast commit recovery pathHarshad Shirwadkar1-1/+6
This patch adds fast commit recovery path support for Ext4 file system. We add several helper functions that are similar in spirit to e2fsprogs journal recovery path handlers. Example of such functions include - a simple block allocator, idempotent block bitmap update function etc. Using these routines and the fast commit log in the fast commit area, the recovery path (ext4_fc_replay()) performs fast commit log recovery. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: use common helpers in all places reading metadata bufferszhangyi (F)1-5/+3
Revome all open codes that read metadata buffers, switch to use ext4_read_bh_*() common helpers. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200924073337.861472-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: clear buffer verified flag if read meta block from diskzhangyi (F)1-0/+1
The metadata buffer is no longer trusted after we read it from disk again because it is not uptodate for some reasons (e.g. failed to write back). Otherwise we may get below memory corruption problem in ext4_ext_split()->memset() if we read stale data from the newly allocated extent block on disk which has been failed to async write out but miss verify again since the verified bit has already been set on the buffer. [ 29.774674] BUG: unable to handle kernel paging request at ffff88841949d000 ... [ 29.783317] Oops: 0002 [#2] SMP [ 29.784219] R10: 00000000000f4240 R11: 0000000000002e28 R12: ffff88842fa1c800 [ 29.784627] CPU: 1 PID: 126 Comm: kworker/u4:3 Tainted: G D W [ 29.785546] R13: ffffffff9cddcc20 R14: ffffffff9cddd420 R15: ffff88842fa1c2f8 [ 29.786679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS ?-20190727_0738364 [ 29.787588] FS: 0000000000000000(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000 [ 29.789288] Workqueue: writeback wb_workfn [ 29.790319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 29.790321] (flush-8:0) [ 29.790844] CR2: 0000000000000008 CR3: 00000004234f2000 CR4: 00000000000006f0 [ 29.791924] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 29.792839] RIP: 0010:__memset+0x24/0x30 [ 29.793739] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 29.794256] Code: 90 90 90 90 90 90 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 033 [ 29.795161] Kernel panic - not syncing: Fatal exception in interrupt ... [ 29.808149] Call Trace: [ 29.808475] ext4_ext_insert_extent+0x102e/0x1be0 [ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0 [ 29.809652] ext4_map_blocks+0x290/0x8a0 [ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0 [ 29.809652] ext4_map_blocks+0x290/0x8a0 [ 29.810161] ext4_writepages+0xc85/0x17c0 ... Fix this by clearing buffer's verified bit if we read meta block from disk again. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200924073337.861472-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07ext4: indicate via a block bitmap read is prefetched via a tracepointTheodore Ts'o1-1/+1
Modify the ext4_read_block_bitmap_load tracepoint so that it tells us whether a block bitmap is being prefetched. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
2020-08-06ext4: add prefetching for block allocation bitmapsAlex Zhuravlev1-3/+11
This should significantly improve bitmap loading, especially for flex groups as it tries to load all bitmaps within a flex.group instead of one by one synchronously. Prefetching is done in 8 * flex_bg groups, so it should be 8 read-ahead reads for a single allocating thread. At the end of allocation the thread waits for read-ahead completion and initializes buddy information so that read-aheads are not lost in case of memory pressure. At cr=0 the number of prefetching IOs is limited per allocation context to prevent a situation when mballoc loads thousands of bitmaps looking for a perfect group and ignoring groups with good chunks. Together with the patch "ext4: limit scanning of uninitialized groups" the mount time (which includes few tiny allocations) of a 1PB filesystem is reduced significantly: 0% full 50%-full unpatched patched mount time 33s 9279s 563s [ Restructured by tytso; removed the state flags in the allocation context, so it can be used to lazily prefetch the allocation bitmaps immediately after the file system is mounted. Skip prefetching block groups which are uninitialized. Finally pass in the REQ_RAHEAD flag to the block layer while prefetching. ] Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04ext4: balloc: use task_pid_nr() helperRitesh Harjani1-2/+3
Use task_pid_nr() function instead of current->pid. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/4b58403e15e9c8deb34a1b93deb3fc9cd153ab84.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-16ext4: fix return-value types in several function commentsJosh Triplett1-2/+2
The documentation comments for ext4_read_block_bitmap_nowait and ext4_read_inode_bitmap describe them as returning NULL on error, but they return an ERR_PTR on error; update the documentation to match. The documentation comment for ext4_wait_block_bitmap describes it as returning 1 on error, but it returns -errno on error; update the documentation to match. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Ritesh Harani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/60a3f4996f4932c45515aaa6b75ca42f2a78ec9b.1585512514.git.josh@joshtriplett.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-02ext4: save all error info in save_error_info() and drop ext4_set_errno()Theodore Ts'o1-4/+3
Using a separate function, ext4_set_errno() to set the errno is problematic because it doesn't do the right thing once s_last_error_errorcode is non-zero. It's also less racy to set all of the error information all at once. (Also, as a bonus, it shrinks code size slightly.) Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu Fixes: 878520ac45f9 ("ext4: save the error code which triggered...") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-21ext4: fix potential race between online resizing and write operationsTheodore Ts'o1-3/+11
During an online resize an array of pointers to buffer heads gets replaced so it can get enlarged. If there is a racing block allocation or deallocation which uses the old array, and the old array has gotten reused this can lead to a GPF or some other random kernel memory getting modified. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu Reported-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2019-12-26ext4: simulate various I/O and checksum errors when reading metadataTheodore Ts'o1-1/+3
This allows us to test various error handling code paths Link: https://lore.kernel.org/r/20191209012317.59398-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26ext4: save the error code which triggered an ext4_error() in the superblockTheodore Ts'o1-0/+1
This allows the cause of an ext4_error() report to be categorized based on whether it was triggered due to an I/O error, or an memory allocation error, or other possible causes. Most errors are caused by a detected file system inconsistency, so the default code stored in the superblock will be EXT4_ERR_EFSCORRUPTED. Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-19ext4: clean up kerneldoc warnigns when building with W=1Theodore Ts'o1-2/+2
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-08-01ext4: use ext4_warning() for sb_getblk failureWang Shilong1-3/+3
Out of memory should not be considered as critical errors; so replace ext4_error() with ext4_warnig(). Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2018-07-13ext4: check for allocation block validity with block group lockedTheodore Ts'o1-0/+3
With commit 044e6e3d74a3: "ext4: don't update checksum of new initialized bitmaps" the buffer valid bit will get set without actually setting up the checksum for the allocation bitmap, since the checksum will get calculated once we actually allocate an inode or block. If we are doing this, then we need to (re-)check the verified bit after we take the block group lock. Otherwise, we could race with another process reading and verifying the bitmap, which would then complain about the checksum being invalid. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2018-06-14ext4: only look at the bg_flags field if it is validTheodore Ts'o1-1/+10
The bg_flags field in the block group descripts is only valid if the uninit_bg or metadata_csum feature is enabled. We were not consistently looking at this field; fix this. Also block group #0 must never have uninitialized allocation bitmaps, or need to be zeroed, since that's where the root inode, and other special inodes are set up. Check for these conditions and mark the file system as corrupted if they are detected. This addresses CVE-2018-10876. https://bugzilla.kernel.org/show_bug.cgi?id=199403 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2018-06-14ext4: always check block group bounds in ext4_init_block_bitmap()Theodore Ts'o1-7/+3
Regardless of whether the flex_bg feature is set, we should always check to make sure the bits we are setting in the block bitmap are within the block group bounds. https://bugzilla.kernel.org/show_bug.cgi?id=199865 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2018-05-12ext4: mark block bitmap corrupted when foundWang Shilong1-0/+4
There are still some cases that we missed to set block bitmaps corrupted bit properly: 1) block bitmap number is wrong. 2) failed to read block bitmap due to disk errors. 3) double free block bitmaps.. 4) some mismatch check with bitmaps vs buddy information. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Wang Shilong <wshilong@ddn.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2018-05-12ext4: add new ext4_mark_group_bitmap_corrupted() helperWang Shilong1-22/+7
Since there are many places to set inode/block bitmap corrupt bit, add a new helper for it, which will make codes more clear. Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2018-04-24ext4: fix bitmap position validationLukas Czerner1-4/+5
Currently in ext4_valid_block_bitmap() we expect the bitmap to be positioned anywhere between 0 and s_blocksize clusters, but that's wrong because the bitmap can be placed anywhere in the block group. This causes false positives when validating bitmaps on perfectly valid file system layouts. Fix it by checking whether the bitmap is within the group boundary. The problem can be reproduced using the following mkfs -t ext3 -E stride=256 /dev/vdb1 mount /dev/vdb1 /mnt/test cd /mnt/test wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.16.3.tar.xz tar xf linux-4.16.3.tar.xz This will result in the warnings in the logs EXT4-fs error (device vdb1): ext4_validate_block_bitmap:399: comm tar: bg 84: block 2774529: invalid block bitmap [ Changed slightly for clarity and to not drop a overflow test -- TYT ] Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Ilya Dryomov <idryomov@gmail.com> Fixes: 7dac4a1726a9 ("ext4: add validity checks for bitmap block numbers") Cc: stable@vger.kernel.org
2018-03-27ext4: add validity checks for bitmap block numbersTheodore Ts'o1-2/+14
An privileged attacker can cause a crash by mounting a crafted ext4 image which triggers a out-of-bounds read in the function ext4_valid_block_bitmap() in fs/ext4/balloc.c. This issue has been assigned CVE-2018-1093. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199181 BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1560782 Reported-by: Wen Xu <wen.xu@gatech.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2018-02-19ext4: don't update checksum of new initialized bitmapsTheodore Ts'o1-2/+1
When reading the inode or block allocation bitmap, if the bitmap needs to be initialized, do not update the checksum in the block group descriptor. That's because we're not set up to journal those changes. Instead, just set the verified bit on the bitmap block, so that it's not necessary to validate the checksum. When a block or inode allocation actually happens, at that point the checksum will be calculated, and update of the bg descriptor block will be properly journalled. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2018-01-11ext4: use 'sbi' instead of 'EXT4_SB(sb)'Jun Piao1-2/+2
We could use 'sbi' instead of 'EXT4_SB(sb)' to make code more elegant. Signed-off-by: Jun Piao <piaojun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2017-11-14Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-8/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: - Add support for online resizing of file systems with bigalloc - Fix a two data corruption bugs involving DAX, as well as a corruption bug after a crash during a racing fallocate and delayed allocation. - Finally, a number of cleanups and optimizations. * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: improve smp scalability for inode generation ext4: add support for online resizing with bigalloc ext4: mention noload when recovering on read-only device Documentation: fix little inconsistencies ext4: convert timers to use timer_setup() jbd2: convert timers to use timer_setup() ext4: remove duplicate extended attributes defs ext4: add ext4_should_use_dax() ext4: add sanity check for encryption + DAX ext4: prevent data corruption with journaling + DAX ext4: prevent data corruption with inline data + DAX ext4: fix interaction between i_size, fallocate, and delalloc after a crash ext4: retry allocations conservatively ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA ext4: Add iomap support for inline data iomap: Add IOMAP_F_DATA_INLINE flag iomap: Switch from blkno to disk offset
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman1-0/+1
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-02ext4: retry allocations conservativelyTheodore Ts'o1-8/+7
Now that we no longer try to reserve metadata blocks for delayed allocations (which tended to overestimate the required number of blocks significantly), we really don't need retry allocations when the disk is very full as aggressively any more. The only time when it makes sense to retry an allocation is if we have freshly deleted blocks that will only become available after a transaction commit. And if we lose that race, it's not worth it to try more than once. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-27Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-1/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The major change this cycle is deleting ext4's copy of the file system encryption code and switching things over to using the copies in fs/crypto. I've updated the MAINTAINERS file to add an entry for fs/crypto listing Jaeguk Kim and myself as the maintainers. There are also a number of bug fixes, most notably for some problems found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum. Also fixed is a writeback deadlock detected by generic/130, and some potential races in the metadata checksum code" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits) ext4: verify extent header depth ext4: short-cut orphan cleanup on error ext4: fix reference counting bug on block allocation error MAINTAINRES: fs-crypto maintainers update ext4 crypto: migrate into vfs's crypto engine ext2: fix filesystem deadlock while reading corrupted xattr block ext4: fix project quota accounting without quota limits enabled ext4: validate s_reserved_gdt_blocks on mount ext4: remove unused page_idx ext4: don't call ext4_should_journal_data() on the journal inode ext4: Fix WARN_ON_ONCE in ext4_commit_super() ext4: fix deadlock during page writeback ext4: correct error value of function verifying dx checksum ext4: avoid modifying checksum fields directly during checksum verification ext4: check for extents that wrap around jbd2: make journal y2038 safe jbd2: track more dependencies on transaction commit jbd2: move lockdep tracking to journal_s jbd2: move lockdep instrumentation for jbd2 handles ext4: respect the nobarrier mount option in nojournal mode ...
2016-07-06ext4: validate s_reserved_gdt_blocks on mountTheodore Ts'o1-0/+3
If s_reserved_gdt_blocks is extremely large, it's possible for ext4_init_block_bitmap(), which is called when ext4 sets up an uninitialized block bitmap, to corrupt random kernel memory. Add the same checks which e2fsck has --- it must never be larger than blocksize / sizeof(__u32) --- and then add a backup check in ext4_init_block_bitmap() in case the superblock gets modified after the file system is mounted. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2016-06-27ext4: optimize ext4_should_retry_alloc() to improve ENOSPC performanceTheodore Ts'o1-1/+3
If there are no pending blocks to be released after a commit, forcing a journal commit has no hope of helping. It's possible that a commit had just completed, so if there are now free blocks available for allocation, it's worth retrying the commit. Reported-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-06-07fs: have submit_bh users pass in op and flags separatelyMike Christie1-1/+1
This has submit_bh users pass in the operation and flags separately, so submit_bh_wbc can setup the bio op and bi_rw flags on the bio that is submitted. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-13ext4: fix race in transient ENOSPC detectionJan Kara1-1/+2
When there are blocks to free in the running transaction, block allocator can return ENOSPC although the filesystem has some blocks to free. We use ext4_should_retry_alloc() to force commit of the current transaction and return whether anything was committed so that it makes sense to retry the allocation. However the transaction may get committed after block allocation fails but before we call ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false because there is nothing to commit and we wrongly return ENOSPC. Fix the race by unconditionally returning 1 from ext4_should_retry_alloc() when we tried to commit a transaction. This should not add any unnecessary retries since we had a transaction running a while ago when trying to allocate blocks and we want to retry the allocation once that transaction has committed anyway. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-12ext4: fix scheduling in atomic on group checksum failureJan Kara1-3/+4
When block group checksum is wrong, we call ext4_error() while holding group spinlock from ext4_init_block_bitmap() or ext4_init_inode_bitmap() which results in scheduling while in atomic. Fix the issue by calling ext4_error() later after dropping the spinlock. CC: stable@vger.kernel.org Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>