summaryrefslogtreecommitdiff
path: root/fs/ext4/namei.c
AgeCommit message (Collapse)AuthorFilesLines
2021-01-16Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds1-10/+17
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "A number of bug fixes for ext4: - Fix for the new fast_commit feature - Fix some error handling codepaths in whiteout handling and mountpoint sampling - Fix how we write ext4_error information so it goes through the journal when journalling is active, to avoid races that can lead to lost error information, superblock checksum failures, or DIF/DIX features" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: remove expensive flush on fast commit ext4: fix bug for rename with RENAME_WHITEOUT ext4: fix wrong list_splice in ext4_fc_cleanup ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR ext4: don't leak old mountpoint samples ext4: drop ext4_handle_dirty_super() ext4: fix superblock checksum failure when setting password salt ext4: use sbi instead of EXT4_SB(sb) in ext4_update_super() ext4: save error info to sb through journal if available ext4: protect superblock modifications with a buffer lock ext4: drop sync argument of ext4_commit_super() ext4: combine ext4_handle_error() and save_error_info()
2021-01-15ext4: fix bug for rename with RENAME_WHITEOUTyangerkun1-8/+9
We got a "deleted inode referenced" warning cross our fsstress test. The bug can be reproduced easily with following steps: cd /dev/shm mkdir test/ fallocate -l 128M img mkfs.ext4 -b 1024 img mount img test/ dd if=/dev/zero of=test/foo bs=1M count=128 mkdir test/dir/ && cd test/dir/ for ((i=0;i<1000;i++)); do touch file$i; done # consume all block cd ~ && renameat2(AT_FDCWD, /dev/shm/test/dir/file1, AT_FDCWD, /dev/shm/test/dir/dst_file, RENAME_WHITEOUT) # ext4_add_entry in ext4_rename will return ENOSPC!! cd /dev/shm/ && umount test/ && mount img test/ && ls -li test/dir/file1 We will get the output: "ls: cannot access 'test/dir/file1': Structure needs cleaning" and the dmesg show: "EXT4-fs error (device loop0): ext4_lookup:1626: inode #2049: comm ls: deleted inode referenced: 139" ext4_rename will create a special inode for whiteout and use this 'ino' to replace the source file's dir entry 'ino'. Once error happens latter(the error above was the ENOSPC return from ext4_add_entry in ext4_rename since all space has been consumed), the cleanup do drop the nlink for whiteout, but forget to restore 'ino' with source file. This will trigger the bug describle as above. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Fixes: cd808deced43 ("ext4: support RENAME_WHITEOUT") Link: https://lore.kernel.org/r/20210105062857.3566-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-25Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-8/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Various bug fixes and cleanups for ext4; no new features this cycle" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (29 commits) ext4: remove unnecessary wbc parameter from ext4_bio_write_page ext4: avoid s_mb_prefetch to be zero in individual scenarios ext4: defer saving error info from atomic context ext4: simplify ext4 error translation ext4: move functions in super.c ext4: make ext4_abort() use __ext4_error() ext4: standardize error message in ext4_protect_reserved_inode() ext4: remove redundant sb checksum recomputation ext4: don't remount read-only with errors=continue on reboot ext4: fix deadlock with fs freezing and EA inodes jbd2: add a helper to find out number of fast commit blocks ext4: make fast_commit.h byte identical with e2fsprogs/fast_commit.h ext4: fix fall-through warnings for Clang ext4: add docs about fast commit idempotence ext4: remove the unused EXT4_CURRENT_REV macro ext4: fix an IS_ERR() vs NULL check ext4: check for invalid block size early when mounting a file system ext4: fix a memory leak of ext4_free_data ext4: delete nonsensical (commented-out) code inside ext4_xattr_block_set() ext4: update ext4_data_block_valid related comments ...
2020-12-22ext4: drop ext4_handle_dirty_super()Jan Kara1-2/+2
The wrapper is now useless since it does what ext4_handle_dirty_metadata() does. Just remove it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-9-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: protect superblock modifications with a buffer lockJan Kara1-0/+6
Protect all superblock modifications (including checksum computation) with a superblock buffer lock. That way we are sure computed checksum matches current superblock contents (a mismatch could cause checksum failures in nojournal mode or if an unjournalled superblock update races with a journalled one). Also we avoid modifying superblock contents while it is being written out (which can cause DIF/DIX failures if we are running in nojournal mode). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17Merge tag 'f2fs-for-5.11-rc1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've made more work into per-file compression support. For example, F2FS_IOC_GET | SET_COMPRESS_OPTION provides a way to change the algorithm or cluster size per file. F2FS_IOC_COMPRESS | DECOMPRESS_FILE provides a way to compress and decompress the existing normal files manually. There is also a new mount option, compress_mode=fs|user, which can control who compresses the data. Chao also added a checksum feature with a mount option so that we are able to detect any corrupted cluster. In addition, Daniel contributed casefolding with encryption patch, which will be used for Android devices. Summary: Enhancements: - add ioctls and mount option to manage per-file compression feature - support casefolding with encryption - support checksum for compressed cluster - avoid IO starvation by replacing mutex with rwsem - add sysfs, max_io_bytes, to control max bio size Bug fixes: - fix use-after-free issue when compression and fsverity are enabled - fix consistency corruption during fault injection test - fix data offset for lseek - get rid of buffer_head which has 32bits limit in fiemap - fix some bugs in multi-partitions support - fix nat entry count calculation in shrinker - fix some stat information And, we've refactored some logics and fix minor bugs as well" * tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits) f2fs: compress: fix compression chksum f2fs: fix shift-out-of-bounds in sanity_check_raw_super() f2fs: fix race of pending_pages in decompression f2fs: fix to account inline xattr correctly during recovery f2fs: inline: fix wrong inline inode stat f2fs: inline: correct comment in f2fs_recover_inline_data f2fs: don't check PAGE_SIZE again in sanity_check_raw_super() f2fs: convert to F2FS_*_INO macro f2fs: introduce max_io_bytes, a sysfs entry, to limit bio size f2fs: don't allow any writes on readonly mount f2fs: avoid race condition for shrinker count f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE f2fs: add compress_mode mount option f2fs: Remove unnecessary unlikely() f2fs: init dirty_secmap incorrectly f2fs: remove buffer_head which has 32bits limit f2fs: fix wrong block count instead of bytes f2fs: use new conversion functions between blks and bytes f2fs: rename logical_to_blk and blk_to_logical f2fs: fix kbytes written stat for multi-device case ...
2020-12-03ext4: use ASSERT() to replace J_ASSERT()Chunguang Xu1-8/+4
There are currently multiple forms of assertion, such as J_ASSERT(). J_ASEERT() is provided for the jbd module, which is a public module. Maybe we should use custom ASSERT() like other file systems, such as xfs, which would be better. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/1604764698-4269-1-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03fscrypt: Have filesystems handle their d_opsDaniel Rosenberg1-0/+1
This shifts the responsibility of setting up dentry operations from fscrypt to the individual filesystems, allowing them to have their own operations while still setting fscrypt's d_revalidate as appropriate. Most filesystems can just use generic_set_encrypted_ci_d_ops, unless they have their own specific dentry operations as well. That operation will set the minimal d_ops required under the circumstances. Since the fscrypt d_ops are set later on, we must set all d_ops there, since we cannot adjust those later on. This should not result in any change in behavior. Signed-off-by: Daniel Rosenberg <drosen@google.com> Acked-by: Theodore Ts'o <tytso@mit.edu> Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-03fscrypt: introduce fscrypt_prepare_readdir()Eric Biggers1-1/+1
The last remaining use of fscrypt_get_encryption_info() from filesystems is for readdir (->iterate_shared()). Every other call is now in fs/crypto/ as part of some other higher-level operation. We need to add a new argument to fscrypt_get_encryption_info() to indicate whether the encryption policy is allowed to be unrecognized or not. Doing this is easier if we can work with high-level operations rather than direct filesystem use of fscrypt_get_encryption_info(). So add a function fscrypt_prepare_readdir() which wraps the call to fscrypt_get_encryption_info() for the readdir use case. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201203022041.230976-6-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-12-03ext4: don't call fscrypt_get_encryption_info() from dx_show_leaf()Eric Biggers1-7/+1
The call to fscrypt_get_encryption_info() in dx_show_leaf() is too low in the call tree; fscrypt_get_encryption_info() should have already been called when starting the directory operation. And indeed, it already is. Moreover, the encryption key is guaranteed to already be available because dx_show_leaf() is only called when adding a new directory entry. And even if the key wasn't available, dx_show_leaf() uses fscrypt_fname_disk_to_usr() which knows how to create a no-key name. So for the above reasons, and because it would be desirable to stop exporting fscrypt_get_encryption_info() directly to filesystems, remove the call to fscrypt_get_encryption_info() from dx_show_leaf(). Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201203022041.230976-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-25ext4: prevent creating duplicate encrypted filenamesEric Biggers1-0/+3
As described in "fscrypt: add fscrypt_is_nokey_name()", it's possible to create a duplicate filename in an encrypted directory by creating a file concurrently with adding the directory's encryption key. Fix this bug on ext4 by rejecting no-key dentries in ext4_add_entry(). Note that the duplicate check in ext4_find_dest_de() sometimes prevented this bug. However in many cases it didn't, since ext4_find_dest_de() doesn't examine every dentry. Fixes: 4461471107b7 ("ext4 crypto: enable filename encryption") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201118075609.120337-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-07ext4: fixup ext4_fc_track_* functions' signatureHarshad Shirwadkar1-33/+28
Firstly, pass handle to all ext4_fc_track_* functions and use transaction id found in handle->h_transaction->h_tid for tracking fast commit updates. Secondly, don't pass inode to ext4_fc_track_link/create/unlink functions. inode can be found inside these functions as d_inode(dentry). However, rename path is an exeception. That's because in that case, we need inode that's not same as d_inode(dentry). To handle that, add a couple of low-level wrapper functions that take inode and dentry as arguments. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-5-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28ext4: use generic casefolding supportDaniel Rosenberg1-12/+8
This switches ext4 over to the generic support provided in libfs. Since casefolded dentries behave the same in ext4 and f2fs, we decrease the maintenance burden by unifying them, and any optimizations will immediately apply to both. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20201028050820.1636571-1-drosen@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-22Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-76/+130
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The siginificant new ext4 feature this time around is Harshad's new fast_commit mode. In addition, thanks to Mauricio for fixing a race where mmap'ed pages that are being changed in parallel with a data=journal transaction commit could result in bad checksums in the failure that could cause journal replays to fail. Also notable is Ritesh's buffered write optimization which can result in significant improvements on parallel write workloads. (The kernel test robot reported a 330.6% improvement on fio.write_iops on a 96 core system using DAX) Besides that, we have the usual miscellaneous cleanups and bug fixes" Link: https://lore.kernel.org/r/20200925071217.GO28663@shao2-debian * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits) ext4: fix invalid inode checksum ext4: add fast commit stats in procfs ext4: add a mount opt to forcefully turn fast commits on ext4: fast commit recovery path jbd2: fast commit recovery path ext4: main fast-commit commit path jbd2: add fast commit machinery ext4 / jbd2: add fast commit initialization ext4: add fast_commit feature and handling for extended mount options doc: update ext4 and journalling docs to include fast commit feature ext4: Detect already used quota file early jbd2: avoid transaction reuse after reformatting ext4: use the normal helper to get the actual inode ext4: fix bs < ps issue reported with dioread_nolock mount opt ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() ext4: data=journal: fixes for ext4_page_mkwrite() jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable() ext4: use ext4_sb_bread() instead of sb_bread() ...
2020-10-22ext4: fast commit recovery pathHarshad Shirwadkar1-63/+86
This patch adds fast commit recovery path support for Ext4 file system. We add several helper functions that are similar in spirit to e2fsprogs journal recovery path handlers. Example of such functions include - a simple block allocator, idempotent block bitmap update function etc. Using these routines and the fast commit log in the fast commit area, the recovery path (ext4_fc_replay()) performs fast commit log recovery. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-22ext4: main fast-commit commit pathHarshad Shirwadkar1-3/+34
This patch adds main fast commit commit path handlers. The overall patch can be divided into two inter-related parts: (A) Metadata updates tracking This part consists of helper functions to track changes that need to be committed during a commit operation. These updates are maintained by Ext4 in different in-memory queues. Following are the APIs and their short description that are implemented in this patch: - ext4_fc_track_link/unlink/creat() - Track unlink. link and creat operations - ext4_fc_track_range() - Track changed logical block offsets inodes - ext4_fc_track_inode() - Track inodes - ext4_fc_mark_ineligible() - Mark file system fast commit ineligible() - ext4_fc_start_update() / ext4_fc_stop_update() / ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These functions are useful for co-ordinating inode updates with commits. (B) Main commit Path This part consists of functions to convert updates tracked in in-memory data structures into on-disk commits. Function ext4_fc_commit() is the main entry point to commit path. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-6-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: remove unused argument from ext4_(inc|dec)_countNikolay Borisov1-10/+10
The 'handle' argument is not used for anything so simply remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200826133116.11592-1-nborisov@suse.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-09-08fscrypt: drop unused inode argument from fscrypt_fname_alloc_bufferJeff Layton1-4/+3
Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200810142139.487631-1-jlayton@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-08-18ext4: remove unused parameter of ext4_generic_delete_entry functionKyoungho Koo1-4/+2
The ext4_generic_delete_entry function does not use the parameter handle, so it can be removed. Signed-off-by: Kyoungho Koo <rnrudgh@gmail.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200810080701.GA14160@koo-Z370-HD3 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07ext4: fix checking of directory entry validity for inline directoriesJan Kara1-3/+3
ext4_search_dir() and ext4_generic_delete_entry() can be called both for standard director blocks and for inline directories stored inside inode or inline xattr space. For the second case we didn't call ext4_check_dir_entry() with proper constraints that could result in accepting corrupted directory entry as well as false positive filesystem errors like: EXT4-fs error (device dm-0): ext4_search_dir:1395: inode #28320400: block 113246792: comm dockerd: bad entry in directory: directory entry too close to block end - offset=0, inode=28320403, rec_len=32, name_len=8, size=4096 Fix the arguments passed to ext4_check_dir_entry(). Fixes: 109ba779d6cc ("ext4: check for directory entries too close to block end") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200731162135.8080-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06ext4: lost matching-pair of trace in ext4_unlinkYi Zhuang1-17/+21
If dquot_initialize() return non-zero and trace of ext4_unlink_enter/exit enabled then the matching-pair of trace_exit will lost in log. Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200629122621.129953-1-zhuangyi1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06ext4: fix potential negative array index in do_split()Eric Sandeen1-3/+13
If for any reason a directory passed to do_split() does not have enough active entries to exceed half the size of the block, we can end up iterating over all "count" entries without finding a split point. In this case, count == move, and split will be zero, and we will attempt a negative index into map[]. Guard against this by detecting this case, and falling back to split-to-half-of-count instead; in this case we will still have plenty of space (> half blocksize) in each split block. Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/f53e246b-647c-64bb-16ec-135383c70ad7@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04ext4: handle ext4_mark_inode_dirty errorsHarshad Shirwadkar1-27/+49
ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return value may lead ext4 to ignore real failures that would result in corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail as soon as possible and return errors to the caller whenever appropriate. One of the possible scnearios when this bug could affected is that while creating a new inode, its directory entry gets added successfully but while writing the inode itself mark_inode_dirty returns error which is ignored. This would result in inconsistency that the directory entry points to a non-existent inode. Ran gce-xfstests smoke tests and verified that there were no regressions. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-02ext4: save all error info in save_error_info() and drop ext4_set_errno()Theodore Ts'o1-12/+12
Using a separate function, ext4_set_errno() to set the errno is problematic because it doesn't do the right thing once s_last_error_errorcode is non-zero. It's also less racy to set all of the error information all at once. (Also, as a bonus, it shrinks code size slightly.) Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu Fixes: 878520ac45f9 ("ext4: save the error code which triggered...") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-06ext4: use flexible-array members in struct dx_node and struct dx_rootGustavo A. R. Silva1-2/+2
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Link: https://lore.kernel.org/r/20200213160648.GA7054@embeddedor Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-20ext4: add cond_resched() to __ext4_find_entry()Shijie Luo1-0/+1
We tested a soft lockup problem in linux 4.19 which could also be found in linux 5.x. When dir inode takes up a large number of blocks, and if the directory is growing when we are searching, it's possible the restart branch could be called many times, and the do while loop could hold cpu a long time. Here is the call trace in linux 4.19. [ 473.756186] Call trace: [ 473.756196] dump_backtrace+0x0/0x198 [ 473.756199] show_stack+0x24/0x30 [ 473.756205] dump_stack+0xa4/0xcc [ 473.756210] watchdog_timer_fn+0x300/0x3e8 [ 473.756215] __hrtimer_run_queues+0x114/0x358 [ 473.756217] hrtimer_interrupt+0x104/0x2d8 [ 473.756222] arch_timer_handler_virt+0x38/0x58 [ 473.756226] handle_percpu_devid_irq+0x90/0x248 [ 473.756231] generic_handle_irq+0x34/0x50 [ 473.756234] __handle_domain_irq+0x68/0xc0 [ 473.756236] gic_handle_irq+0x6c/0x150 [ 473.756238] el1_irq+0xb8/0x140 [ 473.756286] ext4_es_lookup_extent+0xdc/0x258 [ext4] [ 473.756310] ext4_map_blocks+0x64/0x5c0 [ext4] [ 473.756333] ext4_getblk+0x6c/0x1d0 [ext4] [ 473.756356] ext4_bread_batch+0x7c/0x1f8 [ext4] [ 473.756379] ext4_find_entry+0x124/0x3f8 [ext4] [ 473.756402] ext4_lookup+0x8c/0x258 [ext4] [ 473.756407] __lookup_hash+0x8c/0xe8 [ 473.756411] filename_create+0xa0/0x170 [ 473.756413] do_mkdirat+0x6c/0x140 [ 473.756415] __arm64_sys_mkdirat+0x28/0x38 [ 473.756419] el0_svc_common+0x78/0x130 [ 473.756421] el0_svc_handler+0x38/0x78 [ 473.756423] el0_svc+0x8/0xc [ 485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149] Add cond_resched() to avoid soft lockup and to provide a better system responding. Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.com Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
2020-02-13ext4: fix checksum errors with indexed dirsJan Kara1-0/+7
DIR_INDEX has been introduced as a compat ext4 feature. That means that even kernels / tools that don't understand the feature may modify the filesystem. This works because for kernels not understanding indexed dir format, internal htree nodes appear just as empty directory entries. Index dir aware kernels then check the htree structure is still consistent before using the data. This all worked reasonably well until metadata checksums were introduced. The problem is that these effectively made DIR_INDEX only ro-compatible because internal htree nodes store checksums in a different place than normal directory blocks. Thus any modification ignorant to DIR_INDEX (or just clearing EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch and trigger kernel errors. So we have to be more careful when dealing with indexed directories on filesystems with checksumming enabled. 1) We just disallow loading any directory inodes with EXT4_INDEX_FL when DIR_INDEX is not enabled. This is harsh but it should be very rare (it means someone disabled DIR_INDEX on existing filesystem and didn't run e2fsck), e2fsck can fix the problem, and we don't want to answer the difficult question: "Should we rather corrupt the directory more or should we ignore that DIR_INDEX feature is not set?" 2) When we find out htree structure is corrupted (but the filesystem and the directory should in support htrees), we continue just ignoring htree information for reading but we refuse to add new entries to the directory to avoid corrupting it more. Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes") Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-01-18ext4: remove unnecessary ifdefs in htree_dirblock_to_tree()Eric Biggers1-4/+1
The ifdefs for CONFIG_FS_ENCRYPTION in htree_dirblock_to_tree() are unnecessary, as the called functions are already stubbed out when !CONFIG_FS_ENCRYPTION. Remove them. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191209213225.18477-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2019-12-26ext4: simulate various I/O and checksum errors when reading metadataTheodore Ts'o1-3/+8
This allows us to test various error handling code paths Link: https://lore.kernel.org/r/20191209012317.59398-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26ext4: save the error code which triggered an ext4_error() in the superblockTheodore Ts'o1-0/+4
This allows the cause of an ext4_error() report to be categorized based on whether it was triggered due to an I/O error, or an memory allocation error, or other possible causes. Most errors are caused by a detected file system inconsistency, so the default code stored in the superblock will be EXT4_ERR_EFSCORRUPTED. Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22ext4: fix unused-but-set-variable warning in ext4_add_entry()Yunfeng Ye1-1/+3
Warning is found when compile with "-Wunused-but-set-variable": fs/ext4/namei.c: In function ‘ext4_add_entry’: fs/ext4/namei.c:2167:23: warning: variable ‘sbi’ set but not used [-Wunused-but-set-variable] struct ext4_sb_info *sbi; ^~~ Fix this by moving the variable @sbi under CONFIG_UNICODE. Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/cb5eb904-224a-9701-c38f-cb23514b1fff@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-15ext4: fix ext4_empty_dir() for directories with holesJan Kara1-14/+18
Function ext4_empty_dir() doesn't correctly handle directories with holes and crashes on bh->b_data dereference when bh is NULL. Reorganize the loop to use 'offset' variable all the times instead of comparing pointers to current direntry with bh->b_data pointer. Also add more strict checking of '.' and '..' directory entries to avoid entering loop in possibly invalid state on corrupted filesystems. References: CVE-2019-19037 CC: stable@vger.kernel.org Fixes: 4e19d6b65fb4 ("ext4: allow directory holes") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-19ext4: work around deleting a file with i_nlink == 0 safelyTheodore Ts'o1-6/+5
If the file system is corrupted such that a file's i_links_count is too small, then it's possible that when unlinking that file, i_nlink will already be zero. Previously we were working around this kind of corruption by forcing i_nlink to one; but we were doing this before trying to delete the directory entry --- and if the file system is corrupted enough that ext4_delete_entry() fails, then we exit with i_nlink elevated, and this causes the orphan inode list handling to be FUBAR'ed, such that when we unmount the file system, the orphan inode list can get corrupted. A better way to fix this is to simply skip trying to call drop_nlink() if i_nlink is already zero, thus moving the check to the place where it makes the most sense. https://bugzilla.kernel.org/show_bug.cgi?id=205433 Link: https://lore.kernel.org/r/20191112032903.8828-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2019-11-05ext4: Do not iput inode under running transactionJan Kara1-6/+23
When ext4_mkdir(), ext4_symlink(), ext4_create(), or ext4_mknod() fail to add entry into directory, it ends up dropping freshly created inode under the running transaction and thus inode truncation happens under that transaction. That breaks assumptions that evict() does not get called from a transaction context and at least in ext4_symlink() case it can result in inode eviction deadlocking in inode_wait_for_writeback() when flush worker finds symlink inode, starts to write it back and blocks on starting a transaction. So change the code in ext4_mkdir() and ext4_add_nondir() to drop inode reference only after the transaction is stopped. We also have to add inode to the orphan list in that case as otherwise the inode would get leaked in case we crash before inode deletion is committed. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05ext4: Move marking of handle as sync to ext4_add_nondir()Jan Kara1-7/+3
Every caller of ext4_add_nondir() marks handle as sync if directory has DIRSYNC set. Move this marking to ext4_add_nondir() so reduce some duplication. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-03ext4: fix kernel oops caused by spurious casefold flagTheodore Ts'o1-2/+2
If an directory has the a casefold flag set without the casefold feature set, s_encoding will not be initialized, and this will cause the kernel to dereference a NULL pointer. In addition to adding checks to avoid these kernel oops, attempts to load inodes with the casefold flag when the casefold feature is not enable will cause the file system to be declared corrupted. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-07-03ext4: fix coverity warning on error path of filename setupGabriel Krisman Bertazi1-4/+9
Fix the following coverity warning reported by Dan Carpenter: fs/ext4/namei.c:1311 ext4_fname_setup_ci_filename() warn: 'cf_name->len' unsigned <= 0 Fixes: 3ae72562ad91 ("ext4: optimize case-insensitive lookups") Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
2019-06-22ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()Theodore Ts'o1-4/+4
Clean up namespace pollution by the inline_data code. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-21ext4: refactor initialize_dirent_tail()Theodore Ts'o1-33/+21
Move the calculation of the location of the dirent tail into initialize_dirent_tail(). Also prefix the function with ext4_ to fix kernel namepsace polution. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-21ext4: rename "dirent_csum" functions to use "dirblock"Theodore Ts'o1-33/+29
Functions such as ext4_dirent_csum_verify() and ext4_dirent_csum_set() don't actually operate on a directory entry, but a directory block. And while they take a struct ext4_dir_entry *dirent as an argument, it had better be the first directory at the beginning of the direct block, or things will go very wrong. Rename the following functions so that things make more sense, and remove a lot of confusing casts along the way: ext4_dirent_csum_verify -> ext4_dirblock_csum_verify ext4_dirent_csum_set -> ext4_dirblock_csum_set ext4_dirent_csum -> ext4_dirblock_csum ext4_handle_dirty_dirent_node -> ext4_handle_dirty_dirblock Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-21ext4: allow directory holesTheodore Ts'o1-8/+37
The largedir feature was intended to allow ext4 directories to have unmapped directory blocks (e.g., directory holes). And so the released e2fsprogs no longer enforces this for largedir file systems; however, the corresponding change to the kernel-side code was not made. This commit fixes this oversight. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2019-06-20ext4: optimize case-insensitive lookupsGabriel Krisman Bertazi1-5/+38
Temporarily cache a casefolded version of the file name under lookup in ext4_filename, to avoid repeatedly casefolding it. I got up to 30% speedup on lookups of large directories (>100k entries), depending on the length of the string under lookup. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-05-19Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Some bug fixes, and an update to the URL's for the final version of Unicode 12.1.0" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: avoid panic during forced reboot due to aborted journal ext4: fix block validity checks for journal inodes using indirect blocks unicode: update to Unicode 12.1.0 final unicode: add missing check for an error return from utf8lookup() ext4: fix miscellaneous sparse warnings ext4: unsigned int compared against zero ext4: fix use-after-free in dx_release() ext4: fix data corruption caused by overlapping unaligned and aligned IO jbd2: fix potential double free ext4: zero out the unused memory region in the extent tree block
2019-05-11ext4: fix use-after-free in dx_release()Sahitya Tummala1-1/+4
The buffer_head (frames[0].bh) and it's corresping page can be potentially free'd once brelse() is done inside the for loop but before the for loop exits in dx_release(). It can be free'd in another context, when the page cache is flushed via drop_caches_sysctl_handler(). This results into below data abort when accessing info->indirect_levels in dx_release(). Unable to handle kernel paging request at virtual address ffffffc17ac3e01e Call trace: dx_release+0x70/0x90 ext4_htree_fill_tree+0x2d4/0x300 ext4_readdir+0x244/0x6f8 iterate_dir+0xbc/0x160 SyS_getdents64+0x94/0x174 Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org
2019-05-08Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds1-24/+52
Pull fscrypt updates from Ted Ts'o: "Clean up fscrypt's dcache revalidation support, and other miscellaneous cleanups" * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: cache decrypted symlink target in ->i_link vfs: use READ_ONCE() to access ->i_link fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertext fscrypt: only set dentry_operations on ciphertext dentries fs, fscrypt: clear DCACHE_ENCRYPTED_NAME when unaliasing directory fscrypt: fix race allowing rename() and link() of ciphertext dentries fscrypt: clean up and improve dentry revalidation fscrypt: use READ_ONCE() to access ->i_crypt_info fscrypt: remove WARN_ON_ONCE() when decryption fails fscrypt: drop inode argument from fscrypt_get_ctx()
2019-04-25ext4: Support case-insensitive file name lookupsGabriel Krisman Bertazi1-11/+96
This patch implements the actual support for case-insensitive file name lookups in ext4, based on the feature bit and the encoding stored in the superblock. A filesystem that has the casefold feature set is able to configure directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups to succeed in that directory in a case-insensitive fashion, i.e: match a directory entry even if the name used by userspace is not a byte per byte match with the disk name, but is an equivalent case-insensitive version of the Unicode string. This operation is called a case-insensitive file name lookup. The feature is configured as an inode attribute applied to directories and inherited by its children. This attribute can only be enabled on empty directories for filesystems that support the encoding feature, thus preventing collision of file names that only differ by case. * dcache handling: For a +F directory, Ext4 only stores the first equivalent name dentry used in the dcache. This is done to prevent unintentional duplication of dentries in the dcache, while also allowing the VFS code to quickly find the right entry in the cache despite which equivalent string was used in a previous lookup, without having to resort to ->lookup(). d_hash() of casefolded directories is implemented as the hash of the casefolded string, such that we always have a well-known bucket for all the equivalencies of the same string. d_compare() uses the utf8_strncasecmp() infrastructure, which handles the comparison of equivalent, same case, names as well. For now, negative lookups are not inserted in the dcache, since they would need to be invalidated anyway, because we can't trust missing file dentries. This is bad for performance but requires some leveraging of the vfs layer to fix. We can live without that for now, and so does everyone else. * on-disk data: Despite using a specific version of the name as the internal representation within the dcache, the name stored and fetched from the disk is a byte-per-byte match with what the user requested, making this implementation 'name-preserving'. i.e. no actual information is lost when writing to storage. DX is supported by modifying the hashes used in +F directories to make them case/encoding-aware. The new disk hashes are calculated as the hash of the full casefolded string, instead of the string directly. This allows us to efficiently search for file names in the htree without requiring the user to provide an exact name. * Dealing with invalid sequences: By default, when a invalid UTF-8 sequence is identified, ext4 will treat it as an opaque byte sequence, ignoring the encoding and reverting to the old behavior for that unique file. This means that case-insensitive file name lookup will not work only for that file. An optional bit can be set in the superblock telling the filesystem code and userspace tools to enforce the encoding. When that optional bit is set, any attempt to create a file name using an invalid UTF-8 sequence will fail and return an error to userspace. * Normalization algorithm: The UTF-8 algorithms used to compare strings in ext4 is implemented lives in fs/unicode, and is based on a previous version developed by SGI. It implements the Canonical decomposition (NFD) algorithm described by the Unicode specification 12.1, or higher, combined with the elimination of ignorable code points (NFDi) and full case-folding (CF) as documented in fs/unicode/utf8_norm.c. NFD seems to be the best normalization method for EXT4 because: - It has a lower cost than NFC/NFKC (which requires decomposing to NFD as an intermediary step) - It doesn't eliminate important semantic meaning like compatibility decompositions. Although: - This implementation is not completely linguistic accurate, because different languages have conflicting rules, which would require the specialization of the filesystem to a given locale, which brings all sorts of problems for removable media and for users who use more than one language. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-17fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertextEric Biggers1-24/+52
->lookup() in an encrypted directory begins as follows: 1. fscrypt_prepare_lookup(): a. Try to load the directory's encryption key. b. If the key is unavailable, mark the dentry as a ciphertext name via d_flags. 2. fscrypt_setup_filename(): a. Try to load the directory's encryption key. b. If the key is available, encrypt the name (treated as a plaintext name) to get the on-disk name. Otherwise decode the name (treated as a ciphertext name) to get the on-disk name. But if the key is concurrently added, it may be found at (2a) but not at (1a). In this case, the dentry will be wrongly marked as a ciphertext name even though it was actually treated as plaintext. This will cause the dentry to be wrongly invalidated on the next lookup, potentially causing problems. For example, if the racy ->lookup() was part of sys_mount(), then the new mount will be detached when anything tries to access it. This is despite the mountpoint having a plaintext path, which should remain valid now that the key was added. Of course, this is only possible if there's a userspace race. Still, the additional kernel-side race is confusing and unexpected. Close the kernel-side race by changing fscrypt_prepare_lookup() to also set the on-disk filename (step 2b), consistent with the d_flags update. Fixes: 28b4c263961c ("ext4 crypto: revalidate dentry after adding or removing the key") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-01-24fscrypt: remove filesystem specific build config optionChandan Rajendra1-5/+5
In order to have a common code base for fscrypt "post read" processing for all filesystems which support encryption, this commit removes filesystem specific build config option (e.g. CONFIG_EXT4_FS_ENCRYPTION) and replaces it with a build option (i.e. CONFIG_FS_ENCRYPTION) whose value affects all the filesystems making use of fscrypt. Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-01-24ext4: use IS_ENCRYPTED() to check encryption statusChandan Rajendra1-4/+4
This commit removes the ext4 specific ext4_encrypted_inode() and makes use of the generic IS_ENCRYPTED() macro to check for the encryption status of an inode. Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
2018-12-19ext4: avoid declaring fs inconsistent due to invalid file handlesTheodore Ts'o1-2/+2
If we receive a file handle, either from NFS or open_by_handle_at(2), and it points at an inode which has not been initialized, and the file system has metadata checksums enabled, we shouldn't try to get the inode, discover the checksum is invalid, and then declare the file system as being inconsistent. This can be reproduced by creating a test file system via "mke2fs -t ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that directory, and then running the following program. #define _GNU_SOURCE #include <fcntl.h> struct handle { struct file_handle fh; unsigned char fid[MAX_HANDLE_SZ]; }; int main(int argc, char **argv) { struct handle h = {{8, 1 }, { 12, }}; open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY); return 0; } Google-Bug-Id: 120690101 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org