summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2021-11-17erofs: don't trigger WARN() when decompression failsGao Xiang1-1/+0
[ Upstream commit a0961f351d82d43ab0b845304caa235dfe249ae9 ] syzbot reported a WARNING [1] due to corrupted compressed data. As Dmitry said, "If this is not a kernel bug, then the code should not use WARN. WARN if for kernel bugs and is recognized as such by all testing systems and humans." [1] https://lore.kernel.org/r/000000000000b3586105cf0ff45e@google.com Link: https://lore.kernel.org/r/20211025074311.130395-1-hsiangkao@linux.alibaba.com Cc: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Reported-by: syzbot+d8aaffc3719597e8cfb4@syzkaller.appspotmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17btrfs: do not take the uuid_mutex in btrfs_rm_deviceJosef Bacik1-5/+5
[ Upstream commit 8ef9dc0f14ba6124c62547a4fdc59b163d8b864e ] We got the following lockdep splat while running fstests (specifically btrfs/003 and btrfs/020 in a row) with the new rc. This was uncovered by 87579e9b7d8d ("loop: use worker per cgroup instead of kworker") which converted loop to using workqueues, which comes with lockdep annotations that don't exist with kworkers. The lockdep splat is as follows: WARNING: possible circular locking dependency detected 5.14.0-rc2-custom+ #34 Not tainted ------------------------------------------------------ losetup/156417 is trying to acquire lock: ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600 but task is already holding lock: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (&lo->lo_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 lo_open+0x28/0x60 [loop] blkdev_get_whole+0x28/0xf0 blkdev_get_by_dev.part.0+0x168/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x163/0x3a0 path_openat+0x74d/0xa40 do_filp_open+0x9c/0x140 do_sys_openat2+0xb1/0x170 __x64_sys_openat+0x54/0x90 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #4 (&disk->open_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 blkdev_get_by_dev.part.0+0xd1/0x3c0 blkdev_get_by_path+0xc0/0xd0 btrfs_scan_one_device+0x52/0x1f0 [btrfs] btrfs_control_ioctl+0xac/0x170 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (uuid_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 btrfs_rm_device+0x48/0x6a0 [btrfs] btrfs_ioctl+0x2d1c/0x3110 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#11){.+.+}-{0:0}: lo_write_bvec+0x112/0x290 [loop] loop_process_work+0x25f/0xcb0 [loop] process_one_work+0x28f/0x5d0 worker_thread+0x55/0x3c0 kthread+0x140/0x170 ret_from_fork+0x22/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: process_one_work+0x266/0x5d0 worker_thread+0x55/0x3c0 kthread+0x140/0x170 ret_from_fork+0x22/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: __lock_acquire+0x1130/0x1dc0 lock_acquire+0xf5/0x320 flush_workqueue+0xae/0x600 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x650 [loop] lo_ioctl+0x29d/0x780 [loop] block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 1 lock held by losetup/156417: #0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop] stack backtrace: CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack_lvl+0x57/0x72 check_noncircular+0x10a/0x120 __lock_acquire+0x1130/0x1dc0 lock_acquire+0xf5/0x320 ? flush_workqueue+0x84/0x600 flush_workqueue+0xae/0x600 ? flush_workqueue+0x84/0x600 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x650 [loop] lo_ioctl+0x29d/0x780 [loop] ? __lock_acquire+0x3a0/0x1dc0 ? update_dl_rq_load_avg+0x152/0x360 ? lock_is_held_type+0xa5/0x120 ? find_held_lock.constprop.0+0x2b/0x80 block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f645884de6b Usually the uuid_mutex exists to protect the fs_devices that map together all of the devices that match a specific uuid. In rm_device we're messing with the uuid of a device, so it makes sense to protect that here. However in doing that it pulls in a whole host of lockdep dependencies, as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus we end up with the dependency chain under the uuid_mutex being added under the normal sb write dependency chain, which causes problems with loop devices. We don't need the uuid mutex here however. If we call btrfs_scan_one_device() before we scratch the super block we will find the fs_devices and not find the device itself and return EBUSY because the fs_devices is open. If we call it after the scratch happens it will not appear to be a valid btrfs file system. We do not need to worry about other fs_devices modifying operations here because we're protected by the exclusive operations locking. So drop the uuid_mutex here in order to fix the lockdep splat. A more detailed explanation from the discussion: We are worried about rm and scan racing with each other, before this change we'll zero the device out under the UUID mutex so when scan does run it'll make sure that it can go through the whole device scan thing without rm messing with us. We aren't worried if the scratch happens first, because the result is we don't think this is a btrfs device and we bail out. The only case we are concerned with is we scratch _after_ scan is able to read the superblock and gets a seemingly valid super block, so lets consider this case. Scan will call device_list_add() with the device we're removing. We'll call find_fsid_with_metadata_uuid() and get our fs_devices for this UUID. At this point we lock the fs_devices->device_list_mutex. This is what protects us in this case, but we have two cases here. 1. We aren't to the device removal part of the RM. We found our device, and device name matches our path, we go down and we set total_devices to our super number of devices, which doesn't affect anything because we haven't done the remove yet. 2. We are past the device removal part, which is protected by the device_list_mutex. Scan doesn't find the device, it goes down and does the if (fs_devices->opened) return -EBUSY; check and we bail out. Nothing about this situation is ideal, but the lockdep splat is real, and the fix is safe, tho admittedly a bit scary looking. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy more from the discussion ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17btrfs: reflink: initialize return value to 0 in btrfs_extent_same()Sidong Yang1-1/+1
[ Upstream commit 44bee215f72f13874c0e734a0712c2e3264c0108 ] Fix a warning reported by smatch that ret could be returned without initialized. The dedupe operations are supposed to to return 0 for a 0 length range but the caller does not pass olen == 0. To keep this behaviour and also fix the warning initialize ret to 0. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sidong Yang <realwakka@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17gfs2: Fix glock_hash_walk bugsAndreas Gruenbacher1-10/+12
[ Upstream commit 7427f3bb49d81525b7dd1d0f7c5f6bbc752e6f0e ] So far, glock_hash_walk took a reference on each glock it iterated over, and it was the examiner's responsibility to drop those references. Dropping the final reference to a glock can sleep and the examiners are called in a RCU critical section with spin locks held, so examiners that didn't need the extra reference had to drop it asynchronously via gfs2_glock_queue_put or similar. This wasn't done correctly in thaw_glock which did call gfs2_glock_put, and not at all in dump_glock_func. Change glock_hash_walk to not take glock references at all. That way, the examiners that don't need them won't have to bother with slow asynchronous puts, and the examiners that do need references can take them themselves. Reported-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17gfs2: Cancel remote delete work asynchronouslyAndreas Gruenbacher1-1/+1
[ Upstream commit 486408d690e130c3adacf816754b97558d715f46 ] In gfs2_inode_lookup and gfs2_create_inode, we're calling gfs2_cancel_delete_work which currently cancels any remote delete work (delete_work_func) synchronously. This means that if the work is currently running, it will wait for it to finish. We're doing this to pevent a previous instance of an inode from having any influence on the next instance. However, delete_work_func uses gfs2_inode_lookup internally, and we can end up in a deadlock when delete_work_func gets interrupted at the wrong time. For example, (1) An inode's iopen glock has delete work queued, but the inode itself has been evicted from the inode cache. (2) The delete work is preempted before reaching gfs2_inode_lookup. (3) Another process recreates the inode (gfs2_create_inode). It tries to cancel any outstanding delete work, which blocks waiting for the ongoing delete work to finish. (4) The delete work calls gfs2_inode_lookup, which blocks waiting for gfs2_create_inode to instantiate and unlock the new inode => deadlock. It turns out that when the delete work notices that its inode has been re-instantiated, it will do nothing. This means that it's safe to cancel the delete work asynchronously. This prevents the kind of deadlock described above. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17tracefs: Have tracefs directories not set OTH permission bits by defaultSteven Rostedt (VMware)1-1/+2
[ Upstream commit 49d67e445742bbcb03106b735b2ab39f6e5c56bc ] The tracefs file system is by default mounted such that only root user can access it. But there are legitimate reasons to create a group and allow those added to the group to have access to tracing. By changing the permissions of the tracefs mount point to allow access, it will allow group access to the tracefs directory. There should not be any real reason to allow all access to the tracefs directory as it contains sensitive information. Have the default permission of directories being created not have any OTH (other) bits set, such that an admin that wants to give permission to a group has to first disable all OTH bits in the file system. Link: https://lkml.kernel.org/r/20210818153038.664127804@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17fs/proc/uptime.c: Fix idle time reporting in /proc/uptimeJosh Don2-7/+11
[ Upstream commit a130e8fbc7de796eb6e680724d87f4737a26d0ac ] /proc/uptime reports idle time by reading the CPUTIME_IDLE field from the per-cpu kcpustats. However, on NO_HZ systems, idle time is not continually updated on idle cpus, leading this value to appear incorrectly small. /proc/stat performs an accounting update when reading idle time; we can use the same approach for uptime. With this patch, /proc/stat and /proc/uptime now agree on idle time. Additionally, the following shows idle time tick up consistently on an idle machine: (while true; do cat /proc/uptime; sleep 1; done) | awk '{print $2-prev; prev=$2}' Reported-by: Luigi Rizzo <lrizzo@google.com> Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lkml.kernel.org/r/20210827165438.3280779-1-joshdon@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17fscrypt: allow 256-bit master keys with AES-256-XTSEric Biggers3-17/+56
[ Upstream commit 7f595d6a6cdc336834552069a2e0a4f6d4756ddf ] fscrypt currently requires a 512-bit master key when AES-256-XTS is used, since AES-256-XTS keys are 512-bit and fscrypt requires that the master key be at least as long any key that will be derived from it. However, this is overly strict because AES-256-XTS doesn't actually have a 512-bit security strength, but rather 256-bit. The fact that XTS takes twice the expected key size is a quirk of the XTS mode. It is sufficient to use 256 bits of entropy for AES-256-XTS, provided that it is first properly expanded into a 512-bit key, which HKDF-SHA512 does. Therefore, relax the check of the master key size to use the security strength of the derived key rather than the size of the derived key (except for v1 encryption policies, which don't use HKDF). Besides making things more flexible for userspace, this is needed in order for the use of a KDF which only takes a 256-bit key to be introduced into the fscrypt key hierarchy. This will happen with hardware-wrapped keys support, as all known hardware which supports that feature uses an SP800-108 KDF using AES-256-CMAC, so the wrapped keys are wrapped 256-bit AES keys. Moreover, there is interest in fscrypt supporting the same type of AES-256-CMAC based KDF in software as an alternative to HKDF-SHA512. There is no security problem with such features, so fix the key length check to work properly with them. Reviewed-by: Paul Crowley <paulcrowley@google.com> Link: https://lore.kernel.org/r/20210921030303.5598-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17cifs: set a minimum of 120s for next dns resolutionPaulo Alcantara2-2/+3
commit 4ac0536f8874a903a72bddc57eb88db774261e3a upstream. With commit 506c1da44fee ("cifs: use the expiry output of dns_query to schedule next resolution") and after triggering the first reconnect, the next async dns resolution of tcp server's hostname would be scheduled based on dns_resolver's key expiry default, which happens to default to 5s on most systems that use key.dns_resolver for upcall. As per key.dns_resolver.conf(5): default_ttl=<number> The number of seconds to set as the expiration on a cached record. This will be overridden if the program manages to re- trieve TTL information along with the addresses (if, for exam- ple, it accesses the DNS directly). The default is 5 seconds. The value must be in the range 1 to INT_MAX. Make the next async dns resolution no shorter than 120s as we do not want to be upcalling too often. Cc: stable@vger.kernel.org Fixes: 506c1da44fee ("cifs: use the expiry output of dns_query to schedule next resolution") Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17cifs: To match file servers, make sure the server hostname matchesShyam Prasad N3-8/+20
commit 7be3248f313930ff3d3436d4e9ddbe9fccc1f541 upstream. We generally rely on a bunch of factors to differentiate between servers. For example, IP address, port etc. For certain server types (like Azure), it is important to make sure that the server hostname matches too, even if the both hostnames currently resolve to the same IP address. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17quota: correct error number in free_dqentry()Zhang Yi1-0/+1
commit d0e36a62bd4c60c09acc40e06ba4831a4d0bc75b upstream. Fix the error path in free_dqentry(), pass out the error number if the block to free is not correct. Fixes: 1ccd14b9c271 ("quota: Split off quota tree handling into a separate file") Link: https://lore.kernel.org/r/20211008093821.1001186-3-yi.zhang@huawei.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17quota: check block number when reading the block in quota fileZhang Yi1-0/+14
commit 9bf3d20331295b1ecb81f4ed9ef358c51699a050 upstream. The block number in the quota tree on disk should be smaller than the v2_disk_dqinfo.dqi_blocks. If the quota file was corrupted, we may be allocating an 'allocated' block and that would lead to a loop in a tree, which will probably trigger oops later. This patch adds a check for the block number in the quota tree to prevent such potential issue. Link: https://lore.kernel.org/r/20211008093821.1001186-2-yi.zhang@huawei.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17ovl: fix use after free in struct ovl_aio_reqyangerkun1-2/+14
commit 9a254403760041528bc8f69fe2f5e1ef86950991 upstream. Example for triggering use after free in a overlay on ext4 setup: aio_read ovl_read_iter vfs_iter_read ext4_file_read_iter ext4_dio_read_iter iomap_dio_rw -> -EIOCBQUEUED /* * Here IO is completed in a separate thread, * ovl_aio_cleanup_handler() frees aio_req which has iocb embedded */ file_accessed(iocb->ki_filp); /**BOOM**/ Fix by introducing a refcount in ovl_aio_req similarly to aio_kiocb. This guarantees that iocb is only freed after vfs_read/write_iter() returns on underlying fs. Fixes: 2406a307ac7d ("ovl: implement async IO routines") Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20210930032228.3199690-3-yangerkun@huawei.com/ Cc: <stable@vger.kernel.org> # v5.6 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17btrfs: call btrfs_check_rw_degradable only if there is a missing deviceAnand Jain1-1/+2
commit 5c78a5e7aa835c4f08a7c90fe02d19f95a776f29 upstream. In open_ctree() in btrfs_check_rw_degradable() [1], we check each block group individually if at least the minimum number of devices is available for that profile. If all the devices are available, then we don't have to check degradable. [1] open_ctree() :: 3559 if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) { Also before calling btrfs_check_rw_degradable() in open_ctee() at the line number shown below [2] we call btrfs_read_chunk_tree() and down to add_missing_dev() to record number of missing devices. [2] open_ctree() :: 3454 ret = btrfs_read_chunk_tree(fs_info); btrfs_read_chunk_tree() read_one_chunk() / read_one_dev() add_missing_dev() So, check if there is any missing device before btrfs_check_rw_degradable() in open_ctree(). Also, with this the mount command could save ~16ms.[3] in the most common case, that is no device is missing. [3] 1) * 16934.96 us | btrfs_check_rw_degradable [btrfs](); CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17btrfs: fix lost error handling when replaying directory deletesFilipe Manana1-1/+3
commit 10adb1152d957a4d570ad630f93a88bb961616c1 upstream. At replay_dir_deletes(), if find_dir_range() returns an error we break out of the main while loop and then assign a value of 0 (success) to the 'ret' variable, resulting in completely ignoring that an error happened. Fix that by jumping to the 'out' label when find_dir_range() returns an error (negative value). CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17btrfs: clear MISSING device status bit in btrfs_close_one_deviceLi Zhang1-1/+3
commit 5d03dbebba2594d2e6fbf3b5dd9060c5a835de3b upstream. Reported bug: https://github.com/kdave/btrfs-progs/issues/389 There's a problem with scrub reporting aborted status but returning error code 0, on a filesystem with missing and readded device. Roughly these steps: - mkfs -d raid1 dev1 dev2 - fill with data - unmount - make dev1 disappear - mount -o degraded - copy more data - make dev1 appear again Running scrub afterwards reports that the command was aborted, but the system log message says the exit code was 0. It seems that the cause of the error is decrementing fs_devices->missing_devices but not clearing device->dev_state. Every time we umount filesystem, it would call close_ctree, And it would eventually involve btrfs_close_one_device to close the device, but it only decrements fs_devices->missing_devices but does not clear the device BTRFS_DEV_STATE_MISSING bit. Worse, this bug will cause Integer Overflow, because every time umount, fs_devices->missing_devices will decrease. If fs_devices->missing_devices value hit 0, it would overflow. With added debugging: loop1: detected capacity change from 0 to 20971520 BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311) loop2: detected capacity change from 0 to 20971520 BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313) BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 0 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 18446744073709551615 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 18446744073709551615 If fs_devices->missing_devices is 0, next time it would be 18446744073709551615 After apply this patch, the fs_devices->missing_devices seems to be right: $ truncate -s 10g test1 $ truncate -s 10g test2 $ losetup /dev/loop1 test1 $ losetup /dev/loop2 test2 $ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f $ losetup -d /dev/loop2 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ dmesg loop1: detected capacity change from 0 to 20971520 loop2: detected capacity change from 0 to 20971520 BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863) BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863) BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): checking UUID tree BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Li Zhang <zhanglikernel@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17fuse: fix page stealingMiklos Szeredi1-2/+12
commit 712a951025c0667ff00b25afc360f74e639dfabe upstream. It is possible to trigger a crash by splicing anon pipe bufs to the fuse device. The reason for this is that anon_pipe_buf_release() will reuse buf->page if the refcount is 1, but that page might have already been stolen and its flags modified (e.g. PG_lru added). This happens in the unlikely case of fuse_dev_splice_write() getting around to calling pipe_buf_release() after a page has been stolen, added to the page cache and removed from the page cache. Fix by calling pipe_buf_release() right after the page was inserted into the page cache. In this case the page has an elevated refcount so any release function will know that the page isn't reusable. Reported-by: Frank Dinoff <fdinoff@google.com> Link: https://lore.kernel.org/r/CAAmZXrsGg2xsP1CK+cbuEMumtrqdvD-NKnWzhNcvn71RV3c1yw@mail.gmail.com/ Fixes: dd3bb14f44a6 ("fuse: support splice() writing to fuse device") Cc: <stable@vger.kernel.org> # v2.6.35 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17ext4: refresh the ext4_ext_path struct after dropping i_data_sem.yangerkun1-1/+13
commit 1811bc401aa58c7bdb0df3205aa6613b49d32127 upstream. After we drop i_data sem, we need to reload the ext4_ext_path structure since the extent tree can change once i_data_sem is released. This addresses the BUG: [52117.465187] ------------[ cut here ]------------ [52117.465686] kernel BUG at fs/ext4/extents.c:1756! ... [52117.478306] Call Trace: [52117.478565] ext4_ext_shift_extents+0x3ee/0x710 [52117.479020] ext4_fallocate+0x139c/0x1b40 [52117.479405] ? __do_sys_newfstat+0x6b/0x80 [52117.479805] vfs_fallocate+0x151/0x4b0 [52117.480177] ksys_fallocate+0x4a/0xa0 [52117.480533] __x64_sys_fallocate+0x22/0x30 [52117.480930] do_syscall_64+0x35/0x80 [52117.481277] entry_SYSCALL_64_after_hwframe+0x44/0xae [52117.481769] RIP: 0033:0x7fa062f855ca Cc: stable@kernel.org Link: https://lore.kernel.org/r/20210903062748.4118886-4-yangerkun@huawei.com Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17ext4: ensure enough credits in ext4_ext_shift_path_extentsyangerkun1-34/+15
commit 4268496e48dc681cfa53b92357314b5d7221e625 upstream. Like ext4_ext_rm_leaf, we can ensure that there are enough credits before every call that will consume credits. As part of this fix we fold the functionality of ext4_access_path() into ext4_ext_shift_path_extents(). This change is needed as a preparation for the next bugfix patch. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20210903062748.4118886-3-yangerkun@huawei.com Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17ext4: fix lazy initialization next schedule time computation in more ↵Shaoying Xu1-5/+4
granular unit commit 39fec6889d15a658c3a3ebb06fd69d3584ddffd3 upstream. Ext4 file system has default lazy inode table initialization setup once it is mounted. However, it has issue on computing the next schedule time that makes the timeout same amount in jiffies but different real time in secs if with various HZ values. Therefore, fix by measuring the current time in a more granular unit nanoseconds and make the next schedule time independent of the HZ value. Fixes: bfff68738f1c ("ext4: add support for lazy inode table initialization") Signed-off-by: Shaoying Xu <shaoyi@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210902164412.9994-2-shaoyi@amazon.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17exfat: fix incorrect loading of i_blocks for large filesSungjong Seo1-1/+1
commit 0c336d6e33f4bedc443404c89f43c91c8bd9ee11 upstream. When calculating i_blocks, there was a mistake that was masked with a 32-bit variable. So i_blocks for files larger than 4 GiB had incorrect values. Mask with a 64-bit variable instead of 32-bit one. Fixes: 5f2aa075070c ("exfat: add inode operations") Cc: stable@vger.kernel.org # v5.7+ Reported-by: Ganapathi Kamath <hgkamath@hotmail.com> Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17ocfs2: fix data corruption on truncateJan Kara1-2/+6
commit 839b63860eb3835da165642923120d305925561d upstream. Patch series "ocfs2: Truncate data corruption fix". As further testing has shown, commit 5314454ea3f ("ocfs2: fix data corruption after conversion from inline format") didn't fix all the data corruption issues the customer started observing after 6dbf7bb55598 ("fs: Don't invalidate page buffers in block_write_full_page()") This time I have tracked them down to two bugs in ocfs2 truncation code. One bug (truncating page cache before clearing tail cluster and setting i_size) could cause data corruption even before 6dbf7bb55598, but before that commit it needed a race with page fault, after 6dbf7bb55598 it started to be pretty deterministic. Another bug (zeroing pages beyond old i_size) used to be harmless inefficiency before commit 6dbf7bb55598. But after commit 6dbf7bb55598 in combination with the first bug it resulted in deterministic data corruption. Although fixing only the first problem is needed to stop data corruption, I've fixed both issues to make the code more robust. This patch (of 2): ocfs2_truncate_file() did unmap invalidate page cache pages before zeroing partial tail cluster and setting i_size. Thus some pages could be left (and likely have left if the cluster zeroing happened) in the page cache beyond i_size after truncate finished letting user possibly see stale data once the file was extended again. Also the tail cluster zeroing was not guaranteed to finish before truncate finished causing possible stale data exposure. The problem started to be particularly easy to hit after commit 6dbf7bb55598 "fs: Don't invalidate page buffers in block_write_full_page()" stopped invalidation of pages beyond i_size from page writeback path. Fix these problems by unmapping and invalidating pages in the page cache after the i_size is reduced and tail cluster is zeroed out. Link: https://lkml.kernel.org/r/20211025150008.29002-1-jack@suse.cz Link: https://lkml.kernel.org/r/20211025151332.11301-1-jack@suse.cz Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-12isofs: Fix out of bound access for corrupted isofs imageJan Kara1-0/+2
commit e96a1866b40570b5950cda8602c2819189c62a48 upstream. When isofs image is suitably corrupted isofs_read_inode() can read data beyond the end of buffer. Sanity-check the directory entry length before using it. Reported-and-tested-by: syzbot+6fc7fb214625d82af7d1@syzkaller.appspotmail.com CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-12Revert "proc/wchan: use printk format instead of lookup_symbol_name()"Kees Cook1-8/+11
commit 54354c6a9f7fd5572d2b9ec108117c4f376d4d23 upstream. This reverts commit 152c432b128cb043fc107e8f211195fe94b2159c. When a kernel address couldn't be symbolized for /proc/$pid/wchan, it would leak the raw value, a potential information exposure. This is a regression compared to the safer pre-v5.12 behavior. Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Vito Caputo <vcaputo@pengaru.com> Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20211008111626.090829198@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-02ocfs2: fix race between searching chunks and release journal_head from ↵Gautham Ananthakrishna1-9/+13
buffer_head commit 6f1b228529ae49b0f85ab89bcdb6c365df401558 upstream. Encountered a race between ocfs2_test_bg_bit_allocatable() and jbd2_journal_put_journal_head() resulting in the below vmcore. PID: 106879 TASK: ffff880244ba9c00 CPU: 2 COMMAND: "loop3" Call trace: panic oops_end no_context __bad_area_nosemaphore bad_area_nosemaphore __do_page_fault do_page_fault page_fault [exception RIP: ocfs2_block_group_find_clear_bits+316] ocfs2_block_group_find_clear_bits [ocfs2] ocfs2_cluster_group_search [ocfs2] ocfs2_search_chain [ocfs2] ocfs2_claim_suballoc_bits [ocfs2] __ocfs2_claim_clusters [ocfs2] ocfs2_claim_clusters [ocfs2] ocfs2_local_alloc_slide_window [ocfs2] ocfs2_reserve_local_alloc_bits [ocfs2] ocfs2_reserve_clusters_with_limit [ocfs2] ocfs2_reserve_clusters [ocfs2] ocfs2_lock_refcount_allocators [ocfs2] ocfs2_make_clusters_writable [ocfs2] ocfs2_replace_cow [ocfs2] ocfs2_refcount_cow [ocfs2] ocfs2_file_write_iter [ocfs2] lo_rw_aio loop_queue_work kthread_worker_fn kthread ret_from_fork When ocfs2_test_bg_bit_allocatable() called bh2jh(bg_bh), the bg_bh->b_private NULL as jbd2_journal_put_journal_head() raced and released the jounal head from the buffer head. Needed to take bit lock for the bit 'BH_JournalHead' to fix this race. Link: https://lkml.kernel.org/r/1634820718-6043-1-git-send-email-gautham.ananthakrishna@oracle.com Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: <rajesh.sivaramasubramaniom@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-27autofs: fix wait name hash calculation in autofs_wait()Ian Kent1-1/+1
commit 25f54d08f12feb593e62cc2193fedefaf7825301 upstream. There's a mistake in commit 2be7828c9fefc ("get rid of autofs_getpath()") that affects kernels from v5.13.0, basically missed because of me not fully testing the change for Al. The problem is that the hash calculation for the wait name qstr hasn't been updated to account for the change to use dentry_path_raw(). This prevents the correct matching an existing wait resulting in multiple notifications being sent to the daemon for the same mount which must not occur. The problem wasn't discovered earlier because it only occurs when multiple processes trigger a request for the same mount concurrently so it only shows up in more aggressive testing. Fixes: 2be7828c9fefc ("get rid of autofs_getpath()") Cc: stable@vger.kernel.org Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-27btrfs: deal with errors when checking if a dir entry exists during log replayFilipe Manana1-18/+29
[ Upstream commit 77a5b9e3d14cbce49ceed2766b2003c034c066dc ] Currently inode_in_dir() ignores errors returned from btrfs_lookup_dir_index_item() and from btrfs_lookup_dir_item(), treating any errors as if the directory entry does not exists in the fs/subvolume tree, which is obviously not correct, as we can get errors such as -EIO when reading extent buffers while searching the fs/subvolume's tree. Fix that by making inode_in_dir() return the errors and making its only caller, add_inode_ref(), deal with returned errors as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-27vfs: check fd has read access in kernel_read_file_from_fd()Matthew Wilcox (Oracle)1-1/+1
commit 032146cda85566abcd1c4884d9d23e4e30a07e9a upstream. If we open a file without read access and then pass the fd to a syscall whose implementation calls kernel_read_file_from_fd(), we get a warning from __kernel_read(): if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ))) This currently affects both finit_module() and kexec_file_load(), but it could affect other syscalls in the future. Link: https://lkml.kernel.org/r/20211007220110.600005-1-willy@infradead.org Fixes: b844f0ecbc56 ("vfs: define kernel_copy_file_from_fd()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Hao Sun <sunhao.th@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-27userfaultfd: fix a race between writeprotect and exit_mmap()Nadav Amit1-3/+9
commit cb185d5f1ebf900f4ae3bf84cee212e6dd035aca upstream. A race is possible when a process exits, its VMAs are removed by exit_mmap() and at the same time userfaultfd_writeprotect() is called. The race was detected by KASAN on a development kernel, but it appears to be possible on vanilla kernels as well. Use mmget_not_zero() to prevent the race as done in other userfaultfd operations. Link: https://lkml.kernel.org/r/20210921200247.25749-1-namit@vmware.com Fixes: 63b2d4174c4ad ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl") Signed-off-by: Nadav Amit <namit@vmware.com> Tested-by: Li Wang <liwang@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-27ocfs2: mount fails with buffer overflow in strlenValentin Vidic1-4/+10
commit b15fa9224e6e1239414525d8d556d824701849fc upstream. Starting with kernel 5.11 built with CONFIG_FORTIFY_SOURCE mouting an ocfs2 filesystem with either o2cb or pcmk cluster stack fails with the trace below. Problem seems to be that strings for cluster stack and cluster name are not guaranteed to be null terminated in the disk representation, while strlcpy assumes that the source string is always null terminated. This causes a read outside of the source string triggering the buffer overflow detection. detected buffer overflow in strlen ------------[ cut here ]------------ kernel BUG at lib/string.c:1149! invalid opcode: 0000 [#1] SMP PTI CPU: 1 PID: 910 Comm: mount.ocfs2 Not tainted 5.14.0-1-amd64 #1 Debian 5.14.6-2 RIP: 0010:fortify_panic+0xf/0x11 ... Call Trace: ocfs2_initialize_super.isra.0.cold+0xc/0x18 [ocfs2] ocfs2_fill_super+0x359/0x19b0 [ocfs2] mount_bdev+0x185/0x1b0 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x454/0xa20 __x64_sys_mount+0x103/0x140 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae Link: https://lkml.kernel.org/r/20210929180654.32460-1-vvidic@valentin-vidic.from.hr Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-27ocfs2: fix data corruption after conversion from inline formatJan Kara1-34/+12
commit 5314454ea3ff6fc746eaf71b9a7ceebed52888fa upstream. Commit 6dbf7bb55598 ("fs: Don't invalidate page buffers in block_write_full_page()") uncovered a latent bug in ocfs2 conversion from inline inode format to a normal inode format. The code in ocfs2_convert_inline_data_to_extents() attempts to zero out the whole cluster allocated for file data by grabbing, zeroing, and dirtying all pages covering this cluster. However these pages are beyond i_size, thus writeback code generally ignores these dirty pages and no blocks were ever actually zeroed on the disk. This oversight was fixed by commit 693c241a5f6a ("ocfs2: No need to zero pages past i_size.") for standard ocfs2 write path, inline conversion path was apparently forgotten; the commit log also has a reasoning why the zeroing actually is not needed. After commit 6dbf7bb55598, things became worse as writeback code stopped invalidating buffers on pages beyond i_size and thus these pages end up with clean PageDirty bit but with buffers attached to these pages being still dirty. So when a file is converted from inline format, then writeback triggers, and then the file is grown so that these pages become valid, the invalid dirtiness state is preserved, mark_buffer_dirty() does nothing on these pages (buffers are already dirty) but page is never written back because it is clean. So data written to these pages is lost once pages are reclaimed. Simple reproducer for the problem is: xfs_io -f -c "pwrite 0 2000" -c "pwrite 2000 2000" -c "fsync" \ -c "pwrite 4000 2000" ocfs2_file After unmounting and mounting the fs again, you can observe that end of 'ocfs2_file' has lost its contents. Fix the problem by not doing the pointless zeroing during conversion from inline format similarly as in the standard write path. [akpm@linux-foundation.org: fix whitespace, per Joseph] Link: https://lkml.kernel.org/r/20210930095405.21433-1-jack@suse.cz Fixes: 6dbf7bb55598 ("fs: Don't invalidate page buffers in block_write_full_page()") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: "Markov, Andrey" <Markov.Andrey@Dell.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-27ceph: fix handling of "meta" errorsJeff Layton5-27/+8
commit 1bd85aa65d0e7b5e4d09240f492f37c569fdd431 upstream. Currently, we check the wb_err too early for directories, before all of the unsafe child requests have been waited on. In order to fix that we need to check the mapping->wb_err later nearer to the end of ceph_fsync. We also have an overly-complex method for tracking errors after blocklisting. The errors recorded in cleanup_session_requests go to a completely separate field in the inode, but we end up reporting them the same way we would for any other error (in fsync). There's no real benefit to tracking these errors in two different places, since the only reporting mechanism for them is in fsync, and we'd need to advance them both every time. Given that, we can just remove i_meta_err, and convert the places that used it to instead just use mapping->wb_err instead. That also fixes the original problem by ensuring that we do a check_and_advance of the wb_err at the end of the fsync op. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/52864 Reported-by: Patrick Donnelly <pdonnell@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-27ceph: skip existing superblocks that are blocklisted or shut down when mountingJeff Layton1-3/+14
commit 98d0a6fb7303a6f4a120b8b8ed05b86ff5db53e8 upstream. Currently when mounting, we may end up finding an existing superblock that corresponds to a blocklisted MDS client. This means that the new mount ends up being unusable. If we've found an existing superblock with a client that is already blocklisted, and the client is not configured to recover on its own, fail the match. Ditto if the superblock has been forcibly unmounted. While we're in here, also rename "other" to the more conventional "fsc". Cc: stable@vger.kernel.org URL: https://bugzilla.redhat.com/show_bug.cgi?id=1901499 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-27NFSD: Keep existing listeners on portlist errorBenjamin Coddington1-1/+4
[ Upstream commit c20106944eb679fa3ab7e686fe5f6ba30fbc51e5 ] If nfsd has existing listening sockets without any processes, then an error returned from svc_create_xprt() for an additional transport will remove those existing listeners. We're seeing this in practice when userspace attempts to create rpcrdma transports without having the rpcrdma modules present before creating nfsd kernel processes. Fix this by checking for existing sockets before calling nfsd_destroy(). Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-20btrfs: fix abort logic in btrfs_replace_file_extentsJosef Bacik1-7/+9
commit 4afb912f439c4bc4e6a4f3e7547f2e69e354108f upstream. Error injection testing uncovered a case where we'd end up with a corrupt file system with a missing extent in the middle of a file. This occurs because the if statement to decide if we should abort is wrong. The only way we would abort in this case is if we got a ret != -EOPNOTSUPP and we called from the file clone code. However the prealloc code uses this path too. Instead we need to abort if there is an error, and the only error we _don't_ abort on is -EOPNOTSUPP and only if we came from the clone file code. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-20btrfs: update refs for any root except tree log rootsJosef Bacik1-2/+1
commit d175209be04d7d263fa1a54cde7608c706c9d0d7 upstream. I hit a stuck relocation on btrfs/061 during my overnight testing. This turned out to be because we had left over extent entries in our extent root for a data reloc inode that no longer existed. This happened because in btrfs_drop_extents() we only update refs if we have SHAREABLE set or we are the tree_root. This regression was introduced by aeb935a45581 ("btrfs: don't set SHAREABLE flag for data reloc tree") where we stopped setting SHAREABLE for the data reloc tree. The problem here is we actually do want to update extent references for data extents in the data reloc tree, in fact we only don't want to update extent references if the file extents are in the log tree. Update this check to only skip updating references in the case of the log tree. This is relatively rare, because you have to be running scrub at the same time, which is what btrfs/061 does. The data reloc inode has its extents pre-allocated, and then we copy the extent into the pre-allocated chunks. We theoretically should never be calling btrfs_drop_extents() on a data reloc inode. The exception of course is with scrub, if our pre-allocated extent falls inside of the block group we are scrubbing, then the block group will be marked read only and we will be forced to cow that extent. This means we will call btrfs_drop_extents() on that range when we COW that file extent. This isn't really problematic if we do this, the data reloc inode requires that our extent lengths match exactly with the extent we are copying, thankfully we validate the extent is correct with get_new_location(), so if we happen to COW only part of the extent we won't link it in when we do the relocation, so we are safe from any other shenanigans that arise because of this interaction with scrub. Fixes: aeb935a45581 ("btrfs: don't set SHAREABLE flag for data reloc tree") CC: stable@vger.kernel.org # 5.8+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-20btrfs: check for error when looking up inode during dir entry replayFilipe Manana1-7/+7
commit cfd312695b71df04c3a2597859ff12c470d1e2e4 upstream. At replay_one_name(), we are treating any error from btrfs_lookup_inode() as if the inode does not exists. Fix this by checking for an error and returning it to the caller. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-20btrfs: deal with errors when adding inode reference during log replayFilipe Manana1-2/+7
commit 52db77791fe24538c8aa2a183248399715f6b380 upstream. At __inode_add_ref(), we treating any error returned from btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning that there is no existing directory entry in the fs/subvolume tree. This is not correct since we can get errors such as, for example, -EIO when reading extent buffers while searching the fs/subvolume's btree. So fix that and return the error to the caller when it is not -ENOENT. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-20btrfs: deal with errors when replaying dir entry during log replayFilipe Manana1-1/+8
commit e15ac6413745e3def00e663de00aea5a717311c1 upstream. At replay_one_one(), we are treating any error returned from btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning that there is no existing directory entry in the fs/subvolume tree. This is not correct since we can get errors such as, for example, -EIO when reading extent buffers while searching the fs/subvolume's btree. So fix that and return the error to the caller when it is not -ENOENT. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-20btrfs: unlock newly allocated extent buffer after errorQu Wenruo1-0/+1
commit 19ea40dddf1833db868533958ca066f368862211 upstream. [BUG] There is a bug report that injected ENOMEM error could leave a tree block locked while we return to user-space: BTRFS info (device loop0): enabling ssd optimizations FAULT_INJECTION: forcing a failure. name failslab, interval 1, probability 0, space 0, times 0 CPU: 0 PID: 7579 Comm: syz-executor Not tainted 5.15.0-rc1 #16 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106 fail_dump lib/fault-inject.c:52 [inline] should_fail+0x13c/0x160 lib/fault-inject.c:146 should_failslab+0x5/0x10 mm/slab_common.c:1328 slab_pre_alloc_hook.constprop.99+0x4e/0xc0 mm/slab.h:494 slab_alloc_node mm/slub.c:3120 [inline] slab_alloc mm/slub.c:3214 [inline] kmem_cache_alloc+0x44/0x280 mm/slub.c:3219 btrfs_alloc_delayed_extent_op fs/btrfs/delayed-ref.h:299 [inline] btrfs_alloc_tree_block+0x38c/0x670 fs/btrfs/extent-tree.c:4833 __btrfs_cow_block+0x16f/0x7d0 fs/btrfs/ctree.c:415 btrfs_cow_block+0x12a/0x300 fs/btrfs/ctree.c:570 btrfs_search_slot+0x6b0/0xee0 fs/btrfs/ctree.c:1768 btrfs_insert_empty_items+0x80/0xf0 fs/btrfs/ctree.c:3905 btrfs_new_inode+0x311/0xa60 fs/btrfs/inode.c:6530 btrfs_create+0x12b/0x270 fs/btrfs/inode.c:6783 lookup_open+0x660/0x780 fs/namei.c:3282 open_last_lookups fs/namei.c:3352 [inline] path_openat+0x465/0xe20 fs/namei.c:3557 do_filp_open+0xe3/0x170 fs/namei.c:3588 do_sys_openat2+0x357/0x4a0 fs/open.c:1200 do_sys_open+0x87/0xd0 fs/open.c:1216 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x34/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x46ae99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f46711b9c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 000000000078c0a0 RCX: 000000000046ae99 RDX: 0000000000000000 RSI: 00000000000000a1 RDI: 0000000020005800 RBP: 00007f46711b9c80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000017 R13: 0000000000000000 R14: 000000000078c0a0 R15: 00007ffc129da6e0 ================================================ WARNING: lock held when returning to user space! 5.15.0-rc1 #16 Not tainted ------------------------------------------------ syz-executor/7579 is leaving the kernel with locks still held! 1 lock held by syz-executor/7579: #0: ffff888104b73da8 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x2e/0x1a0 fs/btrfs/locking.c:112 [CAUSE] In btrfs_alloc_tree_block(), after btrfs_init_new_buffer(), the new extent buffer @buf is locked, but if later operations like adding delayed tree ref fail, we just free @buf without unlocking it, resulting above warning. [FIX] Unlock @buf in out_free_buf: label. Reported-by: Hao Sun <sunhao.th@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CACkBjsZ9O6Zr0KK1yGn=1rQi6Crh1yeCRdTSBxx9R99L4xdn-Q@mail.gmail.com/ CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-17io_uring: kill fasyncPavel Begunkov1-15/+2
[ Upstream commit 3f008385d46d3cea4a097d2615cd485f2184ba26 ] We have never supported fasync properly, it would only fire when there is something polling io_uring making it useless. The original support came in through the initial io_uring merge for 5.1. Since it's broken and nobody has reported it, get rid of the fasync bits. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2f7ca3d344d406d34fa6713824198915c41cea86.1633080236.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-17vboxfs: fix broken legacy mount signature checkingLinus Torvalds1-10/+2
[ Upstream commit 9b3b353ef330e20bc2d99bf3165cc044cff26a09 ] Commit 9d682ea6bcc7 ("vboxsf: Fix the check for the old binary mount-arguments struct") was meant to fix a build error due to sign mismatch in 'char' and the use of character constants, but it just moved the error elsewhere, in that on some architectures characters and signed and on others they are unsigned, and that's just how the C standard works. The proper fix is a simple "don't do that then". The code was just being silly and odd, and it should never have cared about signed vs unsigned characters in the first place, since what it is testing is not four "characters", but four bytes. And the way to compare four bytes is by using "memcmp()". Which compilers will know to just turn into a single 32-bit compare with a constant, as long as you don't have crazy debug options enabled. Link: https://lore.kernel.org/lkml/20210927094123.576521-1-arnd@kernel.org/ Cc: Arnd Bergmann <arnd@kernel.org> Cc: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-17ext4: correct the error path of ext4_write_inline_data_end()Zhang Yi2-12/+10
[ Upstream commit 55ce2f649b9e88111270333a8127e23f4f8f42d7 ] Current error path of ext4_write_inline_data_end() is not correct. Firstly, it should pass out the error value if ext4_get_inode_loc() return fail, or else it could trigger infinite loop if we inject error here. And then it's better to add inode to orphan list if it return fail in ext4_journal_stop(), otherwise we could not restore inline xattr entry after power failure. Finally, we need to reset the 'ret' value if ext4_write_inline_data_end() return success in ext4_write_end() and ext4_journalled_write_end(), otherwise we could not get the error return value of ext4_journal_stop(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210716122024.1105856-3-yi.zhang@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-17ext4: check and update i_disksize properlyZhang Yi1-16/+18
[ Upstream commit 4df031ff5876d94b48dd9ee486ba5522382a06b2 ] After commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <= isize"), i_disksize could always be updated to i_size in ext4_setattr(), and we could sure that i_disksize <= i_size since holding inode lock and if i_disksize < i_size there are delalloc writes pending in the range upto i_size. If the end of the current write is <= i_size, there's no need to touch i_disksize since writeback will push i_disksize upto i_size eventually. So we can switch to check i_size instead of i_disksize in ext4_da_write_end() when write to the end of the file. we also could remove ext4_mark_inode_dirty() together because we defer inode dirtying to generic_write_end() or ext4_da_write_inline_data_end(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210716122024.1105856-2-yi.zhang@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-13afs: Fix afs_launder_page() to set correct start file positionDavid Howells1-2/+1
[ Upstream commit 5c0522484eb54b90f2e46a5db8d7a4ff3ff86e5d ] Fix afs_launder_page() to set the starting position of the StoreData RPC at the offset into the page at which the modified data starts instead of at the beginning of the page (the iov_iter is correctly offset). The offset got lost during the conversion to passing an iov_iter into afs_store_data(). Changes: ver #2: - Use page_offset() rather than manually calculating it[1]. Fixes: bd80d8a80e12 ("afs: Use ITER_XARRAY for writing") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/YST/0e92OdSH0zjg@casper.infradead.org/ [1] Link: https://lore.kernel.org/r/162880783179.3421678.7795105718190440134.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/162937512409.1449272.18441473411207824084.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/162981148752.1901565.3663780601682206026.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163005741670.2472992.2073548908229887941.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163221839087.3143591.14278359695763025231.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163292980654.4004896.7134735179887998551.stgit@warthog.procyon.org.uk/ # v2 Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-13netfs: Fix READ/WRITE confusion when calling iov_iter_xarray()David Howells1-1/+1
[ Upstream commit 330de47d14af0c3995db81cc03cf5ca683d94d81 ] Fix netfs_clear_unread() to pass READ to iov_iter_xarray() instead of WRITE (the flag is about the operation accessing the buffer, not what sort of access it is doing to the buffer). Fixes: 3d3c95046742 ("netfs: Provide readahead and readpage netfs helpers") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/162729351325.813557.9242842205308443901.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/162886603464.3940407.3790841170414793899.stgit@warthog.procyon.org.uk Link: https://lore.kernel.org/r/163239074602.1243337.14154704004485867017.stgit@warthog.procyon.org.uk Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-13nfsd4: Handle the NFSv4 READDIR 'dircount' hint being zeroTrond Myklebust1-8/+11
commit f2e717d655040d632c9015f19aa4275f8b16e7f2 upstream. RFC3530 notes that the 'dircount' field may be zero, in which case the recommendation is to ignore it, and only enforce the 'maxcount' field. In RFC5661, this recommendation to ignore a zero valued field becomes a requirement. Fixes: aee377644146 ("nfsd4: fix rd_dircount enforcement") Cc: <stable@vger.kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-13nfsd: fix error handling of register_pernet_subsys() in init_nfsd()Patrick Ho1-1/+1
commit 1d625050c7c2dd877e108e382b8aaf1ae3cfe1f4 upstream. init_nfsd() should not unregister pernet subsys if the register fails but should instead unwind from the last successful operation which is register_filesystem(). Unregistering a failed register_pernet_subsys() call can result in a kernel GPF as revealed by programmatically injecting an error in register_pernet_subsys(). Verified the fix handled failure gracefully with no lingering nfsd entry in /proc/filesystems. This change was introduced by the commit bd5ae9288d64 ("nfsd: register pernet ops last, unregister first"), the original error handling logic was correct. Fixes: bd5ae9288d64 ("nfsd: register pernet ops last, unregister first") Cc: stable@vger.kernel.org Signed-off-by: Patrick Ho <Patrick.Ho@netapp.com> Acked-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-13ovl: fix IOCB_DIRECT if underlying fs doesn't support direct IOMiklos Szeredi1-1/+14
commit 1dc1eed46f9fa4cb8a07baa24fb44c96d6dd35c9 upstream. Normally the check at open time suffices, but e.g loop device does set IOCB_DIRECT after doing its own checks (which are not sufficent for overlayfs). Make sure we don't call the underlying filesystem read/write method with the IOCB_DIRECT if it's not supported. Reported-by: Huang Jianan <huangjianan@oppo.com> Fixes: 16914e6fc7e1 ("ovl: add ovl_read_iter()") Cc: <stable@vger.kernel.org> # v4.19 Tested-by: Huang Jianan <huangjianan@oppo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-13ovl: fix missing negative dentry check in ovl_rename()Zheng Liang1-3/+7
commit a295aef603e109a47af355477326bd41151765b6 upstream. The following reproducer mkdir lower upper work merge touch lower/old touch lower/new mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merge rm merge/new mv merge/old merge/new & unlink upper/new may result in this race: PROCESS A: rename("merge/old", "merge/new"); overwrite=true,ovl_lower_positive(old)=true, ovl_dentry_is_whiteout(new)=true -> flags |= RENAME_EXCHANGE PROCESS B: unlink("upper/new"); PROCESS A: lookup newdentry in new_upperdir call vfs_rename() with negative newdentry and RENAME_EXCHANGE Fix by adding the missing check for negative newdentry. Signed-off-by: Zheng Liang <zhengliang6@huawei.com> Fixes: e9be9d5e76e3 ("overlay filesystem") Cc: <stable@vger.kernel.org> # v3.18 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>