summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-09-30Merge tag 'ceph-for-6.6-rc4' of https://github.com/ceph/ceph-clientLinus Torvalds1-4/+2
Pull ceph fixes from Ilya Dryomov: "A series that fixes an involved 'double watch error' deadlock in RBD marked for stable and two cleanups" * tag 'ceph-for-6.6-rc4' of https://github.com/ceph/ceph-client: rbd: take header_rwsem in rbd_dev_refresh() only when updating rbd: decouple parent info read-in from updating rbd_dev rbd: decouple header read-in from updating rbd_dev->header rbd: move rbd_dev_refresh() definition Revert "ceph: make members in struct ceph_mds_request_args_ext a union" ceph: remove unnecessary check for NULL in parse_longname()
2023-09-30Merge tag 'xfs-6.6-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds4-20/+61
Pull xfs fix from Chandan Babu: - fix for commit 68b957f64fca ("xfs: load uncached unlinked inodes into memory on demand") which address review comments provided by Dave Chinner * tag 'xfs-6.6-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix reloading entire unlinked bucket lists
2023-09-28fs/smb/client: Reset password pointer to NULLQuang Le1-0/+1
Forget to reset ctx->password to NULL will lead to bug like double free Cc: stable@vger.kernel.org Cc: Willy Tarreau <w@1wt.eu> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Quang Le <quanglex97@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-28iomap: Spelling s/preceeding/preceding/gGeert Uytterhoeven1-1/+1
Fix a misspelling of "preceding". Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-28NFSD: Fix zero NFSv4 READ results when RQ_SPLICE_OK is not setChuck Lever1-2/+2
nfsd4_encode_readv() uses xdr->buf->page_len as a starting point for the nfsd_iter_read() sink buffer -- page_len is going to be offset by the parts of the COMPOUND that have already been encoded into xdr->buf->pages. However, that value must be captured /before/ xdr_reserve_space_vec() advances page_len by the expected size of the read payload. Otherwise, the whole front part of the first page of the payload in the reply will be uninitialized. Mantas hit this because sec=krb5i forces RQ_SPLICE_OK off, which invokes the readv part of the nfsd4_encode_read() path. Also, older Linux NFS clients appear to send shorter READ requests for files smaller than a page, whereas newer clients just send page-sized requests and let the server send as many bytes as are in the file. Reported-by: Mantas Mikulėnas <grawity@gmail.com> Closes: https://lore.kernel.org/linux-nfs/f1d0b234-e650-0f6e-0f5d-126b3d51d1eb@gmail.com/ Fixes: 703d75215555 ("NFSD: Hoist rq_vec preparation into nfsd_read() [step two]") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-09-26Merge tag 'for-6.6-rc3-tag' of ↵Linus Torvalds8-33/+63
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - delayed refs fixes: - fix race when refilling delayed refs block reserve - prevent transaction block reserve underflow when starting transaction - error message and value adjustments - fix build warnings with CONFIG_CC_OPTIMIZE_FOR_SIZE and -Wmaybe-uninitialized - fix for smatch report where uninitialized data from invalid extent buffer range could be returned to the caller - fix numeric overflow in statfs when calculating lower threshold for a full filesystem * tag 'for-6.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: initialize start_slot in btrfs_log_prealloc_extents btrfs: make sure to initialize start and len in find_free_dev_extent btrfs: reset destination buffer when read_extent_buffer() gets invalid range btrfs: properly report 0 avail for very full file systems btrfs: log message if extent item not found when running delayed extent op btrfs: remove redundant BUG_ON() from __btrfs_inc_extent_ref() btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1 btrfs: prevent transaction block reserve underflow when starting transaction btrfs: fix race when refilling delayed refs block reserve
2023-09-26Merge tag 'v6.6-rc4.vfs.fixes' of ↵Linus Torvalds8-9/+21
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains the usual miscellaneous fixes and cleanups for vfs and individual fses: Fixes: - Revert ki_pos on error from buffered writes for direct io fallback - Add missing documentation for block device and superblock handling for changes merged this cycle - Fix reiserfs flexible array usage - Ensure that overlayfs sets ctime when setting mtime and atime - Disable deferred caller completions with overlayfs writes until proper support exists Cleanups: - Remove duplicate initialization in pipe code - Annotate aio kioctx_table with __counted_by" * tag 'v6.6-rc4.vfs.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: overlayfs: set ctime when setting mtime and atime ntfs3: put resources during ntfs_fill_super() ovl: disable IOCB_DIO_CALLER_COMP porting: document superblock as block device holder porting: document new block device opening order fs/pipe: remove duplicate "offset" initializer fs-writeback: do not requeue a clean inode having skipped pages aio: Annotate struct kioctx_table with __counted_by direct_write_fallback(): on error revert the ->ki_pos update from buffered write reiserfs: Replace 1-element array with C99 style flex-array
2023-09-25iomap: add a workaround for racy i_size updates on block devicesChristoph Hellwig1-1/+10
A szybot reproducer that does write I/O while truncating the size of a block device can end up in clean_bdev_aliases, which tries to clean the bdev aliases that it uses. This is because iomap_to_bh automatically sets the BH_New flag when outside of i_size. For block devices updates to i_size are racy and we can hit this case in a tiny race window, leading to the eventual clean_bdev_aliases call. Fix this by erroring out of > i_size I/O on block devices. Reported-by: syzbot+1fa947e7f09e136925b8@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: syzbot+1fa947e7f09e136925b8@syzkaller.appspotmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-25overlayfs: set ctime when setting mtime and atimeJeff Layton1-1/+1
Nathan reported that he was seeing the new warning in setattr_copy_mgtime pop when starting podman containers. Overlayfs is trying to set the atime and mtime via notify_change without also setting the ctime. POSIX states that when the atime and mtime are updated via utimes() that we must also update the ctime to the current time. The situation with overlayfs copy-up is analogies, so add ATTR_CTIME to the bitmask. notify_change will fill in the value. Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230913-ctime-v1-1-c6bc509cbc27@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-25ntfs3: put resources during ntfs_fill_super()Christian Brauner1-0/+1
During ntfs_fill_super() some resources are allocated that we need to cleanup in ->put_super() such as additional inodes. When ntfs_fill_super() fails these resources need to be cleaned up as well. Reported-by: syzbot+2751da923b5eb8307b0b@syzkaller.appspotmail.com Fixes: 78a06688a4d4 ("ntfs3: drop inode references in ntfs_put_super()") Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-25ovl: disable IOCB_DIO_CALLER_COMPJens Axboe1-0/+6
overlayfs copies the kiocb flags when it sets up a new kiocb to handle a write, but it doesn't properly support dealing with the deferred caller completions of the kiocb. This means it doesn't get the final write completion value, and hence will complete the write with '0' as the result. We could support the caller completions in overlayfs, but for now let's just disable them in the generated write kiocb. Reported-by: Zorro Lang <zlang@redhat.com> Link: https://lore.kernel.org/io-uring/20230924142754.ejwsjen5pvyc32l4@dell-per750-06-vm-08.rhts.eng.pek2.redhat.com/ Fixes: 8c052fb3002e ("iomap: support IOCB_DIO_CALLER_COMP") Signed-off-by: Jens Axboe <axboe@kernel.dk> Message-Id: <71897125-e570-46ce-946a-d4729725e28f@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-25xfs: fix reloading entire unlinked bucket listsDarrick J. Wong4-20/+61
During review of the patcheset that provided reloading of the incore iunlink list, Dave made a few suggestions, and I updated the copy in my dev tree. Unfortunately, I then got distracted by ... who even knows what ... and forgot to backport those changes from my dev tree to my release candidate branch. I then sent multiple pull requests with stale patches, and that's what was merged into -rc3. So. This patch re-adds the use of an unlocked iunlink list check to determine if we want to allocate the resources to recreate the incore list. Since lost iunlinked inodes are supposed to be rare, this change helps us avoid paying the transaction and AGF locking costs every time we open any inode. This also re-adds the shutdowns on failure, and re-applies the restructuring of the inner loop in xfs_inode_reload_unlinked_bucket, and re-adds a requested comment about the quotachecking code. Retain the original RVB tag from Dave since there's no code change from the last submission. Fixes: 68b957f64fca1 ("xfs: load uncached unlinked inodes into memory on demand") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-24Merge tag 'trace-v6.6-rc2' of ↵Linus Torvalds1-17/+70
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix the "bytes" output of the per_cpu stat file The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the accounting was not accurate. It is suppose to show how many used bytes are still in the ring buffer, but even when the ring buffer was empty it would still show there were bytes used. - Fix a bug in eventfs where reading a dynamic event directory (open) and then creating a dynamic event that goes into that diretory screws up the accounting. On close, the newly created event dentry will get a "dput" without ever having a "dget" done for it. The fix is to allocate an array on dir open to save what dentries were actually "dget" on, and what ones to "dput" on close. * tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Remember what dentries were created on dir open ring-buffer: Fix bytes info in per_cpu buffer stats
2023-09-23Merge tag 'mm-hotfixes-stable-2023-09-23-10-31' of ↵Linus Torvalds2-29/+37
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "13 hotfixes, 10 of which pertain to post-6.5 issues. The other three are cc:stable" * tag 'mm-hotfixes-stable-2023-09-23-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: proc: nommu: fix empty /proc/<pid>/maps filemap: add filemap_map_order0_folio() to handle order0 folio proc: nommu: /proc/<pid>/maps: release mmap read lock mm: memcontrol: fix GFP_NOFS recursion in memory.high enforcement pidfd: prevent a kernel-doc warning argv_split: fix kernel-doc warnings scatterlist: add missing function params to kernel-doc selftests/proc: fixup proc-empty-vm test after KSM changes revert "scripts/gdb/symbols: add specific ko module load command" selftests: link libasan statically for tests with -fsanitize=address task_work: add kerneldoc annotation for 'data' argument mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list sh: mm: re-add lost __ref to ioremap_prot() to fix modpost warning
2023-09-23Merge tag '6.6-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds11-27/+60
Pull smb client fixes from Steve French: "Six smb3 client fixes, including three for stable, from the SMB plugfest (testing event) this week: - Reparse point handling fix (found when investigating dir enumeration when fifo in dir) - Fix excessive thread creation for dir lease cleanup - UAF fix in negotiate path - remove duplicate error message mapping and fix confusing warning message - add dynamic trace point to improve debugging RDMA connection attempts" * tag '6.6-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: fix confusing debug message smb: client: handle STATUS_IO_REPARSE_TAG_NOT_HANDLED smb3: remove duplicate error mapping cifs: Fix UAF in cifs_demultiplex_thread() smb3: do not start laundromat thread when dir leases disabled smb3: Add dynamic trace points for RDMA (smbdirect) reconnect
2023-09-23Merge tag 'iomap-6.6-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2-23/+32
Pull iomap fixes from Darrick Wong: - Return EIO on bad inputs to iomap_to_bh instead of BUGging, to deal less poorly with block device io racing with block device resizing - Fix a stale page data exposure bug introduced in 6.6-rc1 when unsharing a file range that is not in the page cache * tag 'iomap-6.6-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: convert iomap_unshare_iter to use large folios iomap: don't skip reading in !uptodate folios when unsharing a range iomap: handle error conditions more gracefully in iomap_to_bh
2023-09-23Merge tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds27-240/+441
Pull xfs fixes from Chandan Babu: - Fix an integer overflow bug when processing an fsmap call - Fix crash due to CPU hot remove event racing with filesystem mount operation - During read-only mount, XFS does not allow the contents of the log to be recovered when there are one or more unrecognized rcompat features in the primary superblock, since the log might have intent items which the kernel does not know how to process - During recovery of log intent items, XFS now reserves log space sufficient for one cycle of a permanent transaction to execute. Otherwise, this could lead to livelocks due to non-availability of log space - On an fs which has an ondisk unlinked inode list, trying to delete a file or allocating an O_TMPFILE file can cause the fs to the shutdown if the first inode in the ondisk inode list is not present in the inode cache. The bug is solved by explicitly loading the first inode in the ondisk unlinked inode list into the inode cache if it is not already cached A similar problem arises when the uncached inode is present in the middle of the ondisk unlinked inode list. This second bug is triggered when executing operations like quotacheck and bulkstat. In this case, XFS now reads in the entire ondisk unlinked inode list - Enable LARP mode only on recent v5 filesystems - Fix a out of bounds memory access in scrub - Fix a performance bug when locating the tail of the log during mounting a filesystem * tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail xfs: only call xchk_stats_merge after validating scrub inputs xfs: require a relatively recent V5 filesystem for LARP mode xfs: make inode unlinked bucket recovery work with quotacheck xfs: load uncached unlinked inodes into memory on demand xfs: reserve less log space when recovering log intent items xfs: fix log recovery when unknown rocompat bits are set xfs: reload entire unlinked bucket lists xfs: allow inode inactivation during a ro mount log recovery xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list xfs: remove CPU hotplug infrastructure xfs: remove the all-mounts list xfs: use per-mount cpumask to track nonempty percpu inodegc lists xfs: fix an agbno overflow in __xfs_getfsmap_datadev xfs: fix per-cpu CIL structure aggregation racing with dying cpus xfs: fix select in config XFS_ONLINE_SCRUB_STATS
2023-09-22eventfs: Remember what dentries were created on dir openSteven Rostedt (Google)1-17/+70
Using the following code with libtracefs: int dfd; // create the directory events/kprobes/kp1 tracefs_kprobe_raw(NULL, "kp1", "schedule_timeout", "time=$arg1"); // Open the kprobes directory dfd = tracefs_instance_file_open(NULL, "events/kprobes", O_RDONLY); // Do a lookup of the kprobes/kp1 directory (by looking at enable) tracefs_file_exists(NULL, "events/kprobes/kp1/enable"); // Now create a new entry in the kprobes directory tracefs_kprobe_raw(NULL, "kp2", "schedule_hrtimeout", "expires=$arg1"); // Do another lookup to create the dentries tracefs_file_exists(NULL, "events/kprobes/kp2/enable")) // Close the directory close(dfd); What happened above, the first open (dfd) will call dcache_dir_open_wrapper() that will create the dentries and up their ref counts. Now the creation of "kp2" will add another dentry within the kprobes directory. Upon the close of dfd, eventfs_release() will now do a dput for all the entries in kprobes. But this is where the problem lies. The open only upped the dentry of kp1 and not kp2. Now the close is decrementing both kp1 and kp2, which causes kp2 to get a negative count. Doing a "trace-cmd reset" which deletes all the kprobes cause the kernel to crash! (due to the messed up accounting of the ref counts). To solve this, save all the dentries that are opened in the dcache_dir_open_wrapper() into an array, and use this array to know what dentries to do a dput on in eventfs_release(). Since the dcache_dir_open_wrapper() calls dcache_dir_open() which uses the file->private_data, we need to also add a wrapper around dcache_readdir() that uses the cursor assigned to the file->private_data. This is because the dentries need to also be saved in the file->private_data. To do this create the structure: struct dentry_list { void *cursor; struct dentry **dentries; }; Which will hold both the cursor and the dentries. Some shuffling around is needed to make sure that dcache_dir_open() and dcache_readdir() only see the cursor. Link: https://lore.kernel.org/linux-trace-kernel/20230919211804.230edf1e@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d4fa@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ajay Kaher <akaher@vmware.com> Fixes: 63940449555e7 ("eventfs: Implement eventfs lookup, read, open functions") Reported-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-21ksmbd: check iov vector index in ksmbd_conn_write()Namjae Jeon1-0/+3
If ->iov_idx is zero, This means that the iov vector for the response was not added during the request process. In other words, it means that there is a problem in generating a response, So this patch return as an error to avoid NULL pointer dereferencing problem. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-21ksmbd: return invalid parameter error response if smb2 request is invalidNamjae Jeon2-4/+4
If smb2 request from client is invalid, The following kernel oops could happen. The patch e2b76ab8b5c9: "ksmbd: add support for read compound" leads this issue. When request is invalid, It doesn't set anything in the response buffer. This patch add missing set invalid parameter error response. [ 673.085542] ksmbd: cli req too short, len 184 not 142. cmd:5 mid:109 [ 673.085580] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 673.085591] #PF: supervisor read access in kernel mode [ 673.085600] #PF: error_code(0x0000) - not-present page [ 673.085608] PGD 0 P4D 0 [ 673.085620] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 673.085631] CPU: 3 PID: 1039 Comm: kworker/3:0 Not tainted 6.6.0-rc2-tmt #16 [ 673.085643] Hardware name: AZW U59/U59, BIOS JTKT001 05/05/2022 [ 673.085651] Workqueue: ksmbd-io handle_ksmbd_work [ksmbd] [ 673.085719] RIP: 0010:ksmbd_conn_write+0x68/0xc0 [ksmbd] [ 673.085808] RAX: 0000000000000000 RBX: ffff88811ade4f00 RCX: 0000000000000000 [ 673.085817] RDX: 0000000000000000 RSI: ffff88810c2a9780 RDI: ffff88810c2a9ac0 [ 673.085826] RBP: ffffc900005e3e00 R08: 0000000000000000 R09: 0000000000000000 [ 673.085834] R10: ffffffffa3168160 R11: 63203a64626d736b R12: ffff8881057c8800 [ 673.085842] R13: ffff8881057c8820 R14: ffff8882781b2380 R15: ffff8881057c8800 [ 673.085852] FS: 0000000000000000(0000) GS:ffff888278180000(0000) knlGS:0000000000000000 [ 673.085864] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 673.085872] CR2: 0000000000000000 CR3: 000000015b63c000 CR4: 0000000000350ee0 [ 673.085883] Call Trace: [ 673.085890] <TASK> [ 673.085900] ? show_regs+0x6a/0x80 [ 673.085916] ? __die+0x25/0x70 [ 673.085926] ? page_fault_oops+0x154/0x4b0 [ 673.085938] ? tick_nohz_tick_stopped+0x18/0x50 [ 673.085954] ? __irq_work_queue_local+0xba/0x140 [ 673.085967] ? do_user_addr_fault+0x30f/0x6c0 [ 673.085979] ? exc_page_fault+0x79/0x180 [ 673.085992] ? asm_exc_page_fault+0x27/0x30 [ 673.086009] ? ksmbd_conn_write+0x68/0xc0 [ksmbd] [ 673.086067] ? ksmbd_conn_write+0x46/0xc0 [ksmbd] [ 673.086123] handle_ksmbd_work+0x28d/0x4b0 [ksmbd] [ 673.086177] process_one_work+0x178/0x350 [ 673.086193] ? __pfx_worker_thread+0x10/0x10 [ 673.086202] worker_thread+0x2f3/0x420 [ 673.086210] ? _raw_spin_unlock_irqrestore+0x27/0x50 [ 673.086222] ? __pfx_worker_thread+0x10/0x10 [ 673.086230] kthread+0x103/0x140 [ 673.086242] ? __pfx_kthread+0x10/0x10 [ 673.086253] ret_from_fork+0x39/0x60 [ 673.086263] ? __pfx_kthread+0x10/0x10 [ 673.086274] ret_from_fork_asm+0x1b/0x30 Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound") Reported-by: Tom Talpey <tom@talpey.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-21Merge tag 'v6.6-rc3.vfs.ctime.revert' of ↵Linus Torvalds8-133/+35
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull finegrained timestamp reverts from Christian Brauner: "Earlier this week we sent a few minor fixes for the multi-grained timestamp work in [1]. While we were polishing those up after Linus realized that there might be a nicer way to fix them we received a regression report in [2] that fine grained timestamps break gnulib tests and thus possibly other tools. The kernel will elide fine-grain timestamp updates when no one is actively querying for them to avoid performance impacts. So a sequence like write(f1) stat(f2) write(f2) stat(f2) write(f1) stat(f1) may result in timestamp f1 to be older than the final f2 timestamp even though f1 was last written too but the second write didn't update the timestamp. Such plotholes can lead to subtle bugs when programs compare timestamps. For example, the nap() function in [2] will estimate that it needs to wait one ns on a fine-grain timestamp enabled filesytem between subsequent calls to observe a timestamp change. But in general we don't update timestamps with more than one jiffie if we think that no one is actively querying for fine-grain timestamps to avoid performance impacts. While discussing various fixes the decision was to go back to the drawing board and ultimately to explore a solution that involves only exposing such fine-grained timestamps to nfs internally and never to userspace. As there are multiple solutions discussed the honest thing to do here is not to fix this up or disable it but to cleanly revert. The general infrastructure will probably come back but there is no reason to keep this code in mainline. The general changes to timestamp handling are valid and a good cleanup that will stay. The revert is fully bisectable" Link: https://lore.kernel.org/all/20230918-hirte-neuzugang-4c2324e7bae3@brauner [1] Link: https://lore.kernel.org/all/bf0524debb976627693e12ad23690094e4514303.camel@linuxfromscratch.org [2] * tag 'v6.6-rc3.vfs.ctime.revert' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: Revert "fs: add infrastructure for multigrain timestamps" Revert "btrfs: convert to multigrain timestamps" Revert "ext4: switch to multigrain timestamps" Revert "xfs: switch to multigrain timestamps" Revert "tmpfs: add support for multigrain timestamps"
2023-09-21btrfs: initialize start_slot in btrfs_log_prealloc_extentsJosef Bacik1-1/+1
Jens reported a compiler warning when using CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this fs/btrfs/tree-log.c: In function ‘btrfs_log_prealloc_extents’: fs/btrfs/tree-log.c:4828:23: warning: ‘start_slot’ may be used uninitialized [-Wmaybe-uninitialized] 4828 | ret = copy_items(trans, inode, dst_path, path, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4829 | start_slot, ins_nr, 1, 0); | ~~~~~~~~~~~~~~~~~~~~~~~~~ fs/btrfs/tree-log.c:4725:13: note: ‘start_slot’ was declared here 4725 | int start_slot; | ^~~~~~~~~~ The compiler is incorrect, as we only use this code when ins_len > 0, and when ins_len > 0 we have start_slot properly initialized. However we generally find the -Wmaybe-uninitialized warnings valuable, so initialize start_slot to get rid of the warning. Reported-by: Jens Axboe <axboe@kernel.dk> Tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-21btrfs: make sure to initialize start and len in find_free_dev_extentJosef Bacik1-7/+6
Jens reported a compiler error when using CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this In function ‘gather_device_info’, inlined from ‘btrfs_create_chunk’ at fs/btrfs/volumes.c:5507:8: fs/btrfs/volumes.c:5245:48: warning: ‘dev_offset’ may be used uninitialized [-Wmaybe-uninitialized] 5245 | devices_info[ndevs].dev_offset = dev_offset; | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~ fs/btrfs/volumes.c: In function ‘btrfs_create_chunk’: fs/btrfs/volumes.c:5196:13: note: ‘dev_offset’ was declared here 5196 | u64 dev_offset; This occurs because find_free_dev_extent is responsible for setting dev_offset, however if we get an -ENOMEM at the top of the function we'll return without setting the value. This isn't actually a problem because we will see the -ENOMEM in gather_device_info() and return and not use the uninitialized value, however we also just don't want the compiler warning so rework the code slightly in find_free_dev_extent() to make sure it's always setting *start and *len to avoid the compiler warning. Reported-by: Jens Axboe <axboe@kernel.dk> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-21smb3: fix confusing debug messageSteve French1-1/+1
The message said it was an invalid mode, when it was intentionally not set. Fix confusing message logged to dmesg. Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-21smb: client: handle STATUS_IO_REPARSE_TAG_NOT_HANDLEDPaulo Alcantara1-0/+3
Fix missing set of cifs_open_info_data::reparse_point when SMB2_CREATE request fails with STATUS_IO_REPARSE_TAG_NOT_HANDLED. Fixes: 5f71ebc41294 ("smb: client: parse reparse point flag in create response") Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-21smb3: remove duplicate error mappingSteve French1-2/+0
In status_to_posix_error STATUS_IO_REPARSE_TAG_NOT_HANDLED was mapped to both -EOPNOTSUPP and also to -EIO but the later one (-EIO) is ignored. Remove the duplicate. Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-20btrfs: reset destination buffer when read_extent_buffer() gets invalid rangeQu Wenruo1-1/+7
Commit f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer read write functions") changed how we handle invalid extent buffer range for read_extent_buffer(). Previously if the range is invalid we just set the destination to zero, but after the patch we do nothing and error out. This can lead to smatch static checker errors like: fs/btrfs/print-tree.c:186 print_uuid_item() error: uninitialized symbol 'subvol_id'. fs/btrfs/tests/extent-io-tests.c:338 check_eb_bitmap() error: uninitialized symbol 'has'. fs/btrfs/tests/extent-io-tests.c:353 check_eb_bitmap() error: uninitialized symbol 'has'. fs/btrfs/uuid-tree.c:203 btrfs_uuid_tree_remove() error: uninitialized symbol 'read_subid'. fs/btrfs/uuid-tree.c:353 btrfs_uuid_tree_iterate() error: uninitialized symbol 'subid_le'. fs/btrfs/uuid-tree.c:72 btrfs_uuid_tree_lookup() error: uninitialized symbol 'data'. fs/btrfs/volumes.c:7415 btrfs_dev_stats_value() error: uninitialized symbol 'val'. Fix those warnings by reverting back to the old memset() behavior. By this we keep the static checker happy and would still make a lot of noise when such invalid ranges are passed in. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer read write functions") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20btrfs: properly report 0 avail for very full file systemsJosef Bacik1-1/+1
A user reported some issues with smaller file systems that get very full. While investigating this issue I noticed that df wasn't showing 100% full, despite having 0 chunk space and having < 1MiB of available metadata space. This turns out to be an overflow issue, we're doing: total_available_metadata_space - SZ_4M < global_block_rsv_size to determine if there's not enough space to make metadata allocations, which overflows if total_available_metadata_space is < 4M. Fix this by checking to see if our available space is greater than the 4M threshold. This makes df properly report 100% usage on the file system. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20btrfs: log message if extent item not found when running delayed extent opFilipe Manana1-1/+4
When running a delayed extent operation, if we don't find the extent item in the extent tree we just return -EIO without any logged message. This indicates some bug or possibly a memory or fs corruption, so the return value should not be -EIO but -EUCLEAN instead, and since it's not expected to ever happen, print an informative error message so that if it happens we have some idea of what went wrong, where to look at. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20btrfs: remove redundant BUG_ON() from __btrfs_inc_extent_ref()Filipe Manana1-4/+3
At __btrfs_inc_extent_ref() we are doing a BUG_ON() if we are dealing with a tree block reference that has a reference count that is different from 1, but we have already dealt with this case at run_delayed_tree_ref(), making it useless. So remove the BUG_ON(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1Filipe Manana1-3/+3
When running a delayed tree reference, if we find a ref count different from 1, we return -EIO. This isn't an IO error, as it indicates either a bug in the delayed refs code or a memory corruption, so change the error code from -EIO to -EUCLEAN. Also tag the branch as 'unlikely' as this is not expected to ever happen, and change the error message to print the tree block's bytenr without the parenthesis (and there was a missing space between the 'block' word and the opening parenthesis), for consistency as that's the style we used everywhere else. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20btrfs: prevent transaction block reserve underflow when starting transactionFilipe Manana3-12/+4
When starting a transaction, with a non-zero number of items, we reserve metadata space for that number of items and for delayed refs by doing a call to btrfs_block_rsv_add(), with the transaction block reserve passed as the block reserve argument. This reserves metadata space and adds it to the transaction block reserve. Later we migrate the space we reserved for delayed references from the transaction block reserve into the delayed refs block reserve, by calling btrfs_migrate_to_delayed_refs_rsv(). btrfs_migrate_to_delayed_refs_rsv() decrements the number of bytes to migrate from the source block reserve, and this however may result in an underflow in case the space added to the transaction block reserve ended up being used by another task that has not reserved enough space for its own use - examples are tasks doing reflinks or hole punching because they end up calling btrfs_replace_file_extents() -> btrfs_drop_extents() and may need to modify/COW a variable number of leaves/paths, so they keep trying to use space from the transaction block reserve when they need to COW an extent buffer, and may end up trying to use more space then they have reserved (1 unit/path only for removing file extent items). This can be avoided by simply reserving space first without adding it to the transaction block reserve, then add the space for delayed refs to the delayed refs block reserve and finally add the remaining reserved space to the transaction block reserve. This also makes the code a bit shorter and simpler. So just do that. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20btrfs: fix race when refilling delayed refs block reserveFilipe Manana1-3/+34
If we have two (or more) tasks attempting to refill the delayed refs block reserve we can end up with the delayed block reserve being over reserved, that is, with a reserved space greater than its size. If this happens, we are holding to more reserved space than necessary for a while. The race happens like this: 1) The delayed refs block reserve has a size of 8M and a reserved space of 6M for example; 2) Task A calls btrfs_delayed_refs_rsv_refill(); 3) Task B also calls btrfs_delayed_refs_rsv_refill(); 4) Task A sees there's a 2M difference between the size and the reserved space of the delayed refs rsv, so it will reserve 2M of space by calling btrfs_reserve_metadata_bytes(); 5) Task B also sees that 2M difference, and like task A, it reserves another 2M of metadata space; 6) Both task A and task B increase the reserved space of block reserve by 2M, by calling btrfs_block_rsv_add_bytes(), so the block reserve ends up with a size of 8M and a reserved space of 10M; 7) The extra, over reserved space will eventually be freed by some task calling btrfs_delayed_refs_rsv_release() -> btrfs_block_rsv_release() -> block_rsv_release_bytes(), as there we will detect the over reserve and release that space. So fix this by checking if we still need to add space to the delayed refs block reserve after reserving the metadata space, and if we don't, just release that space immediately. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20Merge tag 'for-6.6-rc2-tag' of ↵Linus Torvalds4-53/+69
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more followup fixes to the directory listing. People have noticed different behaviour compared to other filesystems after changes in 6.5. This is now unified to more "logical" and expected behaviour while still within POSIX. And a few more fixes for stable. - change behaviour of readdir()/rewinddir() when new directory entries are created after opendir(), properly tracking the last entry - fix race in readdir when multiple threads can set the last entry index for a directory Additionally: - use exclusive lock when direct io might need to drop privs and call notify_change() - don't clear uptodate bit on page after an error, this may lead to a deadlock in subpage mode - fix waiting pattern when multiple readers block on Merkle tree data, switch to folios" * tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix race between reading a directory and adding entries to it btrfs: refresh dir last index during a rewinddir(3) call btrfs: set last dir index to the current last index when opening dir btrfs: don't clear uptodate on write errors btrfs: file_remove_privs needs an exclusive lock in direct io write btrfs: convert btrfs_read_merkle_tree_page() to use a folio
2023-09-20Revert "fs: add infrastructure for multigrain timestamps"Christian Brauner2-118/+5
This reverts commit ffb6cf19e06334062744b7e3493f71e500964f8e. Users reported regressions due to enabling multi-grained timestamps unconditionally. As no clear consensus on a solution has come up and the discussion has gone back to the drawing board revert the infrastructure changes for. If it isn't code that's here to stay, make it go away. Message-ID: <20230920-keine-eile-c9755b5825db@brauner> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20Revert "btrfs: convert to multigrain timestamps"Christian Brauner2-7/+22
This reverts commit 50e9ceef1d4f644ee0049e82e360058a64ec284c. Users reported regressions due to enabling multi-grained timestamps unconditionally. As no clear consensus on a solution has come up and the discussion has gone back to the drawing board revert the infrastructure changes for. If it isn't code that's here to stay, make it go away. Message-ID: <20230920-keine-eile-c9755b5825db@brauner> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20Revert "ext4: switch to multigrain timestamps"Christian Brauner1-1/+1
This reverts commit 0269b585868e59b6a2ecc6ea685d39310e4fc18b. Users reported regressions due to enabling multi-grained timestamps unconditionally. As no clear consensus on a solution has come up and the discussion has gone back to the drawing board revert the infrastructure changes for. If it isn't code that's here to stay, make it go away. Message-ID: <20230920-keine-eile-c9755b5825db@brauner> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20Revert "xfs: switch to multigrain timestamps"Christian Brauner3-7/+7
This reverts commit e44df2664746aed8b6dd5245eb711a0ce33c5cf5. Users reported regressions due to enabling multi-grained timestamps unconditionally. As no clear consensus on a solution has come up and the discussion has gone back to the drawing board revert the infrastructure changes for. If it isn't code that's here to stay, make it go away. Message-ID: <20230920-keine-eile-c9755b5825db@brauner> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20fs/pipe: remove duplicate "offset" initializerMax Kellermann1-1/+0
This code duplication was introduced by commit a194dfe6e6f6 ("pipe: Rearrange sequence in pipe_write() to preallocate slot"), but since the pipe's mutex is locked, nobody else can modify the value meanwhile. Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Message-Id: <20230919074045.1066796-1-max.kellermann@ionos.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20fs-writeback: do not requeue a clean inode having skipped pagesChunhai Guo1-3/+8
When writing back an inode and performing an fsync on it concurrently, a deadlock issue may arise as shown below. In each writeback iteration, a clean inode is requeued to the wb->b_dirty queue due to non-zero pages_skipped, without anything actually being written. This causes an infinite loop and prevents the plug from being flushed, resulting in a deadlock. We now avoid requeuing the clean inode to prevent this issue. wb_writeback fsync (inode-Y) blk_start_plug(&plug) for (;;) { iter i-1: some reqs with page-X added into plug->mq_list // f2fs node page-X with PG_writeback filemap_fdatawrite __filemap_fdatawrite_range // write inode-Y with sync_mode WB_SYNC_ALL do_writepages f2fs_write_data_pages __f2fs_write_data_pages // wb_sync_req[DATA]++ for WB_SYNC_ALL f2fs_write_cache_pages f2fs_write_single_data_page f2fs_do_write_data_page f2fs_outplace_write_data f2fs_update_data_blkaddr f2fs_wait_on_page_writeback wait_on_page_writeback // wait for f2fs node page-X iter i: progress = __writeback_inodes_wb(wb, work) . writeback_sb_inodes . __writeback_single_inode // write inode-Y with sync_mode WB_SYNC_NONE . . do_writepages . . f2fs_write_data_pages . . . __f2fs_write_data_pages // skip writepages due to (wb_sync_req[DATA]>0) . . . wbc->pages_skipped += get_dirty_pages(inode) // wbc->pages_skipped = 1 . if (!(inode->i_state & I_DIRTY_ALL)) // i_state = I_SYNC | I_SYNC_QUEUED . total_wrote++; // total_wrote = 1 . requeue_inode // requeue inode-Y to wb->b_dirty queue due to non-zero pages_skipped if (progress) // progress = 1 continue; iter i+1: queue_io // similar process with iter i, infinite for-loop ! } blk_finish_plug(&plug) // flush plug won't be called Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230916045131.957929-1-guochunhai@vivo.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20aio: Annotate struct kioctx_table with __counted_byKees Cook1-1/+1
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct kioctx_table. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-aio@kvack.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Message-Id: <20230915201413.never.881-kees@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20direct_write_fallback(): on error revert the ->ki_pos update from buffered writeAl Viro1-0/+1
If we fail filemap_write_and_wait_range() on the range the buffered write went into, we only report the "number of bytes which we direct-written", to quote the comment in there. Which is fine, but buffered write has already advanced iocb->ki_pos, so we need to roll that back. Otherwise we end up with e.g. write(2) advancing position by more than the amount it reports having written. Fixes: 182c25e9c157 "filemap: update ki_pos in generic_perform_write" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Message-Id: <20230827214518.GU3390869@ZenIV> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20cifs: Fix UAF in cifs_demultiplex_thread()Zhang Xiaoxu2-11/+24
There is a UAF when xfstests on cifs: BUG: KASAN: use-after-free in smb2_is_network_name_deleted+0x27/0x160 Read of size 4 at addr ffff88810103fc08 by task cifsd/923 CPU: 1 PID: 923 Comm: cifsd Not tainted 6.1.0-rc4+ #45 ... Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_report+0x171/0x472 kasan_report+0xad/0x130 kasan_check_range+0x145/0x1a0 smb2_is_network_name_deleted+0x27/0x160 cifs_demultiplex_thread.cold+0x172/0x5a4 kthread+0x165/0x1a0 ret_from_fork+0x1f/0x30 </TASK> Allocated by task 923: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_slab_alloc+0x54/0x60 kmem_cache_alloc+0x147/0x320 mempool_alloc+0xe1/0x260 cifs_small_buf_get+0x24/0x60 allocate_buffers+0xa1/0x1c0 cifs_demultiplex_thread+0x199/0x10d0 kthread+0x165/0x1a0 ret_from_fork+0x1f/0x30 Freed by task 921: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x40 ____kasan_slab_free+0x143/0x1b0 kmem_cache_free+0xe3/0x4d0 cifs_small_buf_release+0x29/0x90 SMB2_negotiate+0x8b7/0x1c60 smb2_negotiate+0x51/0x70 cifs_negotiate_protocol+0xf0/0x160 cifs_get_smb_ses+0x5fa/0x13c0 mount_get_conns+0x7a/0x750 cifs_mount+0x103/0xd00 cifs_smb3_do_mount+0x1dd/0xcb0 smb3_get_tree+0x1d5/0x300 vfs_get_tree+0x41/0xf0 path_mount+0x9b3/0xdd0 __x64_sys_mount+0x190/0x1d0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 The UAF is because: mount(pid: 921) | cifsd(pid: 923) -------------------------------|------------------------------- | cifs_demultiplex_thread SMB2_negotiate | cifs_send_recv | compound_send_recv | smb_send_rqst | wait_for_response | wait_event_state [1] | | standard_receive3 | cifs_handle_standard | handle_mid | mid->resp_buf = buf; [2] | dequeue_mid [3] KILL the process [4] | resp_iov[i].iov_base = buf | free_rsp_buf [5] | | is_network_name_deleted [6] | callback 1. After send request to server, wait the response until mid->mid_state != SUBMITTED; 2. Receive response from server, and set it to mid; 3. Set the mid state to RECEIVED; 4. Kill the process, the mid state already RECEIVED, get 0; 5. Handle and release the negotiate response; 6. UAF. It can be easily reproduce with add some delay in [3] - [6]. Only sync call has the problem since async call's callback is executed in cifsd process. Add an extra state to mark the mid state to READY before wakeup the waitter, then it can get the resp safely. Fixes: ec637e3ffb6b ("[CIFS] Avoid extra large buffer allocation (and memcpy) in cifs_readpages") Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-19proc: nommu: fix empty /proc/<pid>/mapsBen Wolsieffer2-17/+22
On no-MMU, /proc/<pid>/maps reads as an empty file. This happens because find_vma(mm, 0) always returns NULL (assuming no vma actually contains the zero address, which is normally the case). To fix this bug and improve the maintainability in the future, this patch makes the no-MMU implementation as similar as possible to the MMU implementation. The only remaining differences are the lack of hold/release_task_mempolicy and the extra code to shoehorn the gate vma into the iterator. This has been tested on top of 6.5.3 on an STM32F746. Link: https://lkml.kernel.org/r/20230915160055.971059-2-ben.wolsieffer@hefring.com Fixes: 0c563f148043 ("proc: remove VMA rbtree use from nommu") Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Giulio Benetti <giulio.benetti@benettiengineering.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-19proc: nommu: /proc/<pid>/maps: release mmap read lockBen Wolsieffer1-12/+15
The no-MMU implementation of /proc/<pid>/map doesn't normally release the mmap read lock, because it uses !IS_ERR_OR_NULL(_vml) to determine whether to release the lock. Since _vml is NULL when the end of the mappings is reached, the lock is not released. Reading /proc/1/maps twice doesn't cause a hang because it only takes the read lock, which can be taken multiple times and therefore doesn't show any problem if the lock isn't released. Instead, you need to perform some operation that attempts to take the write lock after reading /proc/<pid>/maps. To actually reproduce the bug, compile the following code as 'proc_maps_bug': #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main(int argc, char *argv[]) { void *buf; sleep(1); buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); puts("mmap returned"); return 0; } Then, run: ./proc_maps_bug &; cat /proc/$!/maps; fg Without this patch, mmap() will hang and the command will never complete. This code was incorrectly adapted from the MMU implementation, which at the time released the lock in m_next() before returning the last entry. The MMU implementation has diverged further from the no-MMU version since then, so this patch brings their locking and error handling into sync, fixing the bug and hopefully avoiding similar issues in the future. Link: https://lkml.kernel.org/r/20230914163019.4050530-2-ben.wolsieffer@hefring.com Fixes: 47fecca15c09 ("fs/proc/task_nommu.c: don't use priv->task->mm") Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Giulio Benetti <giulio.benetti@benettiengineering.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-19smb3: do not start laundromat thread when dir leasesSteve French6-10/+24
disabled When no directory lease support, or for IPC shares where directories can not be opened, do not start an unneeded laundromat thread for that mount (it wastes resources). Fixes: d14de8067e3f ("cifs: Add a laundromat thread for cached directories") Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-19iomap: convert iomap_unshare_iter to use large foliosDarrick J. Wong1-10/+14
Convert iomap_unshare_iter to create large folios if possible, since the write and zeroing paths already do that. I think this got missed in the conversion of the write paths that landed in 6.6-rc1. Cc: ritesh.list@gmail.com, willy@infradead.org Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
2023-09-19smb3: Add dynamic trace points for RDMA (smbdirect) reconnectSteve French2-3/+8
smb3_smbd_connect_done and smb3_smbd_connect_err To improve debugging of RDMA issues add those two. We already had dynamic tracepoints for non-RDMA connect done and error cases. Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-19iomap: don't skip reading in !uptodate folios when unsharing a rangeDarrick J. Wong1-2/+4
Prior to commit a01b8f225248e, we would always read in the contents of a !uptodate folio prior to writing userspace data into the folio, allocated a folio state object, etc. Ritesh introduced an optimization that skips all of that if the write would cover the entire folio. Unfortunately, the optimization misses the unshare case, where we always have to read in the folio contents since there isn't a data buffer supplied by userspace. This can result in stale kernel memory exposure if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared file that isn't already cached. This was caught by observing fstests regressions in the "unshare around" mechanism that is used for unaligned writes to a reflinked realtime volume when the realtime extent size is larger than 1FSB, though I think it applies to any shared file. Cc: ritesh.list@gmail.com, willy@infradead.org Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
2023-09-18Merge tag 'nfs-for-6.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds5-54/+116
Pull NFS client fixes from Anna Schumaker: "Various O_DIRECT related fixes from Trond: - Error handling - Locking issues - Use the correct commit info for joining page groups - Fixes for rescheduling IO Sunrpc bad verifier fixes: - Report EINVAL errors from connect() - Revalidate creds that the server has rejected - Revert "SUNRPC: Fail faster on bad verifier" Misc: - Fix pNFS session trunking when MDS=DS - Fix zero-value filehandles for post-open getattr operations - Fix compiler warning about tautological comparisons - Revert 'SUNRPC: clean up integer overflow check' before Trond's fix" * tag 'nfs-for-6.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: SUNRPC: Silence compiler complaints about tautological comparisons Revert "SUNRPC: clean up integer overflow check" NFSv4.1: fix zero value filehandle in post open getattr NFSv4.1: fix pnfs MDS=DS session trunking Revert "SUNRPC: Fail faster on bad verifier" SUNRPC: Mark the cred for revalidation if the server rejects it NFS/pNFS: Report EINVAL errors from connect() to the server NFS: More fixes for nfs_direct_write_reschedule_io() NFS: Use the correct commit info in nfs_join_page_group() NFS: More O_DIRECT accounting fixes for error paths NFS: Fix O_DIRECT locking issues NFS: Fix error handling for O_DIRECT write scheduling