summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-06-17Merge tag 'fs_for_v5.19-rc3' of ↵Linus Torvalds1-6/+3
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull writeback and ext2 fixes from Jan Kara: "A fix for writeback bug which prevented machines with kdevtmpfs from booting and also one small ext2 bugfix in IO error handling" * tag 'fs_for_v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: init: Initialize noop_backing_dev_info early ext2: fix fs corruption when trying to remove a non-empty directory with IO error
2022-06-17io_uring: recycle provided buffer if we punt to io-wqJens Axboe1-0/+1
io_arm_poll_handler() will recycle the buffer appropriately if we end up arming poll (or if we're ready to retry), but not for the io-wq case if we have attempted poll first. Explicitly recycle the buffer to avoid both hanging on to it too long, but also to avoid multiple reads grabbing the same one. This can happen for ring mapped buffers, since it hasn't necessarily been committed. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Link: https://github.com/axboe/liburing/issues/605 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-17fat: add renameat2 RENAME_EXCHANGE flag supportJavier Martinez Canillas1-1/+122
The renameat2 RENAME_EXCHANGE flag allows to atomically exchange two paths but is currently not supported by the Linux vfat filesystem driver. Add a vfat_rename_exchange() helper function that implements this support. The super block lock is acquired during the operation to ensure atomicity, and in the error path actions made are reversed also with the mutex held. It makes the operation as transactional as possible, within the limitation impossed by vfat due not having a journal with logs to replay. Link: https://lkml.kernel.org/r/20220610075721.1182745-4-javierm@redhat.com Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Alexander Larsson <alexl@redhat.com> Cc: Christian Kellner <ckellner@redhat.com> Cc: Chung-Chiang Cheng <cccheng@synology.com> Cc: Colin Walters <walters@verbum.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Jones <pjones@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17fat: factor out reusable code in vfat_rename() as helper functionsOGAWA Hirofumi1-32/+57
The vfat_rename() function is quite big and there are code blocks that can be moved into helper functions. This not only simplify the implementation of that function but also allows these helpers to be reused. For example, the helpers can be used by the handler of the RENAME_EXCHANGE flag once this is implemented in a subsequent patch. Link: https://lkml.kernel.org/r/20220610075721.1182745-3-javierm@redhat.com Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Alexander Larsson <alexl@redhat.com> Cc: Christian Kellner <ckellner@redhat.com> Cc: Chung-Chiang Cheng <cccheng@synology.com> Cc: Colin Walters <walters@verbum.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Jones <pjones@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17fat: add a vfat_rename2() and make existing .rename callback a helperJavier Martinez Canillas1-7/+14
Patch series "fat: add support for the renameat2 RENAME_EXCHANGE flag", v6. The series adds support for the renameat2 system call RENAME_EXCHANGE flag (which allows to atomically replace two paths) to the vfat filesystem code. There are many use cases for this, but we are particularly interested in making possible for vfat filesystems to be part of OSTree [0] deployments. Currently OSTree relies on symbolic links to make the deployment updates an atomic transactional operation. But RENAME_EXCHANGE could be used [1] to achieve a similar level of robustness when using a vfat filesystem. Patch #1 is just a preparatory patch to introduce the RENAME_EXCHANGE support, patch #2 moves some code blocks in vfat_rename() to a set of helper functions, that can be reused by tvfat_rename_exchange() that's added by patch #3 and finally patch #4 adds some kselftests to test it. This patch (of 4): Currently vfat only supports the RENAME_NOREPLACE flag which is handled by the virtual file system layer but doesn't support the RENAME_EXCHANGE flag. Add a vfat_rename2() function to be used as the .rename callback and move the current vfat_rename() handler to a helper. This is in preparation for implementing the RENAME_NOREPLACE flag using a different helper function. Link: https://lkml.kernel.org/r/20220610075721.1182745-1-javierm@redhat.com Link: https://lkml.kernel.org/r/20220610075721.1182745-2-javierm@redhat.com Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Christian Kellner <ckellner@redhat.com> Cc: Peter Jones <pjones@redhat.com> Cc: Chung-Chiang Cheng <cccheng@synology.com> Cc: Lennart Poettering <lennart@poettering.net> Cc: Alexander Larsson <alexl@redhat.com> Cc: Colin Walters <walters@verbum.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17squashfs: don't use intermediate buffer if pages missingPhillip Lougher1-63/+12
Now that the "page actor" can handle missing pages, we don't have to fall back to using an intermediate buffer in Squashfs_readpage_block() if all the pages necessary can't be obtained. Link: https://lkml.kernel.org/r/20220611032133.5743-3-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Xiongwei Song <Xiongwei.Song@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17squashfs: extend "page actor" to handle missing pagesPhillip Lougher10-31/+126
Patch series "Squashfs: handle missing pages decompressing into page cache". This patchset enables Squashfs to handle missing pages when directly decompressing datablocks into the page cache. Previously if the full set of pages needed was not available, Squashfs would have to fall back to using an intermediate buffer (the older method), which is slower, involving a memcopy, and it introduces contention on a shared buffer. The first patch extends the "page actor" code to handle missing pages. The second patch updates Squashfs_readpage_block() to use the new functionality, and removes the code that falls back to using an intermediate buffer. This patchset is independent of the readahead work, and it is standalone. It can be merged on its own. But the readahead patch for efficiency also needs this patch-set. This patch (of 2): This patch extends the "page actor" code to handle missing pages. Previously if the full set of pages needed to decompress a Squashfs datablock was unavailable, this would cause decompression to fail on the missing pages. In this case direct decompression into the page cache could not be achieved and the code would fall back to using the older intermediate buffer method. With this patch, direct decompression into the page cache can be achieved with missing pages. For "multi-shot" decompressors (zlib, xz, zstd), the page actor will allocate a temporary buffer which is passed to the decompressor, and then freed by the page actor. For "single shot" decompressors (lz4, lzo) which decompress into a contiguous "bounce buffer", and which is then copied into the page cache, it would be pointless to allocate a temporary buffer, memcpy into it, and then free it. For these decompressors -ENOMEM is returned, which signifies that the memcpy for that page should be skipped. This also happens if the data block is uncompressed. Link: https://lkml.kernel.org/r/20220611032133.5743-1-phillip@squashfs.org.uk Link: https://lkml.kernel.org/r/20220611032133.5743-2-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Xiongwei Song <Xiongwei.Song@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17fs/kernel_read_file: allow to read files up-to ssize_tPasha Tatashin1-19/+19
Patch series "Allow to kexec with initramfs larger than 2G", v2. Currently, the largest initramfs that is supported by kexec_file_load() syscall is 2G. This is because kernel_read_file() returns int, and is limited to INT_MAX or 2G. On the other hand, there are kexec based boot loaders (i.e. u-root), that may need to boot netboot images that might be larger than 2G. The first patch changes the return type from int to ssize_t in kernel_read_file* functions. The second patch increases the maximum initramfs file size to 4G. Tested: verified that can kexec_file_load() works with 4G initramfs on x86_64. This patch (of 2): Currently, the maximum file size that is supported is 2G. This may be too small in some cases. For example, kexec_file_load() system call loads initramfs. In some netboot cases initramfs can be rather large. Allow to use up-to ssize_t bytes. The callers still can limit the maximum file size via buf_size. Link: https://lkml.kernel.org/r/20220527025535.3953665-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20220527025535.3953665-2-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baoquan He <bhe@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Thelen <gthelen@google.com> Cc: Sasha Levin <sashal@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17ocfs2: kill EBUSY from dlmfs_evict_inodeJunxiao Bi1-3/+11
When unlinking a dlmfs, first it will invoke dlmfs_unlink(), and then invoke dlmfs_evict_inode(), user_dlm_destroy_lock() is invoked in both places, the second one from dlmfs_evict_inode() will get EBUSY error because USER_LOCK_IN_TEARDOWN is already set in lockres. This doesn't affect any function, just the error log is annoying. Link: https://lkml.kernel.org/r/20220607171226.86672-1-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17hugetlbfs: zero partial pages during fallocate hole punchMike Kravetz1-15/+53
hugetlbfs fallocate support was originally added with commit 70c3547e36f5 ("hugetlbfs: add hugetlbfs_fallocate()"). Initial support only operated on whole hugetlb pages. This makes sense for populating files as other interfaces such as mmap and truncate require hugetlb page size alignment. Only operating on whole hugetlb pages for the hole punch case was a simplification and there was no compelling use case to zero partial pages. In a recent discussion[1] it was assumed that hugetlbfs hole punch would zero partial hugetlb pages as that is in line with the man page description saying 'partial filesystem blocks are zeroed'. However, the hugetlbfs hole punch code actually does this: hole_start = round_up(offset, hpage_size); hole_end = round_down(offset + len, hpage_size); Modify code to zero partial hugetlb pages in hole punch range. It is possible that application code could note a change in behavior. However, that would imply the code is passing in an unaligned range and expecting only whole pages be removed. This is unlikely as the fallocate documentation states the opposite. The current hugetlbfs fallocate hole punch behavior is tested with the libhugetlbfs test fallocate_align[2]. This test will be updated to validate partial page zeroing. [1] https://lore.kernel.org/linux-mm/20571829-9d3d-0b48-817c-b6b15565f651@redhat.com/ [2] https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/fallocate_align.c Link: https://lkml.kernel.org/r/YqeiMlZDKI1Kabfe@monkey Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17smb3: add trace point for SMB2_set_eofSteve French2-0/+40
In order to debug problems with file size being reported incorrectly temporarily (in this case xfstest generic/584 intermittent failure) we need to add trace point for the non-compounded code path where we set the file size (SMB2_set_eof). The new trace point is: "smb3_set_eof" Here is sample output from the tracepoint: TASK-PID CPU# ||||| TIMESTAMP FUNCTION | | | ||||| | | xfs_io-75403 [002] ..... 95219.189835: smb3_set_eof: xid=221 sid=0xeef1cbd2 tid=0x27079ee6 fid=0x52edb58c offset=0x100000 aio-dio-append--75418 [010] ..... 95219.242402: smb3_set_eof: xid=226 sid=0xeef1cbd2 tid=0x27079ee6 fid=0xae89852d offset=0x0 Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-179p: fix EBADF errors in cached modeDominique Martinet1-0/+13
cached operations sometimes need to do invalid operations (e.g. read on a write only file) Historic fscache had added a "writeback fid", a special handle opened RW as root, for this. The conversion to new fscache missed that bit. This commit reinstates a slightly lesser variant of the original code that uses the writeback fid for partial pages backfills if the regular user fid had been open as WRONLY, and thus would lack read permissions. Link: https://lkml.kernel.org/r/20220614033802.1606738-1-asmadeus@codewreck.org Fixes: eb497943fa21 ("9p: Convert to using the netfs helper lib to do reads and caching") Cc: stable@vger.kernel.org Cc: David Howells <dhowells@redhat.com> Reported-By: Christian Schoenebeck <linux_oss@crudebyte.com> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Tested-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-06-16ext4: improve write performance with disabled delallocJan Kara1-1/+1
When delayed allocation is disabled (either through mount option or because we are running low on free space), ext4_write_begin() allocates blocks with EXT4_GET_BLOCKS_IO_CREATE_EXT flag. With this flag extent merging is disabled and since ext4_write_begin() is called for each page separately, we end up with a *lot* of 1 block extents in the extent tree and following writeback is writing 1 block at a time which results in very poor write throughput (4 MB/s instead of 200 MB/s). These days when ext4_get_block_unwritten() is used only by ext4_write_begin(), ext4_page_mkwrite() and inline data conversion, we can safely allow extent merging to happen from these paths since following writeback will happen on different boundaries anyway. So use EXT4_GET_BLOCKS_CREATE_UNRIT_EXT instead which restores the performance. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220520111402.4252-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16ext4: fix warning when submitting superblock in ext4_commit_super()Zhang Yi1-6/+16
We have already check the io_error and uptodate flag before submitting the superblock buffer, and re-set the uptodate flag if it has been failed to write out. But it was lockless and could be raced by another ext4_commit_super(), and finally trigger '!uptodate' WARNING when marking buffer dirty. Fix it by submit buffer directly. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220520023216.3065073-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16io_uring: do not use prio task_work_add in uring_cmdDylan Yudaken1-1/+1
io_req_task_prio_work_add has a strict assumption that it will only be used with io_req_task_complete. There is a codepath that assumes this is the case and will not even call the completion function if it is hit. For uring_cmd with an arbitrary completion function change the call to the correct non-priority version. Fixes: ee692a21e9bf8 ("fs,io_uring: add infrastructure for uring-cmd") Signed-off-by: Dylan Yudaken <dylany@fb.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20220616135011.441980-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16ext4: fix incorrect comment in ext4_bio_write_page()Wang Jianjian1-1/+1
Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-1-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16fs: fix jbd2_journal_try_to_free_buffers() kernel-doc commentYang Li1-1/+1
Add the description of @folio and remove @page in function kernel-doc comment to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/jbd2/transaction.c:2149: warning: Function parameter or member 'folio' not described in 'jbd2_journal_try_to_free_buffers' fs/jbd2/transaction.c:2149: warning: Excess function parameter 'page' description in 'jbd2_journal_try_to_free_buffers' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20220512075432.31763-1-yang.lee@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16io_uring: commit non-pollable provided mapped buffers upfrontJens Axboe1-1/+1
For recv/recvmsg, IO either completes immediately or gets queued for a retry. This isn't the case for read/readv, if eg a normal file or a block device is used. Here, an operation can get queued with the block layer. If this happens, ring mapped buffers must get committed immediately to avoid that the next read can consume the same buffer. Check if we're dealing with pollable file, when getting a new ring mapped provided buffer. If it's not, commit it immediately rather than wait post issue. If we don't wait, we can race with completions coming in, or just plain buffer reuse by committing after a retry where others could have grabbed the same buffer. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16ext2: fix fs corruption when trying to remove a non-empty directory with IO ↵Ye Bin1-6/+3
error We got issue as follows: [home]# mount /dev/sdd test [home]# cd test [test]# ls dir1 lost+found [test]# rmdir dir1 ext2_empty_dir: inject fault [test]# ls lost+found [test]# cd .. [home]# umount test [home]# fsck.ext2 -fn /dev/sdd e2fsck 1.42.9 (28-Dec-2013) Pass 1: Checking inodes, blocks, and sizes Inode 4065, i_size is 0, should be 1024. Fix? no Pass 2: Checking directory structure Pass 3: Checking directory connectivity Unconnected directory inode 4065 (/???) Connect to /lost+found? no '..' in ... (4065) is / (2), should be <The NULL inode> (0). Fix? no Pass 4: Checking reference counts Inode 2 ref count is 3, should be 4. Fix? no Inode 4065 ref count is 2, should be 3. Fix? no Pass 5: Checking group summary information /dev/sdd: ********** WARNING: Filesystem still has errors ********** /dev/sdd: 14/128016 files (0.0% non-contiguous), 18477/512000 blocks Reason is same with commit 7aab5c84a0f6. We can't assume directory is empty when read directory entry failed. Link: https://lore.kernel.org/r/20220615090010.1544152-1-yebin10@huawei.com Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-06-16xfs: preserve DIFLAG2_NREXT64 when setting other inode attributesDarrick J. Wong1-1/+2
It is vitally important that we preserve the state of the NREXT64 inode flag when we're changing the other flags2 fields. Fixes: 9b7d16e34bbe ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2022-06-16xfs: fix variable state usageDarrick J. Wong1-2/+1
The variable @args is fed to a tracepoint, and that's the only place it's used. This is fine for the kernel, but for userspace, tracepoints are #define'd out of existence, which results in this warning on gcc 11.2: xfs_attr.c: In function ‘xfs_attr_node_try_addname’: xfs_attr.c:1440:42: warning: unused variable ‘args’ [-Wunused-variable] 1440 | struct xfs_da_args *args = attr->xattri_da_args; | ^~~~ Clean this up. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2022-06-16xfs: fix TOCTOU race involving the new logged xattrs control knobDarrick J. Wong6-22/+34
I found a race involving the larp control knob, aka the debugging knob that lets developers enable logging of extended attribute updates: Thread 1 Thread 2 echo 0 > /sys/fs/xfs/debug/larp setxattr(REPLACE) xfs_has_larp (returns false) xfs_attr_set echo 1 > /sys/fs/xfs/debug/larp xfs_attr_defer_replace xfs_attr_init_replace_state xfs_has_larp (returns true) xfs_attr_init_remove_state <oops, wrong DAS state!> This isn't a particularly severe problem right now because xattr logging is only enabled when CONFIG_XFS_DEBUG=y, and developers *should* know what they're doing. However, the eventual intent is that callers should be able to ask for the assistance of the log in persisting xattr updates. This capability might not be required for /all/ callers, which means that dynamic control must work correctly. Once an xattr update has decided whether or not to use logged xattrs, it needs to stay in that mode until the end of the operation regardless of what subsequent parallel operations might do. Therefore, it is an error to continue sampling xfs_globals.larp once xfs_attr_change has made a decision about larp, and it was not correct for me to have told Allison that ->create_intent functions can sample the global log incompat feature bitfield to decide to elide a log item. Instead, create a new op flag for the xfs_da_args structure, and convert all other callers of xfs_has_larp and xfs_sb_version_haslogxattrs within the attr update state machine to look for the operations flag. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2022-06-15NFSv4: Add FMODE_CAN_ODIRECT after successful open of a NFS4.x fileDave Wysochanski2-0/+2
Commit a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") added the FMODE_CAN_ODIRECT flag for NFSv3 but neglected to add it for NFSv4.x. This causes direct io on NFSv4.x to fail open with EINVAL: mount -o vers=4.2 127.0.0.1:/export /mnt/nfs4 dd if=/dev/zero of=/mnt/nfs4/file.bin bs=128k count=1 oflag=direct dd: failed to open '/mnt/nfs4/file.bin': Invalid argument dd of=/dev/null if=/mnt/nfs4/file.bin bs=128k count=1 iflag=direct dd: failed to open '/mnt/dir1/file1.bin': Invalid argument Fixes: a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-15Merge tag 'fs.fixes.v5.19-rc3' of ↵Linus Torvalds1-6/+20
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull vfs idmapping fix from Christian Brauner: "This fixes an issue where we fail to change the group of a file when the caller owns the file and is a member of the group to change to. This is only relevant on idmapped mounts. There's a detailed description in the commit message and regression tests have been added to xfstests" * tag 'fs.fixes.v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fs: account for group membership
2022-06-15io_uring: make io_fill_cqe_aux honour CQE32Pavel Begunkov1-0/+5
Don't let io_fill_cqe_aux() post 16B cqes for CQE32 rings, neither the kernel nor the userspace expect this to happen. Fixes: 76c68fbf1a1f9 ("io_uring: enable CQE32") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/64fae669fae1b7083aa15d0cd807f692b0880b9a.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15io_uring: remove __io_fill_cqe() helperPavel Begunkov1-21/+16
In preparation for the following patch, inline __io_fill_cqe(), there is only one user. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/71dab9afc3cde3f8b64d26f20d3b60bdc40726ff.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15io_uring: fix ->extra{1,2} misusePavel Begunkov1-2/+10
We don't really know the state of req->extra{1,2] fields in __io_fill_cqe_req(), if an opcode handler is not aware of CQE32 option, it never sets them up properly. Track the state of those fields with a request flag. Fixes: 76c68fbf1a1f9 ("io_uring: enable CQE32") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4b3e5be512fbf4debec7270fd485b8a3b014d464.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15io_uring: fill extra big cqe fields from reqPavel Begunkov1-68/+10
The only user of io_req_complete32()-like functions is cmd requests. Instead of keeping the whole complete32 family, remove them and provide the extras in already added for inline completions req->extra{1,2}. When fill_cqe_res() finds CQE32 option enabled it'll use those fields to fill a 32B cqe. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/af1319eb661b1f9a0abceb51cbbf72b8002e019d.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15io_uring: unite fill_cqe and the 32B versionPavel Begunkov1-19/+42
We want just one function that will handle both normal cqes and 32B cqes. Combine __io_fill_cqe_req() and __io_fill_cqe_req32(). It's still not entirely correct yet, but saves us from cases when we fill an CQE of a wrong size. Fixes: 76c68fbf1a1f9 ("io_uring: enable CQE32") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8085c5b2f74141520f60decd45334f87e389b718.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15io_uring: get rid of __io_fill_cqe{32}_req()Pavel Begunkov1-49/+21
There are too many cqe filling helpers, kill __io_fill_cqe{32}_req(), use __io_fill_cqe{32}_req_filled() instead, and then rename it. It'll simplify fixing in following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c18e0d191014fb574f24721245e4e3fddd0b6917.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-159p: Fix refcounting during full path walks for fid lookupsTyler Hicks1-13/+9
Decrement the refcount of the parent dentry's fid after walking each path component during a full path walk for a lookup. Failure to do so can lead to fids that are not clunked until the filesystem is unmounted, as indicated by this warning: 9pnet: found fid 3 not clunked The improper refcounting after walking resulted in open(2) returning -EIO on any directories underneath the mount point when using the virtio transport. When using the fd transport, there's no apparent issue until the filesytem is unmounted and the warning above is emitted to the logs. In some cases, the user may not yet be attached to the filesystem and a new root fid, associated with the user, is created and attached to the root dentry before the full path walk is performed. Increment the new root fid's refcount to two in that situation so that it can be safely decremented to one after it is used for the walk operation. The new fid will still be attached to the root dentry when v9fs_fid_lookup_with_uid() returns so a final refcount of one is correct/expected. Link: https://lkml.kernel.org/r/20220527000003.355812-2-tyhicks@linux.microsoft.com Link: https://lkml.kernel.org/r/20220612085330.1451496-4-asmadeus@codewreck.org Fixes: 6636b6dcc3db ("9p: add refcount to p9_fid struct") Cc: stable@vger.kernel.org Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> [Dominique: fix clunking fid multiple times discussed in second link] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-06-159p: fix fid refcount leak in v9fs_vfs_get_linkDominique Martinet1-4/+4
we check for protocol version later than required, after a fid has been obtained. Just move the version check earlier. Link: https://lkml.kernel.org/r/20220612085330.1451496-3-asmadeus@codewreck.org Fixes: 6636b6dcc3db ("9p: add refcount to p9_fid struct") Cc: stable@vger.kernel.org Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-06-159p: fix fid refcount leak in v9fs_vfs_atomic_open_dotlDominique Martinet1-0/+3
We need to release directory fid if we fail halfway through open This fixes fid leaking with xfstests generic 531 Link: https://lkml.kernel.org/r/20220612085330.1451496-2-asmadeus@codewreck.org Fixes: 6636b6dcc3db ("9p: add refcount to p9_fid struct") Cc: stable@vger.kernel.org Reported-by: Tyler Hicks <tyhicks@linux.microsoft.com> Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-06-14io_uring: remove IORING_CLOSE_FD_AND_FILE_SLOTPavel Begunkov1-9/+3
This partially reverts a7c41b4687f5902af70cd559806990930c8a307b Even though IORING_CLOSE_FD_AND_FILE_SLOT might save cycles for some users, but it tries to do two things at a time and it's not clear how to handle errors and what to return in a single result field when one part fails and another completes well. Kill it for now. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/837c745019b3795941eee4fcfd7de697886d645b.1655224415.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-14Revert "io_uring: add buffer selection support to IORING_OP_NOP"Pavel Begunkov1-14/+1
This reverts commit 3d200242a6c968af321913b635fc4014b238cba4. Buffer selection with nops was used for debugging and benchmarking but is useless in real life. Let's revert it before it's released. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c5012098ca6b51dfbdcb190f8c4e3c0bf1c965dc.1655224415.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-14Revert "io_uring: support CQE32 for nop operation"Pavel Begunkov1-20/+1
This reverts commit 2bb04df7c2af9dad5d28771c723bc39b01cf7df4. CQE32 nops were used for debugging and benchmarking but it doesn't target any real use case. Revert it, we can return it back if someone finds a good way to use it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5ff623d84ccb4b3f3b92a3ea41cdcfa612f3d96f.1655224415.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-14fs: account for group membershipChristian Brauner1-6/+20
When calling setattr_prepare() to determine the validity of the attributes the ia_{g,u}id fields contain the value that will be written to inode->i_{g,u}id. This is exactly the same for idmapped and non-idmapped mounts and allows callers to pass in the values they want to see written to inode->i_{g,u}id. When group ownership is changed a caller whose fsuid owns the inode can change the group of the inode to any group they are a member of. When searching through the caller's groups we need to use the gid mapped according to the idmapped mount otherwise we will fail to change ownership for unprivileged users. Consider a caller running with fsuid and fsgid 1000 using an idmapped mount that maps id 65534 to 1000 and 65535 to 1001. Consequently, a file owned by 65534:65535 in the filesystem will be owned by 1000:1001 in the idmapped mount. The caller now requests the gid of the file to be changed to 1000 going through the idmapped mount. In the vfs we will immediately map the requested gid to the value that will need to be written to inode->i_gid and place it in attr->ia_gid. Since this idmapped mount maps 65534 to 1000 we place 65534 in attr->ia_gid. When we check whether the caller is allowed to change group ownership we first validate that their fsuid matches the inode's uid. The inode->i_uid is 65534 which is mapped to uid 1000 in the idmapped mount. Since the caller's fsuid is 1000 we pass the check. We now check whether the caller is allowed to change inode->i_gid to the requested gid by calling in_group_p(). This will compare the passed in gid to the caller's fsgid and search the caller's additional groups. Since we're dealing with an idmapped mount we need to pass in the gid mapped according to the idmapped mount. This is akin to checking whether a caller is privileged over the future group the inode is owned by. And that needs to take the idmapped mount into account. Note, all helpers are nops without idmapped mounts. New regression test sent to xfstests. Link: https://github.com/lxc/lxd/issues/10537 Link: https://lore.kernel.org/r/20220613111517.2186646-1-brauner@kernel.org Fixes: 2f221d6f7b88 ("attr: handle idmapped mounts") Cc: Seth Forshee <sforshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org # 5.15+ CC: linux-fsdevel@vger.kernel.org Reviewed-by: Seth Forshee <sforshee@digitalocean.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-06-13Merge branch 'io_uring/io_uring-5.19' of https://github.com/isilence/linux ↵Jens Axboe1-25/+50
into io_uring-5.19 Pull io_uring fixes from Pavel. * 'io_uring/io_uring-5.19' of https://github.com/isilence/linux: io_uring: fix double unlock for pbuf select io_uring: kbuf: fix bug of not consuming ring buffer in partial io case io_uring: openclose: fix bug of closing wrong fixed file io_uring: fix not locked access to fixed buf table io_uring: fix races with buffer table unregister io_uring: fix races with file table unregister
2022-06-13io_uring: limit size of provided buffer ringDylan Yudaken1-0/+4
The type of head and tail do not allow more than 2^15 entries in a provided buffer ring, so do not allow this. At 2^16 while each entry can be indexed, there is no way to disambiguate full vs empty. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220613101157.3687-4-dylany@fb.com Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-13io_uring: fix types in provided buffer ringDylan Yudaken1-3/+3
The type of head needs to match that of tail in order for rollover and comparisons to work correctly. Without this change the comparison of tail to head might incorrectly allow io_uring to use a buffer that userspace had not given it. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220613101157.3687-3-dylany@fb.com Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-13io_uring: fix index calculationDylan Yudaken1-1/+1
When indexing into a provided buffer ring, do not subtract 1 from the index. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220613101157.3687-2-dylany@fb.com Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-13io_uring: fix double unlock for pbuf selectPavel Begunkov1-3/+1
io_buffer_select(), which is the only caller of io_ring_buffer_select(), fully handles locking, mutex unlock in io_ring_buffer_select() will lead to double unlock. Fixes: c7fb19428d67d ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-13io_uring: kbuf: fix bug of not consuming ring buffer in partial io caseHao Xu1-4/+16
When we use ring-mapped provided buffer, we should consume it before arm poll if partial io has been done. Otherwise the buffer may be used by other requests and thus we lost the data. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Hao Xu <howeyxu@tencent.com> [pavel: 5.19 rebase] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-13io_uring: openclose: fix bug of closing wrong fixed fileHao Xu1-1/+1
Don't update ret until fixed file is closed, otherwise the file slot becomes the error code. Fixes: a7c41b4687f5 ("io_uring: let IORING_OP_FILES_UPDATE support choosing fixed file slots") Signed-off-by: Hao Xu <howeyxu@tencent.com> [pavel: 5.19 rebase] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-13io_uring: fix not locked access to fixed buf tablePavel Begunkov1-17/+17
We can look inside the fixed buffer table only while holding ->uring_lock, however in some cases we don't do the right async prep for IORING_OP_{WRITE,READ}_FIXED ending up with NULL req->imu forcing making an io-wq worker to try to resolve the fixed buffer without proper locking. Move req->imu setup into early req init paths, i.e. io_prep_rw(), which is called unconditionally for rw requests and under uring_lock. Fixes: 634d00df5e1cf ("io_uring: add full-fledged dynamic buffers support") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-13io_uring: fix races with buffer table unregisterPavel Begunkov1-0/+7
Fixed buffer table quiesce might unlock ->uring_lock, potentially letting new requests to be submitted, don't allow those requests to use the table as they will race with unregistration. Reported-and-tested-by: van fantasy <g1042620637@gmail.com> Fixes: bd54b6fe3316ec ("io_uring: implement fixed buffers registration similar to fixed files") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-13io_uring: fix races with file table unregisterPavel Begunkov1-0/+8
Fixed file table quiesce might unlock ->uring_lock, potentially letting new requests to be submitted, don't allow those requests to use the table as they will race with unregistration. Reported-and-tested-by: van fantasy <g1042620637@gmail.com> Fixes: 05f3fb3c53975 ("io_uring: avoid ring quiesce for fixed file set unregister and update") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-12Merge tag '5.19-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds6-14/+29
Pull cifs client fixes from Steve French: "Three reconnect fixes, all for stable as well. One of these three reconnect fixes does address a problem with multichannel reconnect, but this does not include the additional fix (still being tested) for dynamically detecting multichannel adapter changes which will improve those reconnect scenarios even more" * tag '5.19-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: populate empty hostnames for extra channels cifs: return errors during session setup during reconnects cifs: fix reconnect on smb3 mount types
2022-06-11Merge tag 'nfsd-5.19-1' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Notable changes: - There is now a backup maintainer for NFSD Notable fixes: - Prevent array overruns in svc_rdma_build_writes() - Prevent buffer overruns when encoding NFSv3 READDIR results - Fix a potential UAF in nfsd_file_put()" * tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: SUNRPC: Remove pointer type casts from xdr_get_next_encode_buffer() SUNRPC: Clean up xdr_get_next_encode_buffer() SUNRPC: Clean up xdr_commit_encode() SUNRPC: Optimize xdr_reserve_space() SUNRPC: Fix the calculation of xdr->end in xdr_get_next_encode_buffer() SUNRPC: Trap RDMA segment overflows NFSD: Fix potential use-after-free in nfsd_file_put() MAINTAINERS: reciprocal co-maintainership for file locking and nfsd
2022-06-11cifs: populate empty hostnames for extra channelsShyam Prasad N2-1/+8
Currently, the secondary channels of a multichannel session also get hostname populated based on the info in primary channel. However, this will end up with a wrong resolution of hostname to IP address during reconnect. This change fixes this by not populating hostname info for all secondary channels. Fixes: 5112d80c162f ("cifs: populate server_hostname for extra channels") Cc: stable@vger.kernel.org Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>