summaryrefslogtreecommitdiff
path: root/fs/nfs/dir.c
AgeCommit message (Collapse)AuthorFilesLines
2023-08-24nfs: fix redundant readdir request after get eofKinglong Mee1-4/+11
When a directory contains 17 files (except . and ..), nfs client sends a redundant readdir request after get eof. A simple reproduce, At NFS server, create a directory with 17 files under exported directory. # mkdir test # cd test # for i in {0..16} ; do touch $i; done At NFS client, no matter mounting through nfsv3 or nfsv4, does ls (or ll) at the created test directory. A tshark output likes following (for nfsv4), # tshark -i eth0 tcp port 2049 -Tfields -e ip.src -e ip.dst -e nfs -e nfs.cookie4 srcip dstip SEQUENCE, PUTFH, READDIR 0 dstip srcip SEQUENCE PUTFH READDIR 909539109313539306,2108391201987888856,2305312124304486544,2566335452463141496,2978225129081509984,4263037479923412583,4304697173036510679,4666703455469210097,4759208201298769007,4776701232145978803,5338408478512081262,5949498658935544804,5971526429894832903,6294060338267709855,6528840566229532529,8600463293536422524,9223372036854775807 srcip dstip srcip dstip SEQUENCE, PUTFH, READDIR 9223372036854775807 dstip srcip SEQUENCE PUTFH READDIR The READDIR with cookie 9223372036854775807(0x7FFFFFFFFFFFFFFF) is redundant. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-05-19NFS: Convert kmap_atomic() to kmap_local_folio()Fabio M. De Francesco1-2/+2
kmap_atomic() is deprecated in favor of kmap_local_{folio,page}(). Therefore, replace kmap_atomic() with kmap_local_folio() in nfs_readdir_folio_array_append(). kmap_atomic() disables page-faults and preemption (the latter only for !PREEMPT_RT kernels), However, the code within the mapping/un-mapping in nfs_readdir_folio_array_append() does not depend on the above-mentioned side effects. Therefore, a mere replacement of the old API with the new one is all that is required (i.e., there is no need to explicitly add any calls to pagefault_disable() and/or preempt_disable()). Tested with (x)fstests in a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel with HIGHMEM64GB enabled. Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Fixes: ec108d3cc766 ("NFS: Convert readdir page array functions to use a folio") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-05-09nfs: fix another case of NULL/IS_ERR confusion wrt folio pointersLinus Torvalds1-1/+1
Dan has been improving on the smatch error pointer checks, and pointed at another case where the __filemap_get_folio() conversion to error pointers had been overlooked. This time because it was hidden behind the filemap_grab_folio() helper function that is a wrapper around it. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Anna Schumaker <anna@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-06nfs: fix mis-merged __filemap_get_folio() error checkLinus Torvalds1-1/+1
Fix another case of an incorrect check for the returned 'folio' value from __filemap_get_folio(). The failure case used to return NULL, but was changed by commit 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio"). But in the meantime, commit ec108d3cc766 ("NFS: Convert readdir page array functions to use a folio") added a new user of that function. And my merge of the two did not fix this up correctly. The ext4 merge had the same issue, but that one had been caught in linux-next and got properly fixed while merging. Fixes: 0127f25b5dfc ("Merge tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs") Cc: Anna Schumaker <anna@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-12NFS: Convert readdir page array functions to use a folioAnna Schumaker1-102/+99
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-12NFS: Convert the readdir array-of-pages into an array-of-foliosAnna Schumaker1-64/+65
This patch only converts the actual array, but doesn't touch the individual nfs_cache_array pages and related functions (that will be done in the next patch). I also adjust the names of the fields in the nfs_readdir_descriptor to say "folio" instead of "page". Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-03-14NFS: Correct timing for assigning access cache timestampChengen Du1-1/+1
When the user's login time is newer than the cache's timestamp, the original entry in the RB-tree will be replaced by a new entry. Currently, the timestamp is only set if the entry is not found in the RB-tree, which can cause the timestamp to be undefined when the entry exists. This may result in a significant increase in ACCESS operations if the timestamp is set to zero. Signed-off-by: Chengen Du <chengen.du@canonical.com> Fixes: 0eb43812c027 ("NFS: Clear the file access cache upon login”) Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-23Merge tag 'nfs-for-6.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds1-14/+14
Pull NFS client updates from Anna Schumaker: "New Features: - Convert the read and write paths to use folios Bugfixes and Cleanups: - Fix tracepoint state manager flag printing - Fix disabling swap files - Fix NFSv4 client identifier sysfs path in the documentation - Don't clear NFS_CAP_COPY if server returns NFS4ERR_OFFLOAD_DENIED - Treat GETDEVICEINFO errors as a layout failure - Replace kmap_atomic() calls with kmap_local_page() - Constify sunrpc sysfs kobj_type structures" * tag 'nfs-for-6.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (25 commits) fs/nfs: Replace kmap_atomic() with kmap_local_page() in dir.c pNFS/filelayout: treat GETDEVICEINFO errors as layout failure Documentation: Fix sysfs path for the NFSv4 client identifier nfs42: do not fail with EIO if ssc returns NFS4ERR_OFFLOAD_DENIED NFS: fix disabling of swap SUNRPC: make kobj_type structures constant nfs4trace: fix state manager flag printing NFS: Remove unnecessary check in nfs_read_folio() NFS: Improve tracing of nfs_wb_folio() NFS: Enable tracing of nfs_invalidate_folio() and nfs_launder_folio() NFS: fix up nfs_release_folio() to try to release the page NFS: Clean up O_DIRECT request allocation NFS: Fix up nfs_vm_page_mkwrite() for folios NFS: Convert nfs_write_begin/end to use folios NFS: Remove unused function nfs_wb_page() NFS: Convert buffered writes to use folios NFS: Convert the function nfs_wb_page() to use folios NFS: Convert buffered reads to use folios NFS: Add a helper nfs_wb_folio() NFS: Convert the remaining pagelist helper functions to support folios ...
2023-02-20Merge tag 'fs.idmapped.v6.3' of ↵Linus Torvalds1-7/+7
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs idmapping updates from Christian Brauner: - Last cycle we introduced the dedicated struct mnt_idmap type for mount idmapping and the required infrastucture in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). As promised in last cycle's pull request message this converts everything to rely on struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this was a potential source for bugs. This finishes the conversion. Instead of passing the plain namespace around this updates all places that currently take a pointer to a mnt_userns with a pointer to struct mnt_idmap. Now that the conversion is done all helpers down to the really low-level helpers only accept a struct mnt_idmap argument instead of two namespace arguments. Conflating mount and other idmappings will now cause the compiler to complain loudly thus eliminating the possibility of any bugs. This makes it impossible for filesystem developers to mix up mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. Everything associated with struct mnt_idmap is moved into a single separate file. With that change no code can poke around in struct mnt_idmap. It can only be interacted with through dedicated helpers. That means all filesystems are and all of the vfs is completely oblivious to the actual implementation of idmappings. We are now also able to extend struct mnt_idmap as we see fit. For example, we can decouple it completely from namespaces for users that don't require or don't want to use them at all. We can also extend the concept of idmappings so we can cover filesystem specific requirements. In combination with the vfs{g,u}id_t work we finished in v6.2 this makes this feature substantially more robust and thus difficult to implement wrong by a given filesystem and also protects the vfs. - Enable idmapped mounts for tmpfs and fulfill a longstanding request. A long-standing request from users had been to make it possible to create idmapped mounts for tmpfs. For example, to share the host's tmpfs mount between multiple sandboxes. This is a prerequisite for some advanced Kubernetes cases. Systemd also has a range of use-cases to increase service isolation. And there are more users of this. However, with all of the other work going on this was way down on the priority list but luckily someone other than ourselves picked this up. As usual the patch is tiny as all the infrastructure work had been done multiple kernel releases ago. In addition to all the tests that we already have I requested that Rodrigo add a dedicated tmpfs testsuite for idmapped mounts to xfstests. It is to be included into xfstests during the v6.3 development cycle. This should add a slew of additional tests. * tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits) shmem: support idmapped mounts for tmpfs fs: move mnt_idmap fs: port vfs{g,u}id helpers to mnt_idmap fs: port fs{g,u}id helpers to mnt_idmap fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap fs: port i_{g,u}id_{needs_}update() to mnt_idmap quota: port to mnt_idmap fs: port privilege checking helpers to mnt_idmap fs: port inode_owner_or_capable() to mnt_idmap fs: port inode_init_owner() to mnt_idmap fs: port acl to mnt_idmap fs: port xattr to mnt_idmap fs: port ->permission() to pass mnt_idmap fs: port ->fileattr_set() to pass mnt_idmap fs: port ->set_acl() to pass mnt_idmap fs: port ->get_acl() to pass mnt_idmap fs: port ->tmpfile() to pass mnt_idmap fs: port ->rename() to pass mnt_idmap fs: port ->mknod() to pass mnt_idmap fs: port ->mkdir() to pass mnt_idmap ...
2023-02-15fs/nfs: Replace kmap_atomic() with kmap_local_page() in dir.cFabio M. De Francesco1-14/+14
kmap_atomic() is deprecated in favor of kmap_local_page(). With kmap_local_page() the mappings are per thread, CPU local, can take page-faults, and can be called from any context (including interrupts). Furthermore, the tasks can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and still valid. kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only for !PREEMPT_RT kernels, otherwise it only disables migration). The code within the mappings/un-mappings in the functions of dir.c don't depend on the above-mentioned side effects of kmap_atomic(), so that mere replacements of the old API with the new one is all that is required (i.e., there is no need to explicitly add calls to pagefault_disable() and/or preempt_disable()). Therefore, replace kmap_atomic() with kmap_local_page() in fs/nfs/dir.c. Tested in a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel with HIGHMEM64GB enabled. Suggested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-01-19fs: port ->permission() to pass mnt_idmapChristian Brauner1-2/+2
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19fs: port ->rename() to pass mnt_idmapChristian Brauner1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19fs: port ->mknod() to pass mnt_idmapChristian Brauner1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19fs: port ->mkdir() to pass mnt_idmapChristian Brauner1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19fs: port ->symlink() to pass mnt_idmapChristian Brauner1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19fs: port ->create() to pass mnt_idmapChristian Brauner1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-02NFS: Fix up a sparse warningTrond Myklebust1-1/+3
sparse is warning about an incorrect RCU dereference. fs/nfs/dir.c:2965:56: warning: incorrect type in argument 1 (different address spaces) fs/nfs/dir.c:2965:56: expected struct cred const * fs/nfs/dir.c:2965:56: got struct cred const [noderef] __rcu *const cred Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-01-02NFS: Judge the file access cache's timestamp in rcu pathChengen Du1-0/+3
If the user's login time is newer than the cache's timestamp, we expect the cache may be stale and need to clear. The stale cache will remain in the list's tail if no other users operate on that inode. Once the user accesses the inode, the stale cache will be returned in rcu path. Signed-off-by: Chengen Du <chengen.du@canonical.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-11-28NFS: Trigger the "ls -l" readdir heuristic soonerBenjamin Coddington1-2/+7
Since commit 1a34c8c9a49e ("NFS: Support larger readdir buffers") has updated dtsize, and with recent improvements to the READDIRPLUS helper heuristic, the heuristic may not trigger until many dentries are emitted to userspace. This will cause many thousands of GETATTR calls for "ls -l" when the directory's pagecache has already been populated. This manifests as poor performance for long directory listings after an initially fast "ls -l". Fix this by emitting only 17 entries for any first pass through the NFS directory's ->iterate_shared(), which allows userpace to prime the counters for the heuristic. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-11-28NFS: Clear the file access cache upon loginTrond Myklebust1-0/+23
POSIX typically only refreshes the user's supplementary group information upon login. Since NFS servers may often refresh their concept of the user supplementary group membership at their own cadence, it is possible for the NFS client's access cache to become stale due to the user's group membership changing on the server after the user has already logged in on the client. While it is reasonable to expect that such group membership changes are rare, and that we do not want to optimise the cache to accommodate them, it is also not unreasonable for the user to expect that if they log out and log back in again, that the staleness would clear up. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-10-27nfs: Remove redundant null checks before kfreeYushan Zhou1-3/+2
Fix the following coccicheck warning: fs/nfs/dir.c:2494:2-7: WARNING: NULL check before some freeing functions is not needed. Signed-off-by: Yushan Zhou <katrinzhou@tencent.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-07Merge tag 'pull-file_inode' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull file_inode() updates from Al Vrio: "whack-a-mole: cropped up open-coded file_inode() uses..." * tag 'pull-file_inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: orangefs: use ->f_mapping _nfs42_proc_copy(): use ->f_mapping instead of file_inode()->i_mapping dma_buf: no need to bother with file_inode()->i_mapping nfs_finish_open(): don't open-code file_inode() bprm_fill_uid(): don't open-code file_inode() sgx: use ->f_mapping... exfat_iterate(): don't open-code file_inode(file) ibmvmc: don't open-code file_inode()
2022-09-02nfs_finish_open(): don't open-code file_inode()Al Viro1-1/+1
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-20NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENTNeilBrown1-1/+2
nfs_unlink() calls d_delete() twice if it receives ENOENT from the server - once in nfs_dentry_handle_enoent() from nfs_safe_remove and once in nfs_dentry_remove_handle_error(). nfs_rmddir() also calls it twice - the nfs_dentry_handle_enoent() call is direct and inside a region locked with ->rmdir_sem It is safe to call d_delete() twice if the refcount > 1 as the dentry is simply unhashed. If the refcount is 1, the first call sets d_inode to NULL and the second call crashes. This patch guards the d_delete() call from nfs_dentry_handle_enoent() leaving the one under ->remdir_sem in case that is important. In mainline it would be safe to remove the d_delete() call. However in older kernels to which this might be backported, that would change the behaviour of nfs_unlink(). nfs_unlink() used to unhash the dentry which resulted in nfs_dentry_handle_enoent() not calling d_delete(). So in older kernels we need the d_delete() in nfs_dentry_remove_handle_error() when called from nfs_unlink() but not when called from nfs_rmdir(). To make the code work correctly for old and new kernels, and from both nfs_unlink() and nfs_rmdir(), we protect the d_delete() call with simple_positive(). This ensures it is never called in a circumstance where it could crash. Fixes: 3c59366c207e ("NFS: don't unhash dentry during unlink/rename") Fixes: 9019fb391de0 ("NFS: Label the dentry with a verifier in nfs_rmdir() and nfs_unlink()") Signed-off-by: NeilBrown <neilb@suse.de> Tested-by: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-13NFS: Fix missing unlock in nfs_unlink()Sun Ke1-1/+3
Add the missing unlock before goto. Fixes: 3c59366c207e ("NFS: don't unhash dentry during unlink/rename") Signed-off-by: Sun Ke <sunke32@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-08NFS: don't unhash dentry during unlink/renameNeilBrown1-18/+54
NFS unlink() (and rename over existing target) must determine if the file is open, and must perform a "silly rename" instead of an unlink (or before rename) if it is. Otherwise the client might hold a file open which has been removed on the server. Consequently if it determines that the file isn't open, it must block any subsequent opens until the unlink/rename has been completed on the server. This is currently achieved by unhashing the dentry. This forces any open attempt to the slow-path for lookup which will block on i_rwsem on the directory until the unlink/rename completes. A future patch will change the VFS to only get a shared lock on i_rwsem for unlink, so this will no longer work. Instead we introduce an explicit interlock. A special value is stored in dentry->d_fsdata while the unlink/rename is running and ->d_revalidate blocks while that value is present. When ->d_revalidate unblocks, the dentry will be invalid. This closes the race without requiring exclusion on i_rwsem. d_fsdata is already used in two different ways. 1/ an IS_ROOT directory dentry might have a "devname" stored in d_fsdata. Such a dentry doesn't have a name and so cannot be the target of unlink or rename. For safety we check if an old devname is still stored, and remove it if it is. 2/ a dentry with DCACHE_NFSFS_RENAMED set will have a 'struct nfs_unlinkdata' stored in d_fsdata. While this is set maydelete() will fail, so an unlink or rename will never proceed on such a dentry. Neither of these can be in effect when a dentry is the target of unlink or rename. So we can expect d_fsdata to be NULL, and store a special value ((void*)1) which is given the name NFS_FSDATA_BLOCKED to indicate that any lookup will be blocked. The d_count() is incremented under d_lock() when a lookup finds the dentry, so we check d_count() is low, and set NFS_FSDATA_BLOCKED under the same lock to avoid any races. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-12nfs: Replace kmap() with kmap_local_page()Fabio M. De Francesco1-2/+2
The use of kmap() is being deprecated in favor of kmap_local_page(). With kmap_local_page(), the mapping is per thread, CPU local and not globally visible. Furthermore, the mapping can be acquired from any context (including interrupts). Therefore, use kmap_local_page() in nfs_do_filldir() because this mapping is per thread, CPU local, and not globally visible. Suggested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-11NFS: Fix case insensitive renamesTrond Myklebust1-0/+4
For filesystems that are case insensitive and case preserving, we need to be able to rename from one case folded variant of the filename to another. Currently, if we have looked up the target filename before the call to rename, then we may have a hashed dentry with that target name in the dcache, causing the vfs to optimise away the rename. To avoid that, let's drop the target dentry, and leave it to the server to optimise away the rename if that is the correct thing to do. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-06-15NFSv4: Add FMODE_CAN_ODIRECT after successful open of a NFS4.x fileDave Wysochanski1-0/+1
Commit a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") added the FMODE_CAN_ODIRECT flag for NFSv3 but neglected to add it for NFSv4.x. This causes direct io on NFSv4.x to fail open with EINVAL: mount -o vers=4.2 127.0.0.1:/export /mnt/nfs4 dd if=/dev/zero of=/mnt/nfs4/file.bin bs=128k count=1 oflag=direct dd: failed to open '/mnt/nfs4/file.bin': Invalid argument dd of=/dev/null if=/mnt/nfs4/file.bin bs=128k count=1 iflag=direct dd: failed to open '/mnt/dir1/file1.bin': Invalid argument Fixes: a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-10nfs: Convert to free_folioMatthew Wilcox (Oracle)1-2/+7
Add a wrapper that converts back from the folio to the page. This entire file needs to be converted to use folios, but that's a task for a different set of patches. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-04-07NFS: Replace readdir's use of xxhash() with hash_64()Trond Myklebust1-6/+3
Both xxhash() and hash_64() appear to give similarly low collision rates with a standard linearly increasing readdir offset. They both give similarly higher collision rates when applied to ext4's offsets. So switch to using the standard hash_64(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-30NFSv4: fix open failure with O_ACCMODE flagChenXiaoSong1-10/+0
open() with O_ACCMODE|O_DIRECT flags secondly will fail. Reproducer: 1. mount -t nfs -o vers=4.2 $server_ip:/ /mnt/ 2. fd = open("/mnt/file", O_ACCMODE|O_DIRECT|O_CREAT) 3. close(fd) 4. fd = open("/mnt/file", O_ACCMODE|O_DIRECT) Server nfsd4_decode_share_access() will fail with error nfserr_bad_xdr when client use incorrect share access mode of 0. Fix this by using NFS4_SHARE_ACCESS_BOTH share access mode in client, just like firstly opening. Fixes: ce4ef7c0a8a05 ("NFS: Split out NFS v4 file operations") Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22NFS: Fix revalidation of empty readdir pagesTrond Myklebust1-2/+7
If the page is empty, we need to check the array->last_cookie instead of the first entry. Add a helper for the cases where we care. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22NFS: Don't deadlock when cookie hashes collideTrond Myklebust1-11/+18
In the very rare case where the readdir reply contains multiple cookies that map to the same hash value, we can end up deadlocking waiting for a page lock that we already hold. In this case we should fail the page lock by using grab_cache_page_nowait(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Cache all entries in the readdirplus replyTrond Myklebust1-14/+26
Even if we're not able to cache all the entries in the readdir buffer, let's ensure that we do prime the dcache. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Optimise away the previous cookie fieldTrond Myklebust1-12/+14
Replace the 'previous cookie' field in struct nfs_entry with the array->last_cookie. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Fix up forced readdirplusTrond Myklebust1-17/+39
Avoid clearing the entire readdir page cache if we're just doing forced readdirplus for the 'ls -l' heuristic. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Convert readdir page cache to use a cookie based indexTrond Myklebust1-84/+65
Instead of using a linear index to address the pages, use the cookie of the first entry, since that is what we use to match the page anyway. This allows us to avoid re-reading the entire cache on a seekdir() type of operation. The latter is very common when re-exporting NFS, and is a major performance drain. The change does affect our duplicate cookie detection, since we can no longer rely on the page index as a linear offset for detecting whether we looped backwards. However since we no longer do a linear search through all the pages on each call to nfs_readdir(), this is less of a concern than it was previously. The other downside is that invalidate_mapping_pages() no longer can use the page index to avoid clearing pages that have been read. A subsequent patch will restore the functionality this provides to the 'ls -l' heuristic. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Clean up page array initialisation/freeTrond Myklebust1-10/+6
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Trace effects of the readdirplus heuristicTrond Myklebust1-1/+10
Enable tracking of when the readdirplus heuristic causes a page cache invalidation. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Trace effects of readdirplus on the dcacheTrond Myklebust1-0/+5
Trace the effects of readdirplus on attribute and dentry revalidation. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Add basic readdir tracingTrond Myklebust1-1/+12
Add tracing to track how often the client goes to the server for updated readdir information. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Don't request readdirplus when revalidation was forcedTrond Myklebust1-10/+16
If the revalidation was forced, due to the presence of a LOOKUP_EXCL or a LOOKUP_REVAL flag, then readdirplus won't help. It also can't help when we're doing a path component lookup. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Readdirplus can't help lookup for case insensitive filesystemsTrond Myklebust1-0/+2
If the filesystem is case insensitive, then readdirplus can't help with cache misses, since it won't return case folded variants of the filename. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Improve heuristic for readdirplusTrond Myklebust1-29/+51
The heuristic for readdirplus is designed to try to detect 'ls -l' and similar patterns. It does so by looking for cache hit/miss patterns in both the attribute cache and in the dcache of the files in a given directory, and then sets a flag for the readdirplus code to interpret. The problem with this approach is that a single attribute or dcache miss can cause the NFS code to force a refresh of the attributes for the entire set of files contained in the directory. To be able to make a more nuanced decision, let's sample the number of hits and misses in the set of open directory descriptors. That allows us to set thresholds at which we start preferring READDIRPLUS over regular READDIR, or at which we start to force a re-read of the remaining readdir cache using READDIRPLUS. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Reduce use of uncached readdirTrond Myklebust1-20/+3
When reading a very large directory, we want to try to keep the page cache up to date if doing so is inexpensive. With the change to allow readdir to continue reading even when the cache is incomplete, we no longer need to fall back to uncached readdir in order to scale to large directories. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Simplify nfs_readdir_xdr_to_array()Trond Myklebust1-18/+11
Recent changes to readdir mean that we can cope with partially filled page cache entries, so we no longer need to rely on looping in nfs_readdir_xdr_to_array(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: If the cookie verifier changes, we must invalidate the page cacheTrond Myklebust1-1/+6
Ensure that if the cookie verifier changes when we use the zero-valued cookie, then we invalidate any cached pages. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Adjust the amount of readahead performed by NFS readdirTrond Myklebust1-1/+52
The current NFS readdir code will always try to maximise the amount of readahead it performs on the assumption that we can cache anything that isn't immediately read by the process. There are several cases where this assumption breaks down, including when the 'ls -l' heuristic kicks in to try to force use of readdirplus as a batch replacement for lookup/getattr. This patch therefore tries to tone down the amount of readahead we perform, and adjust it to try to match the amount of data being requested by user space. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02NFS: Don't advance the page pointer unless the page is fullTrond Myklebust1-10/+22
When we hit the end of the data in the readdir page, we don't want to start filling a new page, unless this one is full. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>