summaryrefslogtreecommitdiff
path: root/fs/dcache.c
AgeCommit message (Collapse)AuthorFilesLines
2019-02-06fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb()Waiman Long1-5/+1
commit 1dbd449c9943e3145148cc893c2461b72ba6fef0 upstream. The nr_dentry_unused per-cpu counter tracks dentries in both the LRU lists and the shrink lists where the DCACHE_LRU_LIST bit is set. The shrink_dcache_sb() function moves dentries from the LRU list to a shrink list and subtracts the dentry count from nr_dentry_unused. This is incorrect as the nr_dentry_unused count will also be decremented in shrink_dentry_list() via d_shrink_del(). To fix this double decrement, the decrement in the shrink_dcache_sb() function is taken out. Fixes: 4e717f5c1083 ("list_lru: remove special case function list_lru_dispose_all." Cc: stable@kernel.org Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-17root dentries need RCU-delayed freeingAl Viro1-2/+4
commit 90bad5e05bcdb0308cfa3d3a60f5c0b9c8e2efb3 upstream. Since mountpoint crossing can happen without leaving lazy mode, root dentries do need the same protection against having their memory freed without RCU delay as everything else in the tree. It's partially hidden by RCU delay between detaching from the mount tree and dropping the vfsmount reference, but the starting point of pathwalk can be on an already detached mount, in which case umount-caused RCU delay has already passed by the time the lazy pathwalk grabs rcu_read_lock(). If the starting point happens to be at the root of that vfsmount *and* that vfsmount covers the entire filesystem, we get trouble. Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30do d_instantiate/unlock_new_inode combinations safelyAl Viro1-0/+22
commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream. For anything NFS-exported we do _not_ want to unlock new inode before it has grown an alias; original set of fixes got the ordering right, but missed the nasty complication in case of lockdep being enabled - unlock_new_inode() does lockdep_annotate_inode_mutex_key(inode) which can only be done before anyone gets a chance to touch ->i_mutex. Unfortunately, flipping the order and doing unlock_new_inode() before d_instantiate() opens a window when mkdir can race with open-by-fhandle on a guessed fhandle, leading to multiple aliases for a directory inode and all the breakage that follows from that. Correct solution: a new primitive (d_instantiate_new()) combining these two in the right order - lockdep annotate, then d_instantiate(), then the rest of unlock_new_inode(). All combinations of d_instantiate() with unlock_new_inode() should be converted to that. Cc: stable@kernel.org # 2.6.29 and later Tested-by: Mike Marshall <hubcap@omnibond.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22lock_parent() needs to recheck if dentry got __dentry_kill'ed under itAl Viro1-3/+8
commit 3b821409632ab778d46e807516b457dfa72736ed upstream. In case when dentry passed to lock_parent() is protected from freeing only by the fact that it's on a shrink list and trylock of parent fails, we could get hit by __dentry_kill() (and subsequent dentry_kill(parent)) between unlocking dentry and locking presumed parent. We need to recheck that dentry is alive once we lock both it and parent *and* postpone rcu_read_unlock() until after that point. Otherwise we could return a pointer to struct dentry that already is rcu-scheduled for freeing, with ->d_lock held on it; caller's subsequent attempt to unlock it can end up with memory corruption. Cc: stable@vger.kernel.org # 3.12+, counting backports Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11dentry name snapshotsAl Viro1-0/+27
commit 49d31c2f389acfe83417083e1208422b4091cd9e upstream. take_dentry_name_snapshot() takes a safe snapshot of dentry name; if the name is a short one, it gets copied into caller-supplied structure, otherwise an extra reference to external name is grabbed (those are never modified). In either case the pointer to stable string is stored into the same structure. dentry must be held by the caller of take_dentry_name_snapshot(), but may be freely dropped afterwards - the snapshot will stay until destroyed by release_dentry_name_snapshot(). Intended use: struct name_snapshot s; take_dentry_name_snapshot(&s, dentry); ... access s.name ... release_dentry_name_snapshot(&s); Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name to pass down with event. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21fs/dcache.c: fix spin lockup issue on nlru->lockSahitya Tummala1-2/+3
commit b17c070fb624cf10162cf92ea5e1ec25cd8ac176 upstream. __list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer duration if there are more number of items in the lru list. As per the current code, it can hold the spin lock for upto maximum UINT_MAX entries at a time. So if there are more number of items in the lru list, then "BUG: spinlock lockup suspected" is observed in the below path: spin_bug+0x90 do_raw_spin_lock+0xfc _raw_spin_lock+0x28 list_lru_add+0x28 dput+0x1c8 path_put+0x20 terminate_walk+0x3c path_lookupat+0x100 filename_lookup+0x6c user_path_at_empty+0x54 SyS_faccessat+0xd0 el0_svc_naked+0x24 This nlru->lock is acquired by another CPU in this path - d_lru_shrink_move+0x34 dentry_lru_isolate_shrink+0x48 __list_lru_walk_one.isra.10+0x94 list_lru_walk_node+0x40 shrink_dcache_sb+0x60 do_remount_sb+0xbc do_emergency_remount+0xb0 process_one_work+0x228 worker_thread+0x2e0 kthread+0xf4 ret_from_fork+0x10 Fix this lockup by reducing the number of entries to be shrinked from the lru list to 1024 at once. Also, add cond_resched() before processing the lru list again. Link: http://marc.info/?t=149722864900001&r=1&w=2 Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Alexander Polakov <apolyakov@beget.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-22fs/dcache.c: avoid soft-lockup in dput()Wei Fang1-2/+5
[ Upstream commit 47be61845c775643f1aa4d2a54343549f943c94c ] We triggered soft-lockup under stress test which open/access/write/close one file concurrently on more than five different CPUs: WARN: soft lockup - CPU#0 stuck for 11s! [who:30631] ... [<ffffffc0003986f8>] dput+0x100/0x298 [<ffffffc00038c2dc>] terminate_walk+0x4c/0x60 [<ffffffc00038f56c>] path_lookupat+0x5cc/0x7a8 [<ffffffc00038f780>] filename_lookup+0x38/0xf0 [<ffffffc000391180>] user_path_at_empty+0x78/0xd0 [<ffffffc0003911f4>] user_path_at+0x1c/0x28 [<ffffffc00037d4fc>] SyS_faccessat+0xb4/0x230 ->d_lock trylock may failed many times because of concurrently operations, and dput() may execute a long time. Fix this by replacing cpu_relax() with cond_resched(). dput() used to be sleepable, so make it sleepable again should be safe. Cc: <stable@vger.kernel.org> Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-08-22dcache: let the dentry count go down to zero without taking d_lockLinus Torvalds1-4/+114
[ Upstream commit 360f54796ed65939093ae373b92ebd5ef3341776 ] We can be more aggressive about this, if we are clever and careful. This is subtle. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2016-06-20fix d_walk()/non-delayed __d_free() raceAl Viro1-2/+2
[ Upstream commit 3d56c25e3bb0726a5c5e16fc2d9e38f8ed763085 ] Ascend-to-parent logics in d_walk() depends on all encountered child dentries not getting freed without an RCU delay. Unfortunately, in quite a few cases it is not true, with hard-to-hit oopsable race as the result. Fortunately, the fix is simiple; right now the rule is "if it ever been hashed, freeing must be delayed" and changing it to "if it ever had a parent, freeing must be delayed" closes that hole and covers all cases the old rule used to cover. Moreover, pipes and sockets remain _not_ covered, so we do not introduce RCU delay in the cases which are the reason for having that delay conditional in the first place. Cc: stable@vger.kernel.org # v3.2+ (and watch out for __d_materialise_dentry()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-10-28overlayfs: Make f_path always point to the overlay and f_inode to the underlayDavid Howells1-1/+4
[ Upstream commit 4bacc9c9234c7c8eec44f5ed4e960d9f96fa0f01 ] Make file->f_path always point to the overlay dentry so that the path in /proc/pid/fd is correct and to ensure that label-based LSMs have access to the overlay as well as the underlay (path-based LSMs probably don't need it). Using my union testsuite to set things up, before the patch I see: [root@andromeda union-testsuite]# bash 5</mnt/a/foo107 [root@andromeda union-testsuite]# ls -l /proc/$$/fd/ ... lr-x------. 1 root root 64 Jun 5 14:38 5 -> /a/foo107 [root@andromeda union-testsuite]# stat /mnt/a/foo107 ... Device: 23h/35d Inode: 13381 Links: 1 ... [root@andromeda union-testsuite]# stat -L /proc/$$/fd/5 ... Device: 23h/35d Inode: 13381 Links: 1 ... After the patch: [root@andromeda union-testsuite]# bash 5</mnt/a/foo107 [root@andromeda union-testsuite]# ls -l /proc/$$/fd/ ... lr-x------. 1 root root 64 Jun 5 14:22 5 -> /mnt/a/foo107 [root@andromeda union-testsuite]# stat /mnt/a/foo107 ... Device: 23h/35d Inode: 40346 Links: 1 ... [root@andromeda union-testsuite]# stat -L /proc/$$/fd/5 ... Device: 23h/35d Inode: 40346 Links: 1 ... Note the change in where /proc/$$/fd/5 points to in the ls command. It was pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107 (which is correct). The inode accessed, however, is the lower layer. The union layer is on device 25h/37d and the upper layer on 24h/36d. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-10-28dcache: Handle escaped paths in prepend_pathEric W. Biederman1-0/+7
[ Upstream commit cde93be45a8a90d8c264c776fab63487b5038a65 ] A rename can result in a dentry that by walking up d_parent will never reach it's mnt_root. For lack of a better term I call this an escaped path. prepend_path is called by four different functions __d_path, d_absolute_path, d_path, and getcwd. __d_path only wants to see paths are connected to the root it passes in. So __d_path needs prepend_path to return an error. d_absolute_path similarly wants to see paths that are connected to some root. Escaped paths are not connected to any mnt_root so d_absolute_path needs prepend_path to return an error greater than 1. So escaped paths will be treated like paths on lazily unmounted mounts. getcwd needs to prepend "(unreachable)" so getcwd also needs prepend_path to return an error. d_path is the interesting hold out. d_path just wants to print something, and does not care about the weird cases. Which raises the question what should be printed? Given that <escaped_path>/<anything> should result in -ENOENT I believe it is desirable for escaped paths to be printed as empty paths. As there are not really any meaninful path components when considered from the perspective of a mount tree. So tweak prepend_path to return an empty path with an new error code of 3 when it encounters an escaped path. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-08-27freeing unlinked file indefinitely delayedAl Viro1-0/+3
[ Upstream commit 75a6f82a0d10ef8f13cd8fe7212911a0252ab99e ] Normally opening a file, unlinking it and then closing will have the inode freed upon close() (provided that it's not otherwise busy and has no remaining links, of course). However, there's one case where that does *not* happen. Namely, if you open it by fhandle with cold dcache, then unlink() and close(). In normal case you get d_delete() in unlink(2) notice that dentry is busy and unhash it; on the final dput() it will be forcibly evicted from dcache, triggering iput() and inode removal. In this case, though, we end up with *two* dentries - disconnected (created by open-by-fhandle) and regular one (used by unlink()). The latter will have its reference to inode dropped just fine, but the former will not - it's considered hashed (it is on the ->s_anon list), so it will stay around until the memory pressure will finally do it in. As the result, we have the final iput() delayed indefinitely. It's trivial to reproduce - void flush_dcache(void) { system("mount -o remount,rw /"); } static char buf[20 * 1024 * 1024]; main() { int fd; union { struct file_handle f; char buf[MAX_HANDLE_SZ]; } x; int m; x.f.handle_bytes = sizeof(x); chdir("/root"); mkdir("foo", 0700); fd = open("foo/bar", O_CREAT | O_RDWR, 0600); close(fd); name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0); flush_dcache(); fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR); unlink("foo/bar"); write(fd, buf, sizeof(buf)); system("df ."); /* 20Mb eaten */ close(fd); system("df ."); /* should've freed those 20Mb */ flush_dcache(); system("df ."); /* should be the same as #2 */ } will spit out something like Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 283282 21692 93% / - inode gets freed only when dentry is finally evicted (here we trigger than by remount; normally it would've happened in response to memory pressure hell knows when). Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/ Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-07-13vfs: Remove incorrect debugging WARN in prepend_pathEric W. Biederman1-11/+0
[ Upstream commit 93e3bce6287e1fb3e60d3324ed08555b5bbafa89 ] The warning message in prepend_path is unclear and outdated. It was added as a warning that the mechanism for generating names of pseudo files had been removed from prepend_path and d_dname should be used instead. Unfortunately the warning reads like a general warning, making it unclear what to do with it. Remove the warning. The transition it was added to warn about is long over, and I added code several years ago which in rare cases causes the warning to fire on legitimate code, and the warning is now firing and scaring people for no good reason. Cc: stable@vger.kernel.org Reported-by: Ivan Delalande <colona@arista.com> Reported-by: Omar Sandoval <osandov@osandov.com> Fixes: f48cfddc6729e ("vfs: In d_path don't call d_dname on a mount point") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-06-09d_walk() might skip too muchAl Viro1-4/+4
[ Upstream commit 2159184ea01e4ae7d15f2017e296d4bc82d5aeb0 ] when we find that a child has died while we'd been trying to ascend, we should go into the first live sibling itself, rather than its sibling. Off-by-one in question had been introduced in "deal with deadlock in d_walk()" and the fix needs to be backported to all branches this one has been backported to. Cc: stable@vger.kernel.org # 3.2 and later Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-01-08dcache: fix kmemcheck warning in switch_namesMikulas Patocka1-0/+2
commit 08d4f7722268755ee34ed1c9e8afee7dfff022bb upstream. This patch fixes kmemcheck warning in switch_names. The function switch_names swaps inline names of two dentries. It swaps full arrays d_iname, no matter how many bytes are really used by the strings. Reading data beyond string ends results in kmemcheck warning. We fix the bug by marking both arrays as fully initialized. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-16deal with deadlock in d_walk()Al Viro1-15/+16
commit ca5358ef75fc69fee5322a38a340f5739d997c10 upstream. ... by not hitting rename_retry for reasons other than rename having happened. In other words, do _not_ restart when finding that between unlocking the child and locking the parent the former got into __dentry_kill(). Skip the killed siblings instead... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-16move d_rcu from overlapping d_child to overlapping d_aliasAl Viro1-27/+26
commit 946e51f2bf37f1656916eb75bd0742ba33983c28 upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-19vfs: fix reference leak in d_prune_aliases()Yan, Zheng1-0/+1
In "d_prune_alias(): just lock the parent and call __dentry_kill()" the old dget + d_drop + dput has been replaced with lock_parent + __dentry_kill; unfortunately, dput() does more than just killing dentry - it also drops the reference to parent. New variant leaks that reference and needs dput(parent) after killing the child off. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-24fix inode leaks on d_splice_alias() failure exitsAl Viro1-0/+2
d_splice_alias() callers expect it to either stash the inode reference into a new alias, or drop the inode reference. That makes it possible to just return d_splice_alias() result from ->lookup() instance, without any extra housekeeping required. Unfortunately, that should include the failure exits. If d_splice_alias() returns an error, it leaves the dentry it has been given negative and thus it *must* drop the inode reference. Easily fixed, but it goes way back and will need backporting. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-13take dname_external() into fs/dcache.cAl Viro1-0/+5
never used outside and it's too low-level for legitimate uses outside of fs/dcache.c anyway Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09dcache: Fix no spaces at the start of a line in dcache.cDaeseok Youn1-4/+4
Fixed coding style in dcache.c Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09dcache.c: call ->d_prune() regardless of d_unhashed()Al Viro1-1/+1
the only in-tree instance checks d_unhashed() anyway, out-of-tree code can preserve the current behaviour by adding such check if they want it and we get an ability to use it in cases where we *want* to be notified of killing being inevitable before ->d_lock is dropped, whether it's unhashed or not. In particular, autofs would benefit from that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09d_prune_alias(): just lock the parent and call __dentry_kill()Al Viro1-14/+7
The only reason for games with ->d_prune() was __d_drop(), which was needed only to force dput() into killing the sucker off. Note that lock_parent() can be called under ->i_lock and won't drop it, so dentry is safe from somebody managing to kill it under us - it won't happen while we are holding ->i_lock. __dentry_kill() is called only with ->d_lockref.count being 0 (here and when picked from shrink list) or 1 (dput() and dropping the ancestors in shrink_dentry_list()), so it will never be called twice - the first thing it's doing is making ->d_lockref.count negative and once that happens, nothing will increment it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09vfs: Make d_invalidate return voidEric W. Biederman1-12/+3
Now that d_invalidate can no longer fail, stop returning a useless return code. For the few callers that checked the return code update remove the handling of d_invalidate failure. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09vfs: Merge check_submounts_and_drop and d_invalidateEric W. Biederman1-33/+22
Now that d_invalidate is the only caller of check_submounts_and_drop, expand check_submounts_and_drop inline in d_invalidate. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09vfs: Lazily remove mounts on unlinked files and directories.Eric W. Biederman1-27/+33
With the introduction of mount namespaces and bind mounts it became possible to access files and directories that on some paths are mount points but are not mount points on other paths. It is very confusing when rm -rf somedir returns -EBUSY simply because somedir is mounted somewhere else. With the addition of user namespaces allowing unprivileged mounts this condition has gone from annoying to allowing a DOS attack on other users in the system. The possibility for mischief is removed by updating the vfs to support rename, unlink and rmdir on a dentry that is a mountpoint and by lazily unmounting mountpoints on deleted dentries. In particular this change allows rename, unlink and rmdir system calls on a dentry without a mountpoint in the current mount namespace to succeed, and it allows rename, unlink, and rmdir performed on a distributed filesystem to update the vfs cache even if when there is a mount in some namespace on the original dentry. There are two common patterns of maintaining mounts: Mounts on trusted paths with the parent directory of the mount point and all ancestory directories up to / owned by root and modifiable only by root (i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu, cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories maintained by fusermount. In the case of mounts in trusted directories owned by root and modifiable only by root the current parent directory permissions are sufficient to ensure a mount point on a trusted path is not removed or renamed by anyone other than root, even if there is a context where the there are no mount points to prevent this. In the case of mounts in directories owned by less privileged users races with users modifying the path of a mount point are already a danger. fusermount already uses a combination of chdir, /proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The removable of global rename, unlink, and rmdir protection really adds nothing new to consider only a widening of the attack window, and fusermount is already safe against unprivileged users modifying the directory simultaneously. In principle for perfect userspace programs returning -EBUSY for unlink, rmdir, and rename of dentires that have mounts in the local namespace is actually unnecessary. Unfortunately not all userspace programs are perfect so retaining -EBUSY for unlink, rmdir and rename of dentries that have mounts in the current mount namespace plays an important role of maintaining consistency with historical behavior and making imperfect userspace applications hard to exploit. v2: Remove spurious old_dentry. v3: Optimized shrink_submounts_and_drop Removed unsued afs label v4: Simplified the changes to check_submounts_and_drop Do not rename check_submounts_and_drop shrink_submounts_and_drop Document what why we need atomicity in check_submounts_and_drop Rely on the parent inode mutex to make d_revalidate and d_invalidate an atomic unit. v5: Refcount the mountpoint to detach in case of simultaneous renames. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09vfs: More precise tests in d_invalidateEric W. Biederman1-34/+4
The current comments in d_invalidate about what and why it is doing what it is doing are wildly off-base. Which is not surprising as the comments date back to last minute bug fix of the 2.2 kernel. The big fat lie of a comment said: If it's a directory, we can't drop it for fear of somebody re-populating it with children (even though dropping it would make it unreachable from that root, we still might repopulate it if it was a working directory or similar). [AV] What we really need to avoid is multiple dentry aliases of the same directory inode; on all filesystems that have ->d_revalidate() we either declare all positive dentries always valid (and thus never fed to d_invalidate()) or use d_materialise_unique() and/or d_splice_alias(), which take care of alias prevention. The current rules are: - To prevent mount point leaks dentries that are mount points or that have childrent that are mount points may not be be unhashed. - All dentries may be unhashed. - Directories may be rehashed with d_materialise_unique check_submounts_and_drop implements this already for well maintained remote filesystems so implement the current rules in d_invalidate by just calling check_submounts_and_drop. The one difference between d_invalidate and check_submounts_and_drop is that d_invalidate must respect it when a d_revalidate method has earlier called d_drop so preserve the d_unhashed check in d_invalidate. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09vfs: Document the effect of d_revalidate on d_find_aliasEric W. Biederman1-1/+2
d_drop or check_submounts_and_drop called from d_revalidate can result in renamed directories with child dentries being unhashed. These renamed and drop directory dentries can be rehashed after d_materialise_unique uses d_find_alias to find them. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09Allow sharing external names after __d_move()Al Viro1-16/+59
* external dentry names get a small structure prepended to them (struct external_name). * it contains an atomic refcount, matching the number of struct dentry instances that have ->d_name.name pointing to that external name. The first thing free_dentry() does is decrementing refcount of external name, so the instances that are between the call of free_dentry() and RCU-delayed actual freeing do not contribute. * __d_move(x, y, false) makes the name of x equal to the name of y, external or not. If y has an external name, extra reference is grabbed and put into x->d_name.name. If x used to have an external name, the reference to the old name is dropped and, should it reach zero, freeing is scheduled via kfree_rcu(). * free_dentry() in dentry with external name decrements the refcount of that name and, should it reach zero, does RCU-delayed call that will free both the dentry and external name. Otherwise it does what it used to do, except that __d_free() doesn't even look at ->d_name.name; it simply frees the dentry. All non-RCU accesses to dentry external name are safe wrt freeing since they all should happen before free_dentry() is called. RCU accesses might run into a dentry seen by free_dentry() or into an old name that got already dropped by __d_move(); however, in both cases dentry must have been alive and refer to that name at some point after we'd done rcu_read_lock(), which means that any freeing must be still pending. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-29missing data dependency barrier in prepend_name()Al Viro1-0/+5
AFAICS, prepend_name() is broken on SMP alpha. Disclaimer: I don't have SMP alpha boxen to reproduce it on. However, it really looks like the race is real. CPU1: d_path() on /mnt/ramfs/<255-character>/foo CPU2: mv /mnt/ramfs/<255-character> /mnt/ramfs/<63-character> CPU2 does d_alloc(), which allocates an external name, stores the name there including terminating NUL, does smp_wmb() and stores its address in dentry->d_name.name. It proceeds to d_add(dentry, NULL) and d_move() old dentry over to that. ->d_name.name value ends up in that dentry. In the meanwhile, CPU1 gets to prepend_name() for that dentry. It fetches ->d_name.name and ->d_name.len; the former ends up pointing to new name (64-byte kmalloc'ed array), the latter - 255 (length of the old name). Nothing to force the ordering there, and normally that would be OK, since we'd run into the terminating NUL and stop. Except that it's alpha, and we'd need a data dependency barrier to guarantee that we see that store of NUL __d_alloc() has done. In a similar situation dentry_cmp() would survive; it does explicit smp_read_barrier_depends() after fetching ->d_name.name. prepend_name() doesn't and it risks walking past the end of kmalloc'ed object and possibly oops due to taking a page fault in kernel mode. Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27vfs: Don't exchange "short" filenames unconditionally.Mikhail Efremov1-9/+18
Only exchange source and destination filenames if flags contain RENAME_EXCHANGE. In case if executable file was running and replaced by other file /proc/PID/exe should still show correct file name, not the old name of the file by which it was replaced. The scenario when this bug manifests itself was like this: * ALT Linux uses rpm and start-stop-daemon; * during a package upgrade rpm creates a temporary file for an executable to rename it upon successful unpacking; * start-stop-daemon is run subsequently and it obtains the (nonexistant) temporary filename via /proc/PID/exe thus failing to identify the running process. Note that "long" filenames (> DNAiME_INLINE_LEN) are still exchanged without RENAME_EXCHANGE and this behaviour exists long enough (should be fixed too apparently). So this patch is just an interim workaround that restores behavior for "short" names as it was before changes introduced by commit da1ce0670c14 ("vfs: add cross-rename"). See https://lkml.org/lkml/2014/9/7/6 for details. AV: the comments about being more careful with ->d_name.hash than with ->d_name.name are from back in 2.3.40s; they became obsolete by 2.3.60s, when we started to unhash the target instead of swapping hash chain positions followed by d_delete() as we used to do when dcache was first introduced. Acked-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: da1ce0670c14 "vfs: add cross-rename" Signed-off-by: Mikhail Efremov <sem@altlinux.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27fold swapping ->d_name.hash into switch_names()Linus Torvalds1-2/+1
and do it along with ->d_name.len there Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27fold unlocking the children into dentry_unlock_parents_for_move()Al Viro1-5/+4
... renaming it into dentry_unlock_for_move() and making it more symmetric with dentry_lock_for_move(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27kill __d_materialise_dentry()Al Viro1-44/+10
it folds into __d_move() now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27__d_materialise_dentry(): flip the order of argumentsAl Viro1-24/+20
... thus making it much closer to (now unreachable, BTW) IS_ROOT(dentry) case in __d_move(). A bit more and it'll fold in. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27__d_move(): fold manipulations with ->d_child/->d_subdirsAl Viro1-5/+3
list_del() + list_add() is a slightly pessimised list_move() list_del() + INIT_LIST_HEAD() is a slightly pessimised list_del_init() Interleaving those makes the resulting code even worse. And harder to follow... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27don't open-code d_rehash() in d_materialise_unique()Al Viro1-5/+1
... and get rid of duplicate BUG_ON() there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27pull rehashing and unlocking the target dentry into __d_materialise_dentry()Al Viro1-7/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-15Merge branch 'for-linus' of ↵Linus Torvalds1-2/+6
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "double iput() on failure exit in lustre, racy removal of spliced dentries from ->s_anon in __d_materialise_dentry() plus a bunch of assorted RCU pathwalk fixes" The RCU pathwalk fixes end up fixing a couple of cases where we incorrectly dropped out of RCU walking, due to incorrect initialization and testing of the sequence locks in some corner cases. Since dropping out of RCU walk mode forces the slow locked accesses, those corner cases slowed down quite dramatically. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: be careful with nd->inode in path_init() and follow_dotdot_rcu() don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu() fix bogus read_seqretry() checks introduced in b37199e move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon) [fix] lustre: d_make_root() does iput() on dentry allocation failure
2014-09-14move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon)Al Viro1-2/+6
and lock the right list there Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-13vfs: fix bad hashing of dentriesLinus Torvalds1-2/+1
Josef Bacik found a performance regression between 3.2 and 3.10 and narrowed it down to commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for dcache name comparison and hashing"). He reports: "The test case is essentially for (i = 0; i < 1000000; i++) mkdir("a$i"); On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k dir/sec with 3.10. This is because we spend waaaaay more time in __d_lookup on 3.10 than in 3.2. The new hashing function for strings is suboptimal for < sizeof(unsigned long) string names (and hell even > sizeof(unsigned long) string names that I've tested). I broke out the old hashing function and the new one into a userspace helper to get real numbers and this is what I'm getting: Old hash table had 1000000 entries, 0 dupes, 0 max dupes New hash table had 12628 entries, 987372 dupes, 900 max dupes We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash My test does the hash, and then does the d_hash into a integer pointer array the same size as the dentry hash table on my system, and then just increments the value at the address we got to see how many entries we overlap with. As you can see the old hash function ended up with all 1 million entries in their own bucket, whereas the new one they are only distributed among ~12.5k buckets, which is why we're using so much more CPU in __d_lookup". The reason for this hash regression is two-fold: - On 64-bit architectures the down-mixing of the original 64-bit word-at-a-time hash into the final 32-bit hash value is very simplistic and suboptimal, and just adds the two 32-bit parts together. In particular, because there is no bit shuffling and the mixing boundary is also a byte boundary, similar character patterns in the low and high word easily end up just canceling each other out. - the old byte-at-a-time hash mixed each byte into the final hash as it hashed the path component name, resulting in the low bits of the hash generally being a good source of hash data. That is not true for the word-at-a-time case, and the hash data is distributed among all the bits. The fix is the same in both cases: do a better job of mixing the bits up and using as much of the hash data as possible. We already have the "hash_32|64()" functions to do that. Reported-by: Josef Bacik <jbacik@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-07fs: mark __d_obtain_alias staticFengguang Wu1-1/+1
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07dcache: d_splice_alias should detect loopsJ. Bruce Fields1-0/+5
I believe this can only happen in the case of a corrupted filesystem. So -EIO looks like the appropriate error. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTEDJ. Bruce Fields1-6/+3
If we get to this point and discover the dentry is not a root dentry, or not DCACHE_DISCONNECTED--great, we always prefer that anyway. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07dcache: remove unused d_find_alias parameterJ. Bruce Fields1-8/+5
Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07dcache: d_obtain_alias callers don't all want DISCONNECTEDJ. Bruce Fields1-20/+49
There are a few d_obtain_alias callers that are using it to get the root of a filesystem which may already have an alias somewhere else. This is not the same as the filehandle-lookup case, and none of them actually need DCACHE_DISCONNECTED set. It isn't really a serious problem, but it would really be clearer if we reserved DCACHE_DISCONNECTED for those cases where it's actually needed. In the btrfs case this was causing a spurious printk from nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED dentry. Josef worked around this by unsetting DCACHE_DISCONNECTED manually in 3a0dfa6a12e "Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol", and this replaces that workaround. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07dcache: d_splice_alias should ignore DCACHE_DISCONNECTEDJ. Bruce Fields1-4/+4
Any IS_ROOT() alias should be safe to use; there's nothing special about DCACHE_DISCONNECTED dentries. Note that this is in fact useful for filesystems such as btrfs which can legimately encounter a directory with a preexisting IS_ROOT alias on a lookup that crosses into a subvolume. (Those aliases are currently marked DCACHE_DISCONNECTED--but not really for any good reason, and we'll change that soon.) Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07dcache: d_splice_alias mustn't create directory aliasesJ. Bruce Fields1-2/+9
Currently if d_splice_alias finds a directory with an alias that is not IS_ROOT or not DCACHE_DISCONNECTED, it creates a duplicate directory. Duplicate directory dentries are unacceptable; it is better just to error out. (In the case of a local filesystem the most likely case is filesystem corruption: for example, perhaps two directories point to the same child directory, and the other parent has already been found and cached.) Note that distributed filesystems may encounter this case in normal operation if a remote host moves a directory to a location different from the one we last cached in the dcache. For that reason, such filesystems should instead use d_materialise_unique, which tries to move the old directory alias to the right place instead of erroring out. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07dcache: close d_move race in d_splice_aliasJ. Bruce Fields1-1/+6
d_splice_alias will d_move an IS_ROOT() directory dentry into place if one exists. This should be safe as long as the dentry remains IS_ROOT, but I can't see what guarantees that: once we drop the i_lock all we hold here is the i_mutex on an unrelated parent directory. Instead copy the logic of d_materialise_unique. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07dcache: move d_splice_aliasJ. Bruce Fields1-52/+52
Just a trivial move to locate it near (similar) d_materialise_unique code and save some forward references in a following patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>