summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2017-02-09sunrpc/nfs: cleanup procfs/pipefs entry in cache_detailKinglong Mee1-2/+1
Record flush/channel/content entries is useless, remove them. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09nfs: no PG_private waiters remain, remove wakerNicholas Piggin1-2/+0
Since commit 4f52b6bb ("NFS: Don't call COMMIT in ->releasepage()"), no tasks wait on PagePrivate, so the wake introduced in commit 95905446 ("NFS: avoid deadlocks with loop-back mounted NFS filesystems.") can be removed. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09NFS: nfs_rename() handle -ERESTARTSYS dentry left behindBenjamin Coddington1-11/+25
An interrupted rename will leave the old dentry behind if the rename succeeds. Fix this by moving the final local work of the rename to rpc_call_done so that the results of the RENAME can always be handled, even if the original process has already returned with -ERESTARTSYS. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFSv4: Fix warning for using 0 as NULLWei Yongjun1-1/+1
Fixes the following sparse warning: fs/nfs/nfs4state.c:862:60: warning: Using plain integer as NULL pointer Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30pNFS/flexfiles: Make local symbol layoutreturn_ops staticWei Yongjun1-1/+1
Fixes the following sparse warning: fs/nfs/flexfilelayout/flexfilelayout.c:2114:34: warning: symbol 'layoutreturn_ops' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Return the comparison result directly in nfs41_match_stateid()Anna Schumaker1-3/+1
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Clean up nfs41_same_server_scope()Anna Schumaker1-5/+3
The function is cleaner this way, since we can use the result of memcmp() directly Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: No need to set and return status in nfs41_lock_expired()Anna Schumaker1-2/+1
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Remove unnecessary goto in nfs4_lookup_root_sec()Anna Schumaker1-8/+3
Once again, it's easier and cleaner just to return the error directly. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Remove nfs4_recover_expired_lease()Anna Schumaker1-6/+1
This function doesn't add much, since all it does is access the server's nfs_client variable. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Remove an extra if in _nfs4_recover_proc_open()Anna Schumaker1-4/+1
It's simpler just to return the status unconditionally Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Return errors directly in _nfs4_opendata_reclaim_to_nfs4_state()Anna Schumaker1-8/+3
There is no need for a goto just to return an error code without any cleanup. Returning the error directly helps to clean up the code. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Remove nfs4_wait_for_completion_rpc_task()Anna Schumaker1-15/+7
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Clean up _nfs4_is_integrity_protected()Anna Schumaker1-6/+1
We can cut out the if statement and return the results of the comparison directly. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Fix inconsistent indentation in nfs4proc.cAnna Schumaker1-28/+28
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Make trace_nfs4_setup_sequence() available to NFS v4.0Anna Schumaker3-35/+35
This tracepoint displays information about the slot that was chosen for the RPC, in addition to session information. This could be useful information for debugging, and we can set the session id hash to 0 to indicate that there is no session. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Merge the remaining setup_sequence functionsAnna Schumaker1-67/+20
This creates a single place for all the work to happen, using the presence of a session to determine if extra values need to be set. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Check if the slot table is draining from nfs4_setup_sequence()Anna Schumaker1-14/+9
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Handle setup sequence task rescheduling in a single placeAnna Schumaker1-22/+15
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Lock the slot table from a single place during setup sequenceAnna Schumaker1-13/+12
Rather than implementing this twice for NFS v4.0 and v4.1 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Move slot-already-allocated check into nfs_setup_sequence()Anna Schumaker1-10/+9
This puts the check in a single place, rather than needing to implement it twice for v4.0 and v4.1. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Create a single nfs4_setup_sequence() functionAnna Schumaker1-47/+29
The inline ifdef lets us put everything in a single place, rather than having two (very similar) versions of this function. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Use nfs4_setup_sequence() everywhereAnna Schumaker5-52/+33
This does the right thing depending on if we have a session, rather than needing to handle this manually in multiple places. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Change nfs4_setup_sequence() to take an nfs_client structureAnna Schumaker1-16/+15
I want to have all callers use this function, rather than calling the NFS v4.0 and v4.1 versions directly. This includes pNFS, which only has access to the nfs_client structure in some places. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Change nfs4_get_session() to take an nfs_client structureAnna Schumaker3-10/+9
pNFS only has access to the nfs_client structure, and not the nfs_server, so we need to make this change so the function can be used by pNFS as well. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: Move nfs4_get_session() into nfs4_session.hAnna Schumaker3-10/+6
This puts session related functions together in the same space. I only keep one version of this function, since this variable will always be NULL when using NFS v4.0. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30NFS: tidy up nfs_show_mountd_netidNeilBrown1-14/+7
This function is a bit clumsy, incorrectly producing ",mountproto=" if mountd_protocol is 0 and !showdefaults, and duplicating the code for reporting "auto". Tidy it up so that it only makes a single seq_printf() call, and more obviously does the right thing. Fixes: ee671b016fbf ("NFS: convert proto= option to use netids rather than a protoname") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30sunrpc & nfs: Add and use dprintk_cont macrosJoe Perches1-3/+3
Allow line continuations to work properly with KERN_CONT. Signed-off-by: Joe Perches <joe@perches.com> [Anna: Add fallback dprintk_cont() for when CONFIG_SUNRPC_DEBUG=n] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-28Merge tag 'nfs-for-4.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds3-2/+5
Pull NFS client bugfixes from Trond Myklebust: "Stable patches: - NFSv4.1: Fix a deadlock in layoutget - NFSv4 must not bump sequence ids on NFS4ERR_MOVED errors - NFSv4 Fix a regression with OPEN EXCLUSIVE4 mode - Fix a memory leak when removing the SUNRPC module Bugfixes: - Fix a reference leak in _pnfs_return_layout" * tag 'nfs-for-4.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: pNFS: Fix a reference leak in _pnfs_return_layout nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED" SUNRPC: cleanup ida information when removing sunrpc module NFSv4.0: always send mode in SETATTR after EXCLUSIVE4 nfs: Don't increment lock sequence ID after NFS4ERR_MOVED NFSv4.1: Fix a deadlock in layoutget
2017-01-27Merge tag 'xfs-for-linus-4.10-rc6-5' of ↵Linus Torvalds13-63/+220
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs uodates from Darrick Wong: "I have some more fixes this week: better input validation, corruption avoidance, build fixes, memory leak fixes, and a couple from Christoph to avoid an ENOSPC failure. Summary: - Fix race conditions in the CoW code - Fix some incorrect input validation checks - Avoid crashing fs by running out of space when freeing inodes - Fix toctou race wrt whether or not an inode has an attr - Fix build error on arm - Fix page refcount corruption when readahead fails - Don't corrupt userspace in the bmap ioctl" * tag 'xfs-for-linus-4.10-rc6-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: prevent quotacheck from overloading inode lru xfs: fix bmv_count confusion w/ shared extents xfs: clear _XBF_PAGES from buffers when readahead page xfs: extsize hints are not unlikely in xfs_bmap_btalloc xfs: remove racy hasattr check from attr ops xfs: use per-AG reservations for the finobt xfs: only update mount/resv fields on success in __xfs_ag_resv_init xfs: verify dirblocklog correctly xfs: fix COW writeback race
2017-01-27Merge branch 'for-linus-4.10' of ↵Linus Torvalds1-9/+17
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "Some fixes that we've collected from the list. We still have one more pending to nail down a regression in lzo compression, but I wanted to get this batch out the door" * 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: remove ->{get, set}_acl() from btrfs_dir_ro_inode_operations Btrfs: disable xattr operations on subvolume directories Btrfs: remove old tree_root case in btrfs_read_locked_inode() Btrfs: fix truncate down when no_holes feature is enabled Btrfs: Fix deadlock between direct IO and fast fsync btrfs: fix false enospc error when truncating heavily reflinked file
2017-01-27Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-3/+3
Pull block fixes from Jens Axboe: "A set of fixes for this series. This contains: - Set of fixes for the nvme target code - A revert of patch from this merge window, causing a regression with WRITE_SAME on iSCSI targets at least. - A fix for a use-after-free in the new O_DIRECT bdev code. - Two fixes for the xen-blkfront driver" * 'for-linus' of git://git.kernel.dk/linux-block: Revert "sd: remove __data_len hack for WRITE SAME" nvme-fc: use blk_rq_nr_phys_segments nvmet-rdma: Fix missing dma sync to nvme data structures nvmet: Call fatal_error from keep-alive timout expiration nvmet: cancel fatal error and flush async work before free controller nvmet: delete controllers deletion upon subsystem release nvmet_fc: correct logic in disconnect queue LS handling block: fix use after free in __blkdev_direct_IO xen-blkfront: correct maximum segment accounting xen-blkfront: feature flags handling adjustments
2017-01-27xfs: prevent quotacheck from overloading inode lruBrian Foster1-1/+2
Quotacheck runs at mount time in situations where quota accounting must be recalculated. In doing so, it uses bulkstat to visit every inode in the filesystem. Historically, every inode processed during quotacheck was released and immediately tagged for reclaim because quotacheck runs before the superblock is marked active by the VFS. In other words, the final iput() lead to an immediate ->destroy_inode() call, which allowed the XFS background reclaim worker to start reclaiming inodes. Commit 17c12bcd3 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped") marks the XFS superblock active sooner as part of the mount process to support caching inodes processed during log recovery. This occurs before quotacheck and thus means all inodes processed by quotacheck are inserted to the LRU on release. The s_umount lock is held until the mount has completed and thus prevents the shrinkers from operating on the sb. This means that quotacheck can excessively populate the inode LRU and lead to OOM conditions on systems without sufficient RAM. Update the quotacheck bulkstat handler to set XFS_IGET_DONTCACHE on inodes processed by quotacheck. This causes ->drop_inode() to return 1 and in turn causes iput_final() to evict the inode. This preserves the original quotacheck behavior and prevents it from overloading the LRU and running out of memory. CC: stable@vger.kernel.org # v4.9 Reported-by: Martin Svec <martin.svec@zoner.cz> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-27Btrfs: remove ->{get, set}_acl() from btrfs_dir_ro_inode_operationsOmar Sandoval1-2/+0
Subvolume directory inodes can't have ACLs. Cc: <stable@vger.kernel.org> # 4.9.x Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2017-01-27Btrfs: disable xattr operations on subvolume directoriesOmar Sandoval1-0/+1
When you snapshot a subvolume containing a subvolume, you get a placeholder directory where the subvolume would be. These directory inodes have ->i_ops set to btrfs_dir_ro_inode_operations. Previously, these i_ops didn't include the xattr operation callbacks. The conversion to xattr_handlers missed this case, leading to bogus attempts to set xattrs on these inodes. This manifested itself as failures when running delayed inodes. To fix this, clear IOP_XATTR in ->i_opflags on these inodes. Fixes: 6c6ef9f26e59 ("xattr: Stop calling {get,set,remove}xattr inode operations") Cc: Andreas Gruenbacher <agruenba@redhat.com> Reported-by: Chris Murphy <lists@colorremedies.com> Tested-by: Chris Murphy <lists@colorremedies.com> Cc: <stable@vger.kernel.org> # 4.9.x Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2017-01-27Btrfs: remove old tree_root case in btrfs_read_locked_inode()Omar Sandoval1-4/+1
As Jeff explained in c2951f32d36c ("btrfs: remove old tree_root dirent processing in btrfs_real_readdir()"), supporting this old format is no longer necessary since the Btrfs magic number has been updated since we changed to the current format. There are other places where we still handle this old format, but since this is part of a fix that is going to stable, I'm only removing this one for now. Cc: <stable@vger.kernel.org> # 4.9.x Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2017-01-26pNFS: Fix a reference leak in _pnfs_return_layoutTrond Myklebust1-1/+1
IF NFS_LAYOUT_RETURN_REQUESTED is not set, then we currently exit without freeing the list of invalidated layout segments, leading to a reference leak. Reported-by: Olga Kornievskaia <aglo@umich.edu> Fixes: 24408f5282 ("pNFS: Fix bugs in _pnfs_return_layout") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-26nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED"Chuck Lever1-0/+1
Lock sequence IDs are bumped in decode_lock by calling nfs_increment_seqid(). nfs_increment_sequid() does not use the seqid_mutating_err() function fixed in commit 059aa7348241 ("Don't increment lock sequence ID after NFS4ERR_MOVED"). Fixes: 059aa7348241 ("Don't increment lock sequence ID after ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Xuan Qi <xuan.qi@oracle.com> Cc: stable@vger.kernel.org # v3.7+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-26xfs: fix bmv_count confusion w/ shared extentsDarrick J. Wong1-10/+18
In a bmapx call, bmv_count is the total size of the array, including the zeroth element that userspace uses to supply the search key. The output array starts at offset 1 so that we can set up the user for the next invocation. Since we now can split an extent into multiple bmap records due to shared/unshared status, we have to be careful that we don't overflow the output array. In the original patch f86f403794b ("xfs: teach get_bmapx about shared extents and the CoW fork") I used cur_ext (the output index) to check for overflows, albeit with an off-by-one error. Since nexleft no longer describes the number of unfilled slots in the output, we can rip all that out and use cur_ext for the overflow check directly. Failure to do this causes heap corruption in bmapx callers such as xfs_io and xfs_scrub. xfs/328 can reproduce this problem. Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-26xfs: clear _XBF_PAGES from buffers when readahead pageDarrick J. Wong1-0/+1
If we try to allocate memory pages to back an xfs_buf that we're trying to read, it's possible that we'll be so short on memory that the page allocation fails. For a blocking read we'll just wait, but for readahead we simply dump all the pages we've collected so far. Unfortunately, after dumping the pages we neglect to clear the _XBF_PAGES state, which means that the subsequent call to xfs_buf_free thinks that b_pages still points to pages we own. It then double-frees the b_pages pages. This results in screaming about negative page refcounts from the memory manager, which xfs oughtn't be triggering. To reproduce this case, mount a filesystem where the size of the inodes far outweighs the availalble memory (a ~500M inode filesystem on a VM with 300MB memory did the trick here) and run bulkstat in parallel with other memory eating processes to put a huge load on the system. The "check summary" phase of xfs_scrub also works for this purpose. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2017-01-25xfs: extsize hints are not unlikely in xfs_bmap_btallocChristoph Hellwig1-2/+2
With COW files they are the hotpath, just like for files with the extent size hint attribute. We really shouldn't micro-manage anything but failure cases with unlikely. Additionally Arnd Bergmann recently reported that one of these two unlikely annotations causes link failures together with an upcoming kernel instrumentation patch, so let's get rid of it ASAP. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-25xfs: remove racy hasattr check from attr opsBrian Foster1-6/+0
xfs_attr_[get|remove]() have unlocked attribute fork checks to optimize away a lock cycle in cases where the fork does not exist or is otherwise empty. This check is not safe, however, because an attribute fork short form to extent format conversion includes a transient state that causes the xfs_inode_hasattr() check to fail. Specifically, xfs_attr_shortform_to_leaf() creates an empty extent format attribute fork and then adds the existing shortform attributes to it. This means that lookup of an existing xattr can spuriously return -ENOATTR when racing against a setxattr that causes the associated format conversion. This was originally reproduced by an untar on a particularly configured glusterfs volume, but can also be reproduced on demand with properly crafted xattr requests. The format conversion occurs under the exclusive ilock. xfs_attr_get() and xfs_attr_remove() already have the proper locking and checks further down in the functions to handle this situation correctly. Drop the unlocked checks to avoid the spurious failure and rely on the existing logic. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-25xfs: use per-AG reservations for the finobtChristoph Hellwig5-20/+144
Currently we try to rely on the global reserved block pool for block allocations for the free inode btree, but I have customer reports (fairly complex workload, need to find an easier reproducer) where that is not enough as the AG where we free an inode that requires a new finobt block is entirely full. This causes us to cancel a dirty transaction and thus a file system shutdown. I think the right way to guard against this is to treat the finot the same way as the refcount btree and have a per-AG reservations for the possible worst case size of it, and the patch below implements that. Note that this could increase mount times with large finobt trees. In an ideal world we would have added a field for the number of finobt fields to the AGI, similar to what we did for the refcount blocks. We should do add it next time we rev the AGI or AGF format by adding new fields. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-25xfs: only update mount/resv fields on success in __xfs_ag_resv_initChristoph Hellwig1-9/+14
Try to reserve the blocks first and only then update the fields in or hanging off the mount structure. This way we can call __xfs_ag_resv_init again after a previous failure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-25romfs: use different way to generate fsid for BLOCK or MTDColy Li1-1/+22
Commit 8a59f5d25265 ("fs/romfs: return f_fsid for statfs(2)") generates a 64bit id from sb->s_bdev->bd_dev. This is only correct when romfs is defined with CONFIG_ROMFS_ON_BLOCK. If romfs is only defined with CONFIG_ROMFS_ON_MTD, sb->s_bdev is NULL, referencing sb->s_bdev->bd_dev will triger an oops. Richard Weinberger points out that when CONFIG_ROMFS_BACKED_BY_BOTH=y, both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD are defined. Therefore when calling huge_encode_dev() to generate a 64bit id, I use the follow order to choose parameter, - CONFIG_ROMFS_ON_BLOCK defined use sb->s_bdev->bd_dev - CONFIG_ROMFS_ON_BLOCK undefined and CONFIG_ROMFS_ON_MTD defined use sb->s_dev when, - both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD undefined leave id as 0 When CONFIG_ROMFS_ON_MTD is defined and sb->s_mtd is not NULL, sb->s_dev is set to a device ID generated by MTD_BLOCK_MAJOR and mtd index, otherwise sb->s_dev is 0. This is a try-best effort to generate a uniq file system ID, if all the above conditions are not meet, f_fsid of this romfs instance will be 0. Generally only one romfs can be built on single MTD block device, this method is enough to identify multiple romfs instances in a computer. Link: http://lkml.kernel.org/r/1482928596-115155-1-git-send-email-colyli@suse.de Signed-off-by: Coly Li <colyli@suse.de> Reported-by: Nong Li <nongli1031@gmail.com> Tested-by: Nong Li <nongli1031@gmail.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-25proc: add a schedule point in proc_pid_readdir()Eric Dumazet1-0/+2
We have seen proc_pid_readdir() invocations holding cpu for more than 50 ms. Add a cond_resched() to be gentle with other tasks. [akpm@linux-foundation.org: coding style fix] Link: http://lkml.kernel.org/r/1484238380.15816.42.camel@edumazet-glaptop3.roam.corp.google.com Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-25userfaultfd: fix SIGBUS resulting from false rwsem wakeupsAndrea Arcangeli1-2/+35
With >=32 CPUs the userfaultfd selftest triggered a graceful but unexpected SIGBUS because VM_FAULT_RETRY was returned by handle_userfault() despite the UFFDIO_COPY wasn't completed. This seems caused by rwsem waking the thread blocked in handle_userfault() and we can't run up_read() before the wait_event sequence is complete. Keeping the wait_even sequence identical to the first one, would require running userfaultfd_must_wait() again to know if the loop should be repeated, and it would also require retaking the rwsem and revalidating the whole vma status. It seems simpler to wait the targeted wakeup so that if false wakeups materialize we still wait for our specific wakeup event, unless of course there are signals or the uffd was released. Debug code collecting the stack trace of the wakeup showed this: $ ./userfaultfd 100 99999 nr_pages: 25600, nr_pages_per_cpu: 800 bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2 bounces: 99997, mode: rnd ver poll, Bus error (core dumped) save_stack_trace+0x2b/0x50 try_to_wake_up+0x2a6/0x580 wake_up_q+0x32/0x70 rwsem_wake+0xe0/0x120 call_rwsem_wake+0x1b/0x30 up_write+0x3b/0x40 vm_mmap_pgoff+0x9c/0xc0 SyS_mmap_pgoff+0x1a9/0x240 SyS_mmap+0x22/0x30 entry_SYSCALL_64_fastpath+0x1f/0xbd 0xffffffffffffffff FAULT_FLAG_ALLOW_RETRY missing 70 CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0xb8/0x112 handle_userfault+0x572/0x650 handle_mm_fault+0x12cb/0x1520 __do_page_fault+0x175/0x500 trace_do_page_fault+0x61/0x270 do_async_page_fault+0x19/0x90 async_page_fault+0x25/0x30 This always happens when the main userfault selftest thread is running clone() while glibc runs either mprotect or mmap (both taking mmap_sem down_write()) to allocate the thread stack of the background threads, while locking/userfault threads already run at full throttle and are susceptible to false wakeups that may cause handle_userfault() to return before than expected (which results in graceful SIGBUS at the next attempt). This was reproduced only with >=32 CPUs because the loop to start the thread where clone() is too quick with fewer CPUs, while with 32 CPUs there's already significant activity on ~32 locking and userfault threads when the last background threads are started with clone(). This >=32 CPUs SMP race condition is likely reproducible only with the selftest because of the much heavier userfault load it generates if compared to real apps. We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a patch floating around that provides it also hidden this problem but in reality only is successfully at hiding the problem. False wakeups could still happen again the second time handle_userfault() is invoked, even if it's a so rare race condition that getting false wakeups twice in a row is impossible to reproduce. This full fix is needed for correctness, the only alternative would be to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP support can stick to a strict "one more" VM_FAULT_RETRY logic (no need of returning it infinite times to avoid the SIGBUS). Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-25dax: fix build warnings with FS_DAX and !FS_IOMAPRoss Zwisler4-4/+1
As reported by Arnd: https://lkml.org/lkml/2017/1/10/756 Compiling with the following configuration: # CONFIG_EXT2_FS is not set # CONFIG_EXT4_FS is not set # CONFIG_XFS_FS is not set # CONFIG_FS_IOMAP depends on the above filesystems, as is not set CONFIG_FS_DAX=y generates build warnings about unused functions in fs/dax.c: fs/dax.c:878:12: warning: `dax_insert_mapping' defined but not used [-Wunused-function] static int dax_insert_mapping(struct address_space *mapping, ^~~~~~~~~~~~~~~~~~ fs/dax.c:572:12: warning: `copy_user_dax' defined but not used [-Wunused-function] static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size, ^~~~~~~~~~~~~ fs/dax.c:542:12: warning: `dax_load_hole' defined but not used [-Wunused-function] static int dax_load_hole(struct address_space *mapping, void **entry, ^~~~~~~~~~~~~ fs/dax.c:312:14: warning: `grab_mapping_entry' defined but not used [-Wunused-function] static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index, ^~~~~~~~~~~~~~~~~~ Now that the struct buffer_head based DAX fault paths and I/O path have been removed we really depend on iomap support being present for DAX. Make this explicit by selecting FS_IOMAP if we compile in DAX support. This allows us to remove conditional selections of FS_IOMAP when FS_DAX was present for ext2 and ext4, and to remove an #ifdef in fs/dax.c. Link: http://lkml.kernel.org/r/1484087383-29478-1-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24xfs: verify dirblocklog correctlyDarrick J. Wong1-1/+1
sb_dirblklog is added to sb_blocklog to compute the directory block size in bytes. Therefore, we must compare the sum of both those values against XFS_MAX_BLOCKSIZE_LOG, not just dirblklog. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-01-24NFSv4.0: always send mode in SETATTR after EXCLUSIVE4Benjamin Coddington1-1/+2
Some nfsv4.0 servers may return a mode for the verifier following an open with EXCLUSIVE4 createmode, but this does not mean the client should skip setting the mode in the following SETATTR. It should only do that for EXCLUSIVE4_1 or UNGAURDED createmode. Fixes: 5334c5bdac92 ("NFS: Send attributes in OPEN request for NFS4_CREATE_EXCLUSIVE4_1") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Cc: stable@vger.kernel.org # v4.3+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>