summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2017-02-27f2fs: use MAX_FREE_NIDS for the free nids targetKinglong Mee1-5/+3
F2FS has define MAX_FREE_NIDS for maximum of cached free nids target. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: introduce free nid bitmapChao Yu3-10/+119
In scenario of intensively node allocation, free nids will be ran out soon, then it needs to stop to load free nids by traversing NAT blocks, in worse case, if NAT blocks does not be cached in memory, it generates IOs which slows down our foreground operations. In order to speed up node allocation, in this patch we introduce a new free_nid_bitmap array, so there is an bitmap table for each NAT block, Once the NAT block is loaded, related bitmap cache will be switched on, and bitmap will be set during traversing nat entries in NAT block, later we can query and update nid usage status in memory completely. With such implementation, I expect performance of node allocation can be improved in the long-term after filesystem image is mounted. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: new helper cur_cp_crc() getting crc in f2fs_checkpointKinglong Mee4-19/+15
There are four places that getting the crc value in f2fs_checkpoint, just add a new helper cur_cp_crc for them. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: update the comment of default nr_pages to skippingKinglong Mee1-2/+2
Fixes: 2c237ebaa4 ("f2fs: avoid writing node/metapages during writes") Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: drop the duplicate pval in f2fs_getxattrKinglong Mee1-3/+0
Fixes: ba38c27eb9 ("f2fs: enhance lookup xattr") Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: Don't update the xattr data that same as the existKinglong Mee1-4/+16
f2fs removes the old xattr data and appends the new data although the new data is same as the exist. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: kill __is_extent_sameChao Yu2-12/+3
Since commit ee6d182f2a19 ("f2fs: remove syncing inode page in all the cases") delayed inode element updating from inode cache to node page cache, so once largest cached extent is updated, we can make inode dirty immediately instead of checking and updating it in the end of extent cache update. The above commit didn't clean up unneeded codes in extent_cache.c, let's finish the job in this patch. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: avoid bggc->fggc when enough free segments are avaliable after cpHou Pengyang1-8/+9
We use has_not_enough_free_secs to check if there are enough free segments, (free_sections(sbi) + freed) <= (node_secs + 2 * dent_secs + imeta_secs + reserved_sections(sbi) + needed); Under scenario with large number of dirty nodes, these nodes would be flushed during cp, as a result, right side of the inequality would be decreased, while left side stays unchanged if these nodes are flushed in SSR way, which means there are enough free segments after this cp. For this case, we just do a bggc instead of fggc. Signed-off-by: Hou Pengyang <houpengyang@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: select target segment with closer temperature in SSR modeChao Yu1-6/+17
In SSR mode, we can allocate target segment which has different temperature type from the type of current block, in order to avoid mixing coldest and hottest data/node as much as possible, change SSR allocation policy to select closer temperature for current block prior. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: show simple call stack in fault injection messageChao Yu8-13/+32
Previously kernel message can show that in which function we do the injection, but unfortunately, most of the caller are the same, for tracking more information of injection path, it needs to show upper caller's name. This patch supports that ability. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: no need lock_op in f2fs_write_inline_dataYunlei He1-2/+5
Similar as f2fs_write_inode, f2fs_write_inline_data just mark inode page dirty, so it's no need to write inline data under read lock of cp_rwsem. Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: add bitmaps for empty or full NAT blocksJaegeuk Kim5-20/+230
This patches adds bitmaps to represent empty or full NAT blocks containing free nid entries. If we can find valid crc|cp_ver in the last block of checkpoint pack, we'll use these bitmaps when building free nids. In order to avoid checkpointing burden, up-to-date bitmaps will be flushed only during umount time. So, normally we can get this gain, but when power-cut happens, we rely on fsck.f2fs which recovers this bitmap again. After this patch, we build free nids from nid #0 at mount time to make more full NAT blocks, but in runtime, we check empty NAT blocks to load free nids without loading any NAT pages from disk. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: replace rw semaphore extent_tree_lock with mutex lockYunlei He2-12/+12
This patch replace rw semaphore extent_tree_lock with mutex lock for no read cases with this lock. Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: avoid m_flags overlay when allocating more data blocksKinglong Mee1-1/+1
When more than one data blocks are allocated, the F2FS_MAP_UNWRITTEN/MAPPED flags will be overlapped by F2FS_MAP_NEW at the later times. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: remove unsafe bitmap checkingHou Pengyang1-6/+0
proc A: proc B: - writeback_sb_inodes - __writeback_single_inode - do_writepages - f2fs_write_node_pages - f2fs_balance_fs_bg - write_checkpoint - build_free_nids - flush_nat_entries - __build_free_nids - __flush_nat_entry_set - ra_meta_pages - get_next_nat_page - current_nat_addr - set_to_next_nat [do nat_bitmap checking] - f2fs_change_bit For proc A, nat_bitmap and nat_bitmap_mir would be compared without lock_op and nm_i->nat_tree_lock, while proc B is changing nat_bitmap/nat_bitmap_ver in cp. So it is normal for nat_bitmap/nat_bitmap diffrence under such scenario. This patch fix this by removing the monitoring point. [Fix: 599a09b f2fs: check in-memory nat version bitmap] Signed-off-by: Hou Pengyang <houpengyang@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: init local extent_info to avoid stale stack info in tpHou Pengyang2-5/+5
To avoid such stale(fops, blk, len) info in f2fs_lookup_extent_tree_end tp dio-23095 [005] ...1 17878.856859: f2fs_lookup_extent_tree_end: dev = (259,30), ino = 856, pgofs = 0, ext_info(fofs: 3441207040, blk: 4294967232, len: 3481143808) Signed-off-by: Hou Pengyang <houpengyang@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: remove unnecessary condition check for write_checkpoint in f2fs_gcYunlong Song1-9/+3
Since has_not_enough_free_secs(sbi, 0, 0) must be true if has_not_enough_ free_secs(sbi, sec_freed, 0) is true, write_checkpoint is sure to execute in both conditions. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: check discard alignment only for SEQWRITE zonesJaegeuk Kim1-12/+12
For converntional zones, we don't need to align discard commands to exact zone size. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: wait for discard completion after submissionJaegeuk Kim1-3/+13
We don't need to wait for each discard commands when unmounting the image. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: much larger batched trim_fs jobJaegeuk Kim1-1/+1
We have a kernel thread to issue discard commands, so we can increase the number of batched discard sections. By default, now it becomes 4GB range. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27f2fs: avoid very large discard commandJaegeuk Kim2-2/+4
This patch adds MAX_DISCARD_BLOCKS() to avoid issuing too much large single discard command. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-26Merge tag 'for-linus-4.11-ofs2' of ↵Linus Torvalds9-31/+49
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs updates from Mike Marshall: "Orangefs: cleanups, a protocol fix and an added configuration button. Cleanups: - silence harmless integer overflow warning (from dan.carpenter@oracle.com) - Dan Carpenter influenced debugfs cleanups. - remove orangefs_backing_dev_info (from jack@suse.cz) Protocol fix: - fix buffer size mis-match between kernel space and user space New configuration button: - support readahead_readcnt parameter" * tag 'for-linus-4.11-ofs2' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: fix buffer size mis-match between kernel space and user space. orangefs: Dan Carpenter influenced cleanups... orangefs: Remove orangefs_backing_dev_info orangefs: Support readahead_readcnt parameter. orangefs: silence harmless integer overflow warning
2017-02-26Merge branch 'for-linus-4.11' of ↵Linus Torvalds41-1242/+1211
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has a series of fixes and cleanups that Dave Sterba has been collecting. There is a pretty big variety here, cleaning up internal APIs and fixing corner cases" * 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (124 commits) Btrfs: use the correct type when creating cow dio extent Btrfs: fix deadlock between dedup on same file and starting writeback btrfs: use btrfs_debug instead of pr_debug in transaction abort btrfs: btrfs_truncate_free_space_cache always allocates path btrfs: free-space-cache, clean up unnecessary root arguments btrfs: convert btrfs_inc_block_group_ro to accept fs_info btrfs: flush_space always takes fs_info->fs_root btrfs: pass fs_info to (more) routines that are only called with extent_root btrfs: qgroup: Move half of the qgroup accounting time out of commit trans btrfs: remove unused parameter from adjust_slots_upwards btrfs: remove unused parameters from __btrfs_write_out_cache btrfs: remove unused parameter from cleanup_write_cache_enospc btrfs: remove unused parameter from __add_inode_ref btrfs: remove unused parameter from clone_copy_inline_extent btrfs: remove unused parameters from btrfs_cmp_data btrfs: remove unused parameter from __add_inline_refs btrfs: remove unused parameters from scrub_setup_wr_ctx btrfs: remove unused parameter from create_snapshot btrfs: remove unused parameter from init_first_rw_device btrfs: remove unused parameter from __btrfs_alloc_chunk ...
2017-02-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds31-260/+327
Merge more updates from Andrew Morton: - almost all of the rest of MM - misc bits - KASAN updates - procfs - lib/ updates - checkpatch updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (124 commits) checkpatch: remove false unbalanced braces warning checkpatch: notice unbalanced else braces in a patch checkpatch: add another old address for the FSF checkpatch: update $logFunctions checkpatch: warn on logging continuations checkpatch: warn on embedded function names lib/lz4: remove back-compat wrappers fs/pstore: fs/squashfs: change usage of LZ4 to work with new LZ4 version crypto: change LZ4 modules to work with new LZ4 module version lib/decompress_unlz4: change module to work with new LZ4 module version lib: update LZ4 compressor module lib/test_sort.c: make it explicitly non-modular lib: add CONFIG_TEST_SORT to enable self-test of sort() rbtree: use designated initializers linux/kernel.h: fix DIV_ROUND_CLOSEST to support negative divisors lib/find_bit.c: micro-optimise find_next_*_bit lib: add module support to atomic64 tests lib: add module support to glob tests lib: add module support to crc32 tests kernel/ksysfs.c: add __ro_after_init to bin_attribute structure ...
2017-02-25Merge tag 'v4.10' of ↵Mike Marshall538-21240/+16359
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into for-next Linux 4.10
2017-02-25fs/pstore: fs/squashfs: change usage of LZ4 to work with new LZ4 versionSven Schmidt2-15/+19
Update fs/pstore and fs/squashfs to use the updated functions from the new LZ4 module. Link: http://lkml.kernel.org/r/1486321748-19085-5-git-send-email-4sschmid@informatik.uni-hamburg.de Signed-off-by: Sven Schmidt <4sschmid@informatik.uni-hamburg.de> Cc: Bongkyu Kim <bongkyu.kim@lge.com> Cc: Rui Salvaterra <rsalvaterra@gmail.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: David S. Miller <davem@davemloft.net> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Kees Cook <keescook@chromium.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-25procfs: use an enum for possible hidepid valuesLafcadio Wluiki3-6/+7
Previously, the hidepid parameter was checked by comparing literal integers 0, 1, 2. Let's add a proper enum for this, to make the checking more expressive: 0 → HIDEPID_OFF 1 → HIDEPID_NO_ACCESS 2 → HIDEPID_INVISIBLE This changes the internal labelling only, the userspace-facing interface remains unmodified, and still works with literal integers 0, 1, 2. No functional changes. Link: http://lkml.kernel.org/r/1484572984-13388-2-git-send-email-djalal@gmail.com Signed-off-by: Lafcadio Wluiki <wluikil@gmail.com> Signed-off-by: Djalal Harouni <tixxdz@gmail.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-25proc: less code duplication in /proc/*/cmdlineAlexey Dobriyan1-88/+56
After staring at this code for a while I've figured using small 2-entry array describing ARGV and ENVP is the way to address code duplication critique. Link: http://lkml.kernel.org/r/20170105185724.GA12027@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-25proc: use rb_entry()Geliang Tang1-5/+6
To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Link: http://lkml.kernel.org/r/4fd1f82818665705ce75c5156a060ae7caa8e0a9.1482160150.git.geliangtang@gmail.com Signed-off-by: Geliang Tang <geliangtang@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Juergen Gross <jgross@suse.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-25userfaultfd_copy: return -ENOSPC in case mm has goneMike Rapoport1-0/+2
In the non-cooperative userfaultfd case, the process exit may race with outstanding mcopy_atomic called by the uffd monitor. Returning -ENOSPC instead of -EINVAL when mm is already gone will allow uffd monitor to distinguish this case from other error conditions. Link: http://lkml.kernel.org/r/1485542673-24387-6-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-25userfaultfd: non-cooperative: add event for exit() notificationMike Rapoport1-0/+28
Allow userfaultfd monitor track termination of the processes that have memory backed by the uffd. [rppt@linux.vnet.ibm.com: add comment] Link: http://lkml.kernel.org/r/20170202135448.GB19804@rapoport-lnxLink: http://lkml.kernel.org/r/1485542673-24387-4-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-25userfaultfd: non-cooperative: add event for memory unmapsMike Rapoport3-3/+68
When a non-cooperative userfaultfd monitor copies pages in the background, it may encounter regions that were already unmapped. Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely changes in the virtual memory layout. Since there might be different uffd contexts for the affected VMAs, we first should create a temporary representation for the unmap event for each uffd context and then notify them one by one to the appropriate userfault file descriptors. The event notification occurs after the mmap_sem has been released. [arnd@arndb.de: fix nommu build] Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de [mhocko@suse.com: fix nommu build] Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-25mm: replace FAULT_FLAG_SIZE with parameter to huge_faultDave Jiang4-12/+20
Since the introduction of FAULT_FLAG_SIZE to the vm_fault flag, it has been somewhat painful with getting the flags set and removed at the correct locations. More than one kernel oops was introduced due to difficulties of getting the placement correctly. Remove the flag values and introduce an input parameter to huge_fault that indicates the size of the page entry. This makes the code easier to trace and should avoid the issues we see with the fault flags where removal of the flag was necessary in the fallback paths. Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-25mm,fs,dax: change ->pmd_fault to ->huge_faultDave Jiang5-42/+40
Patch series "1G transparent hugepage support for device dax", v2. The following series implements support for 1G trasparent hugepage on x86 for device dax. The bulk of the code was written by Mathew Wilcox a while back supporting transparent 1G hugepage for fs DAX. I have forward ported the relevant bits to 4.10-rc. The current submission has only the necessary code to support device DAX. Comments from Dan Williams: So the motivation and intended user of this functionality mirrors the motivation and users of 1GB page support in hugetlbfs. Given expected capacities of persistent memory devices an in-memory database may want to reduce tlb pressure beyond what they can already achieve with 2MB mappings of a device-dax file. We have customer feedback to that effect as Willy mentioned in his previous version of these patches [1]. [1]: https://lkml.org/lkml/2016/1/31/52 Comments from Nilesh @ Oracle: There are applications which have a process model; and if you assume 10,000 processes attempting to mmap all the 6TB memory available on a server; we are looking at the following: processes : 10,000 memory : 6TB pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB pmd @ 2M page size: 120,000 / 512 = ~240GB pud @ 1G page size: 240GB / 512 = ~480MB As you can see with 2M pages, this system will use up an exorbitant amount of DRAM to hold the page tables; but the 1G pages finally brings it down to a reasonable level. Memory sizes will keep increasing; so this number will keep increasing. An argument can be made to convert the applications from process model to thread model, but in the real world that may not be always practical. Hopefully this helps explain the use case where this is valuable. This patch (of 3): In preparation for adding the ability to handle PUD pages, convert vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The vm_fault structure is extended to include a union of the different page table pointers that may be needed, and three flag bits are reserved to indicate which type of pointer is in the union. [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()] Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path] Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-25mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmfDave Jiang22-97/+89
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-25userfaultfd: non-cooperative: rename *EVENT_MADVDONTNEED to *EVENT_REMOVEMike Rapoport1-7/+7
Patch series "userfaultfd: non-cooperative: add madvise() event for MADV_REMOVE request". These patches add notification of madvise(MADV_REMOVE) event to non-cooperative userfaultfd monitor. The first pacth renames EVENT_MADVDONTNEED to EVENT_REMOVE along with relevant functions and structures. Using _REMOVE instead of _MADVDONTNEED describes the event semantics more clearly and I hope it's not too late for such change in the ABI. This patch (of 3): The UFFD_EVENT_MADVDONTNEED purpose is to notify uffd monitor about removal of certain range from address space tracked by userfaultfd. Hence, UFFD_EVENT_REMOVE seems to better reflect the operation semantics. Respectively, 'madv_dn' field of uffd_msg is renamed to 'remove' and the madvise_userfault_dontneed callback is renamed to userfaultfd_remove. Link: http://lkml.kernel.org/r/1484814154-1557-2-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-25Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+10
Pull block updates and fixes from Jens Axboe: - NVMe updates and fixes that missed the first pull request. This includes bug fixes, and support for autonomous power management. - Fix from Christoph for missing clear of the request payload, causing a problem with (at least) the storvsc driver. - Further fixes for the queue/bdi life time issues from Jan. - The Kconfig mq scheduler update from me. - Fixing a use-after-free in dm-rq, spotted by Bart, introduced in this merge window. - Three fixes for nbd from Josef. - Bug fix from Omar, fixing a bug in sas transport code that oopses when bsg ioctls were used. From Omar. - Improvements to the queue restart and tag wait from from Omar. - Set of fixes for the sed/opal code from Scott. - Three trivial patches to cciss from Tobin * 'for-linus' of git://git.kernel.dk/linux-block: (41 commits) dm-rq: don't dereference request payload after ending request blk-mq-sched: separate mark hctx and queue restart operations blk-mq: use sbq wait queues instead of restart for driver tags block/sed-opal: Propagate original error message to userland. nvme/pci: re-check security protocol support after reset block/sed-opal: Introduce free_opal_dev to free the structure and clean up state nvme: detect NVMe controller in recent MacBooks nvme-rdma: add support for host_traddr nvmet-rdma: Fix error handling nvmet-rdma: use nvme cm status helper nvme-rdma: move nvme cm status helper to .h file nvme-fc: don't bother to validate ioccsz and iorcsz nvme/pci: No special case for queue busy on IO nvme/core: Fix race kicking freed request_queue nvme/pci: Disable on removal when disconnected nvme: Enable autonomous power state transitions nvme: Add a quirk mechanism that uses identify_ctrl nvme: make nvmf_register_transport require a create_ctrl callback nvme: Use CNS as 8-bit field and avoid endianness conversion nvme: add semicolon in nvme_command setting ...
2017-02-25nfs/nfsd/sunrpc: enforce transport requirements for NFSv4Jeff Layton2-6/+9
NFSv4 requires a transport "that is specified to avoid network congestion" (RFC 7530, section 3.1, paragraph 2). In practical terms, that means that you should not run NFSv4 over UDP. The server has never enforced that requirement, however. This patchset fixes this by adding a new flag to the svc_version that states that it has these transport requirements. With that, we can check that the transport has XPT_CONG_CTRL set before processing an RPC. If it doesn't we reject it with RPC_PROG_MISMATCH. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-24sunrpc: turn bitfield flags in svc_version into boolsJeff Layton4-5/+3
It's just simpler to read this way, IMO. Also, no need to explicitly set vs_hidden to false in the nfsacl ones. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-24nfsd: remove superfluous KERN_INFORasmus Villemoes1-1/+1
dprintk already provides a KERN_* prefix; this KERN_INFO just shows up as some odd characters in the output. Simplify the message a bit while we're there. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-24libceph, rbd, ceph: WRITE | ONDISK -> WRITEIlya Dryomov2-20/+9
CEPH_OSD_FLAG_ONDISK is set in account_request(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-24ceph: remove special ack vs commit behaviorIlya Dryomov6-103/+3
- ask for a commit reply instead of an ack reply in __ceph_pool_perm_get() - don't ask for both ack and commit replies in ceph_sync_write() - since just only one reply is requested now, i_unsafe_writes list will always be empty -- kill ceph_sync_write_wait() and go back to a standard ->evict_inode() Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-24f2fs: do SSR for node segments more aggresivelyJaegeuk Kim1-5/+10
This patch gives more SSR chances for node blocks. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-24f2fs: find data segments across all the typesJaegeuk Kim1-4/+11
Previously, if type is CURSEG_HOT_DATA, we only check CURSEG_HOT_DATA only. This patch fixes to search all the different types for SSR. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-24f2fs: do SSR in higher priorityJaegeuk Kim1-16/+1
Let's check SSR in prior to LFS allocation. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-24f2fs: do SSR for data when there is enough free spaceYunlong Song1-1/+1
In allocate_segment_by_default(), need_SSR() already detected it's time to do SSR. So, let's try to find victims for data segments more aggressively in time. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-24f2fs: node segment is prior to data segment selected victimHou Pengyang1-1/+11
As data segment gc may lead dnode dirty, so the greedy cost for data segment should be valid blocks * 2, that is data segment is prior to node segment. Signed-off-by: Hou Pengyang <houpengyang@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-24f2fs: put allocate_segment after refresh_sit_entryYunlong Song1-2/+3
SIT information should be updated before segment allocation, since SSR needs latest valid block information. Current code does not update the old_blkaddr info in sit_entry, so adjust the allocate_segment to its proper location. Commit 5e443818fa0b2a2845561ee25bec181424fb2889 ("f2fs: handle dirty segments inside refresh_sit_entry") puts it into wrong location. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-24Merge branch 'for-linus' of ↵Linus Torvalds22-199/+325
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "There is a lot here. A lot of these changes result in subtle user visible differences in kernel behavior. I don't expect anything will care but I will revert/fix things immediately if any regressions show up. From Seth Forshee there is a continuation of the work to make the vfs ready for unpriviled mounts. We had thought the previous changes prevented the creation of files outside of s_user_ns of a filesystem, but it turns we missed the O_CREAT path. Ooops. Pavel Tikhomirov and Oleg Nesterov worked together to fix a long standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only children that are forked after the prctl are considered and not children forked before the prctl. The only known user of this prctl systemd forks all children after the prctl. So no userspace regressions will occur. Holding earlier forked children to the same rules as later forked children creates a semantic that is sane enough to allow checkpoing of processes that use this feature. There is a long delayed change by Nikolay Borisov to limit inotify instances inside a user namespace. Michael Kerrisk extends the API for files used to maniuplate namespaces with two new trivial ioctls to allow discovery of the hierachy and properties of namespaces. Konstantin Khlebnikov with the help of Al Viro adds code that when a network namespace exits purges it's sysctl entries from the dcache. As in some circumstances this could use a lot of memory. Vivek Goyal fixed a bug with stacked filesystems where the permissions on the wrong inode were being checked. I continue previous work on ptracing across exec. Allowing a file to be setuid across exec while being ptraced if the tracer has enough credentials in the user namespace, and if the process has CAP_SETUID in it's own namespace. Proc files for setuid or otherwise undumpable executables are now owned by the root in the user namespace of their mm. Allowing debugging of setuid applications in containers to work better. A bug I introduced with permission checking and automount is now fixed. The big change is to mark the mounts that the kernel initiates as a result of an automount. This allows the permission checks in sget to be safely suppressed for this kind of mount. As the permission check happened when the original filesystem was mounted. Finally a special case in the mount namespace is removed preventing unbounded chains in the mount hash table, and making the semantics simpler which benefits CRIU. The vfs fix along with related work in ima and evm I believe makes us ready to finish developing and merge fully unprivileged mounts of the fuse filesystem. The cleanups of the mount namespace makes discussing how to fix the worst case complexity of umount. The stacked filesystem fixes pave the way for adding multiple mappings for the filesystem uids so that efficient and safer containers can be implemented" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc/sysctl: Don't grab i_lock under sysctl_lock. vfs: Use upper filesystem inode in bprm_fill_uid() proc/sysctl: prune stale dentries during unregistering mnt: Tuck mounts under others instead of creating shadow/side mounts. prctl: propagate has_child_subreaper flag to every descendant introduce the walk_process_tree() helper nsfs: Add an ioctl() to return owner UID of a userns fs: Better permission checking for submounts exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction vfs: open() with O_CREAT should not create inodes with unknown ids nsfs: Add an ioctl() to return the namespace type proc: Better ownership of files for non-dumpable tasks in user namespaces exec: Remove LSM_UNSAFE_PTRACE_CAP exec: Test the ptracer's saved cred to see if the tracee can gain caps exec: Don't reset euid and egid when the tracee has CAP_SETUID inotify: Convert to using per-namespace limits
2017-02-24NFSv4: fix getacl ERANGE for some ACL buffer sizesWeston Andros Adamson1-6/+2
We're not taking into account that the space needed for the (variable length) attr bitmap, with the result that we'd sometimes get a spurious ERANGE when the ACL data got close to the end of a page. Just add in an extra page to make sure. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>