From eca56ff906bdd0239485e8b47154a6e73dd9a2f3 Mon Sep 17 00:00:00 2001 From: Jerome Marchand Date: Thu, 14 Jan 2016 15:19:26 -0800 Subject: mm, shmem: add internal shmem resident memory accounting Currently looking at /proc//status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported "shmem-rss" from "file-rss". [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: Jerome Marchand Signed-off-by: Vlastimil Babka Acked-by: Konstantin Khlebnikov Acked-by: Michal Hocko Acked-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 00bad7793788..a8ab1fc0e9bc 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1361,10 +1361,26 @@ static inline void dec_mm_counter(struct mm_struct *mm, int member) atomic_long_dec(&mm->rss_stat.count[member]); } +/* Optimized variant when page is already known not to be PageAnon */ +static inline int mm_counter_file(struct page *page) +{ + if (PageSwapBacked(page)) + return MM_SHMEMPAGES; + return MM_FILEPAGES; +} + +static inline int mm_counter(struct page *page) +{ + if (PageAnon(page)) + return MM_ANONPAGES; + return mm_counter_file(page); +} + static inline unsigned long get_mm_rss(struct mm_struct *mm) { return get_mm_counter(mm, MM_FILEPAGES) + - get_mm_counter(mm, MM_ANONPAGES); + get_mm_counter(mm, MM_ANONPAGES) + + get_mm_counter(mm, MM_SHMEMPAGES); } static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm) -- cgit v1.2.3 From d07e22597d1d355829b7b18ac19afa912cf758d1 Mon Sep 17 00:00:00 2001 From: Daniel Cashman Date: Thu, 14 Jan 2016 15:19:53 -0800 Subject: mm: mmap: add new /proc tunable for mmap_base ASLR Address Space Layout Randomization (ASLR) provides a barrier to exploitation of user-space processes in the presence of security vulnerabilities by making it more difficult to find desired code/data which could help an attack. This is done by adding a random offset to the location of regions in the process address space, with a greater range of potential offset values corresponding to better protection/a larger search-space for brute force, but also to greater potential for fragmentation. The offset added to the mmap_base address, which provides the basis for the majority of the mappings for a process, is set once on process exec in arch_pick_mmap_layout() and is done via hard-coded per-arch values, which reflect, hopefully, the best compromise for all systems. The trade-off between increased entropy in the offset value generation and the corresponding increased variability in address space fragmentation is not absolute, however, and some platforms may tolerate higher amounts of entropy. This patch introduces both new Kconfig values and a sysctl interface which may be used to change the amount of entropy used for offset generation on a system. The direct motivation for this change was in response to the libstagefright vulnerabilities that affected Android, specifically to information provided by Google's project zero at: http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html The attack presented therein, by Google's project zero, specifically targeted the limited randomness used to generate the offset added to the mmap_base address in order to craft a brute-force-based attack. Concretely, the attack was against the mediaserver process, which was limited to respawning every 5 seconds, on an arm device. The hard-coded 8 bits used resulted in an average expected success rate of defeating the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a piece). With this patch, and an accompanying increase in the entropy value to 16 bits, the same attack would take an average expected time of over 45 hours (32768 tries), which makes it both less feasible and more likely to be noticed. The introduced Kconfig and sysctl options are limited by per-arch minimum and maximum values, the minimum of which was chosen to match the current hard-coded value and the maximum of which was chosen so as to give the greatest flexibility without generating an invalid mmap_base address, generally a 3-4 bits less than the number of bits in the user-space accessible virtual address space. When decided whether or not to change the default value, a system developer should consider that mmap_base address could be placed anywhere up to 2^(value) bits away from the non-randomized location, which would introduce variable-sized areas above and below the mmap_base address such that the maximum vm_area_struct size may be reduced, preventing very large allocations. This patch (of 4): ASLR only uses as few as 8 bits to generate the random offset for the mmap base address on 32 bit architectures. This value was chosen to prevent a poorly chosen value from dividing the address space in such a way as to prevent large allocations. This may not be an issue on all platforms. Allow the specification of a minimum number of bits so that platforms desiring greater ASLR protection may determine where to place the trade-off. Signed-off-by: Daniel Cashman Cc: Russell King Acked-by: Kees Cook Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Don Zickus Cc: Eric W. Biederman Cc: Heinrich Schuchardt Cc: Josh Poimboeuf Cc: Kirill A. Shutemov Cc: Naoya Horiguchi Cc: Andrea Arcangeli Cc: Mel Gorman Cc: Thomas Gleixner Cc: David Rientjes Cc: Mark Salyzyn Cc: Jeff Vander Stoep Cc: Nick Kralevich Cc: Catalin Marinas Cc: Will Deacon Cc: "H. Peter Anvin" Cc: Hector Marco-Gisbert Cc: Borislav Petkov Cc: Ralf Baechle Cc: Heiko Carstens Cc: Martin Schwidefsky Cc: Benjamin Herrenschmidt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 29 +++++++++++++++++++ arch/Kconfig | 68 +++++++++++++++++++++++++++++++++++++++++++++ include/linux/mm.h | 11 ++++++++ kernel/sysctl.c | 22 +++++++++++++++ mm/mmap.c | 12 ++++++++ 5 files changed, 142 insertions(+) (limited to 'include/linux/mm.h') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index f72370b440b1..ee763f3d3b52 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -42,6 +42,8 @@ Currently, these files are in /proc/sys/vm: - min_slab_ratio - min_unmapped_ratio - mmap_min_addr +- mmap_rnd_bits +- mmap_rnd_compat_bits - nr_hugepages - nr_overcommit_hugepages - nr_trim_pages (only if CONFIG_MMU=n) @@ -485,6 +487,33 @@ against future potential kernel bugs. ============================================================== +mmap_rnd_bits: + +This value can be used to select the number of bits to use to +determine the random offset to the base address of vma regions +resulting from mmap allocations on architectures which support +tuning address space randomization. This value will be bounded +by the architecture's minimum and maximum supported values. + +This value can be changed after boot using the +/proc/sys/vm/mmap_rnd_bits tunable + +============================================================== + +mmap_rnd_compat_bits: + +This value can be used to select the number of bits to use to +determine the random offset to the base address of vma regions +resulting from mmap allocations for applications run in +compatibility mode on architectures which support tuning address +space randomization. This value will be bounded by the +architecture's minimum and maximum supported values. + +This value can be changed after boot using the +/proc/sys/vm/mmap_rnd_compat_bits tunable + +============================================================== + nr_hugepages Change the minimum size of the hugepage pool. diff --git a/arch/Kconfig b/arch/Kconfig index 4e949e58b192..ba1b626bca00 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -511,6 +511,74 @@ config ARCH_HAS_ELF_RANDOMIZE - arch_mmap_rnd() - arch_randomize_brk() +config HAVE_ARCH_MMAP_RND_BITS + bool + help + An arch should select this symbol if it supports setting a variable + number of bits for use in establishing the base address for mmap + allocations, has MMU enabled and provides values for both: + - ARCH_MMAP_RND_BITS_MIN + - ARCH_MMAP_RND_BITS_MAX + +config ARCH_MMAP_RND_BITS_MIN + int + +config ARCH_MMAP_RND_BITS_MAX + int + +config ARCH_MMAP_RND_BITS_DEFAULT + int + +config ARCH_MMAP_RND_BITS + int "Number of bits to use for ASLR of mmap base address" if EXPERT + range ARCH_MMAP_RND_BITS_MIN ARCH_MMAP_RND_BITS_MAX + default ARCH_MMAP_RND_BITS_DEFAULT if ARCH_MMAP_RND_BITS_DEFAULT + default ARCH_MMAP_RND_BITS_MIN + depends on HAVE_ARCH_MMAP_RND_BITS + help + This value can be used to select the number of bits to use to + determine the random offset to the base address of vma regions + resulting from mmap allocations. This value will be bounded + by the architecture's minimum and maximum supported values. + + This value can be changed after boot using the + /proc/sys/vm/mmap_rnd_bits tunable + +config HAVE_ARCH_MMAP_RND_COMPAT_BITS + bool + help + An arch should select this symbol if it supports running applications + in compatibility mode, supports setting a variable number of bits for + use in establishing the base address for mmap allocations, has MMU + enabled and provides values for both: + - ARCH_MMAP_RND_COMPAT_BITS_MIN + - ARCH_MMAP_RND_COMPAT_BITS_MAX + +config ARCH_MMAP_RND_COMPAT_BITS_MIN + int + +config ARCH_MMAP_RND_COMPAT_BITS_MAX + int + +config ARCH_MMAP_RND_COMPAT_BITS_DEFAULT + int + +config ARCH_MMAP_RND_COMPAT_BITS + int "Number of bits to use for ASLR of mmap base address for compatible applications" if EXPERT + range ARCH_MMAP_RND_COMPAT_BITS_MIN ARCH_MMAP_RND_COMPAT_BITS_MAX + default ARCH_MMAP_RND_COMPAT_BITS_DEFAULT if ARCH_MMAP_RND_COMPAT_BITS_DEFAULT + default ARCH_MMAP_RND_COMPAT_BITS_MIN + depends on HAVE_ARCH_MMAP_RND_COMPAT_BITS + help + This value can be used to select the number of bits to use to + determine the random offset to the base address of vma regions + resulting from mmap allocations for compatible applications This + value will be bounded by the architecture's minimum and maximum + supported values. + + This value can be changed after boot using the + /proc/sys/vm/mmap_rnd_compat_bits tunable + config HAVE_COPY_THREAD_TLS bool help diff --git a/include/linux/mm.h b/include/linux/mm.h index a8ab1fc0e9bc..d396753c0577 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -51,6 +51,17 @@ extern int sysctl_legacy_va_layout; #define sysctl_legacy_va_layout 0 #endif +#ifdef CONFIG_HAVE_ARCH_MMAP_RND_BITS +extern const int mmap_rnd_bits_min; +extern const int mmap_rnd_bits_max; +extern int mmap_rnd_bits __read_mostly; +#endif +#ifdef CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS +extern const int mmap_rnd_compat_bits_min; +extern const int mmap_rnd_compat_bits_max; +extern int mmap_rnd_compat_bits __read_mostly; +#endif + #include #include #include diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 5faf89ac9ec0..c810f8afdb7f 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1568,6 +1568,28 @@ static struct ctl_table vm_table[] = { .mode = 0644, .proc_handler = proc_doulongvec_minmax, }, +#ifdef CONFIG_HAVE_ARCH_MMAP_RND_BITS + { + .procname = "mmap_rnd_bits", + .data = &mmap_rnd_bits, + .maxlen = sizeof(mmap_rnd_bits), + .mode = 0600, + .proc_handler = proc_dointvec_minmax, + .extra1 = (void *)&mmap_rnd_bits_min, + .extra2 = (void *)&mmap_rnd_bits_max, + }, +#endif +#ifdef CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS + { + .procname = "mmap_rnd_compat_bits", + .data = &mmap_rnd_compat_bits, + .maxlen = sizeof(mmap_rnd_compat_bits), + .mode = 0600, + .proc_handler = proc_dointvec_minmax, + .extra1 = (void *)&mmap_rnd_compat_bits_min, + .extra2 = (void *)&mmap_rnd_compat_bits_max, + }, +#endif { } }; diff --git a/mm/mmap.c b/mm/mmap.c index c311bfd8005b..f32b84ad621a 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -58,6 +58,18 @@ #define arch_rebalance_pgtables(addr, len) (addr) #endif +#ifdef CONFIG_HAVE_ARCH_MMAP_RND_BITS +const int mmap_rnd_bits_min = CONFIG_ARCH_MMAP_RND_BITS_MIN; +const int mmap_rnd_bits_max = CONFIG_ARCH_MMAP_RND_BITS_MAX; +int mmap_rnd_bits __read_mostly = CONFIG_ARCH_MMAP_RND_BITS; +#endif +#ifdef CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS +const int mmap_rnd_compat_bits_min = CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN; +const int mmap_rnd_compat_bits_max = CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX; +int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS; +#endif + + static void unmap_region(struct mm_struct *mm, struct vm_area_struct *vma, struct vm_area_struct *prev, unsigned long start, unsigned long end); -- cgit v1.2.3 From c20cd45eb01748f0fba77a504f956b000df4ea73 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 14 Jan 2016 15:20:12 -0800 Subject: mm: allow GFP_{FS,IO} for page_cache_read page cache allocation page_cache_read has been historically using page_cache_alloc_cold to allocate a new page. This means that mapping_gfp_mask is used as the base for the gfp_mask. Many filesystems are setting this mask to GFP_NOFS to prevent from fs recursion issues. page_cache_read is called from the vm_operations_struct::fault() context during the page fault. This context doesn't need the reclaim protection normally. ceph and ocfs2 which call filemap_fault from their fault handlers seem to be OK because they are not taking any fs lock before invoking generic implementation. xfs which takes XFS_MMAPLOCK_SHARED is safe from the reclaim recursion POV because this lock serializes truncate and punch hole with the page faults and it doesn't get involved in the reclaim. There is simply no reason to deliberately use a weaker allocation context when a __GFP_FS | __GFP_IO can be used. The GFP_NOFS protection might be even harmful. There is a push to fail GFP_NOFS allocations rather than loop within allocator indefinitely with a very limited reclaim ability. Once we start failing those requests the OOM killer might be triggered prematurely because the page cache allocation failure is propagated up the page fault path and end up in pagefault_out_of_memory. We cannot play with mapping_gfp_mask directly because that would be racy wrt. parallel page faults and it might interfere with other users who really rely on NOFS semantic from the stored gfp_mask. The mask is also inode proper so it would even be a layering violation. What we can do instead is to push the gfp_mask into struct vm_fault and allow fs layer to overwrite it should the callback need to be called with a different allocation context. Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO) because this should be safe from the page fault path normally. Why do we care about mapping_gfp_mask at all then? Because this doesn't hold only reclaim protection flags but it also might contain zone and movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have to respect those. Signed-off-by: Michal Hocko Reported-by: Tetsuo Handa Acked-by: Jan Kara Acked-by: Vlastimil Babka Cc: Tetsuo Handa Cc: Mel Gorman Cc: Dave Chinner Cc: Mark Fasheh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 4 ++++ mm/filemap.c | 9 ++++----- mm/memory.c | 17 +++++++++++++++++ 3 files changed, 25 insertions(+), 5 deletions(-) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index d396753c0577..ec9d4559514d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -236,10 +236,14 @@ extern pgprot_t protection_map[16]; * ->fault function. The vma's ->fault is responsible for returning a bitmask * of VM_FAULT_xxx flags that give details about how the fault was handled. * + * MM layer fills up gfp_mask for page allocations but fault handler might + * alter it if its implementation requires a different allocation context. + * * pgoff should be used in favour of virtual_address, if possible. */ struct vm_fault { unsigned int flags; /* FAULT_FLAG_xxx flags */ + gfp_t gfp_mask; /* gfp mask to be used for allocations */ pgoff_t pgoff; /* Logical page offset based on vma */ void __user *virtual_address; /* Faulting virtual address */ diff --git a/mm/filemap.c b/mm/filemap.c index 1bb007624b53..ff42d31c891a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1812,19 +1812,18 @@ EXPORT_SYMBOL(generic_file_read_iter); * This adds the requested page to the page cache if it isn't already there, * and schedules an I/O to read in its contents from disk. */ -static int page_cache_read(struct file *file, pgoff_t offset) +static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask) { struct address_space *mapping = file->f_mapping; struct page *page; int ret; do { - page = page_cache_alloc_cold(mapping); + page = __page_cache_alloc(gfp_mask|__GFP_COLD); if (!page) return -ENOMEM; - ret = add_to_page_cache_lru(page, mapping, offset, - mapping_gfp_constraint(mapping, GFP_KERNEL)); + ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & GFP_KERNEL); if (ret == 0) ret = mapping->a_ops->readpage(file, page); else if (ret == -EEXIST) @@ -2005,7 +2004,7 @@ no_cached_page: * We're only likely to ever get here if MADV_RANDOM is in * effect. */ - error = page_cache_read(file, offset); + error = page_cache_read(file, offset, vmf->gfp_mask); /* * The page we want has now been added to the page cache. diff --git a/mm/memory.c b/mm/memory.c index f7026c035940..d4e4d37c1989 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1938,6 +1938,20 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo copy_user_highpage(dst, src, va, vma); } +static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma) +{ + struct file *vm_file = vma->vm_file; + + if (vm_file) + return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO; + + /* + * Special mappings (e.g. VDSO) do not have any file so fake + * a default GFP_KERNEL for them. + */ + return GFP_KERNEL; +} + /* * Notify the address space that the page is about to become writable so that * it can prohibit this or wait for the page to get into an appropriate state. @@ -1953,6 +1967,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, vmf.virtual_address = (void __user *)(address & PAGE_MASK); vmf.pgoff = page->index; vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE; + vmf.gfp_mask = __get_fault_gfp_mask(vma); vmf.page = page; vmf.cow_page = NULL; @@ -2757,6 +2772,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address, vmf.pgoff = pgoff; vmf.flags = flags; vmf.page = NULL; + vmf.gfp_mask = __get_fault_gfp_mask(vma); vmf.cow_page = cow_page; ret = vma->vm_ops->fault(vma, &vmf); @@ -2923,6 +2939,7 @@ static void do_fault_around(struct vm_area_struct *vma, unsigned long address, vmf.pgoff = pgoff; vmf.max_pgoff = max_pgoff; vmf.flags = flags; + vmf.gfp_mask = __get_fault_gfp_mask(vma); vma->vm_ops->map_pages(vma, &vmf); } -- cgit v1.2.3 From 84638335900f1995495838fe1bd4870c43ec1f67 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Thu, 14 Jan 2016 15:22:07 -0800 Subject: mm: rework virtual memory accounting When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which testing the RLIMIT_DATA value to figure out if we're allowed to assign new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been commited that RLIMIT_DATA in a form it's implemented now doesn't do anything useful because most of user-space libraries use mmap() syscall for dynamic memory allocations. Linus suggested to convert RLIMIT_DATA rlimit into something suitable for anonymous memory accounting. But in this patch we go further, and the changes are bundled together as: * keep vma counting if CONFIG_PROC_FS=n, will be used for limits * replace mm->shared_vm with better defined mm->data_vm * account anonymous executable areas as executable * account file-backed growsdown/up areas as stack * drop struct file* argument from vm_stat_account * enforce RLIMIT_DATA for size of data areas This way code looks cleaner: now code/stack/data classification depends only on vm_flags state: VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc) VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk) VM_WRITE & ~VM_SHARED & !stack -> data (VmData) The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called "shared", but that might be strange beast like readonly-private or VM_IO area. - RLIMIT_AS limits whole address space "VmSize" - RLIMIT_STACK limits stack "VmStk" (but each vma individually) - RLIMIT_DATA now limits "VmData" Signed-off-by: Konstantin Khlebnikov Signed-off-by: Cyrill Gorcunov Cc: Quentin Casasnovas Cc: Vegard Nossum Acked-by: Linus Torvalds Cc: Willy Tarreau Cc: Andy Lutomirski Cc: Kees Cook Cc: Vladimir Davydov Cc: Pavel Emelyanov Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/ia64/kernel/perfmon.c | 3 +-- fs/proc/task_mmu.c | 7 +++--- include/linux/mm.h | 13 +++------- include/linux/mm_types.h | 2 +- kernel/fork.c | 5 ++-- mm/debug.c | 4 +-- mm/mmap.c | 63 +++++++++++++++++++++++----------------------- mm/mprotect.c | 8 ++++-- mm/mremap.c | 7 +++--- 9 files changed, 54 insertions(+), 58 deletions(-) (limited to 'include/linux/mm.h') diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c index 60e02f7747ff..9cd607b06964 100644 --- a/arch/ia64/kernel/perfmon.c +++ b/arch/ia64/kernel/perfmon.c @@ -2332,8 +2332,7 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t */ insert_vm_struct(mm, vma); - vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file, - vma_pages(vma)); + vm_stat_account(vma->vm_mm, vma->vm_flags, vma_pages(vma)); up_write(&task->mm->mmap_sem); /* diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 46d9619d0aea..a353b4c6e86e 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -23,7 +23,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) { - unsigned long data, text, lib, swap, ptes, pmds, anon, file, shmem; + unsigned long text, lib, swap, ptes, pmds, anon, file, shmem; unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss; anon = get_mm_counter(mm, MM_ANONPAGES); @@ -44,7 +44,6 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) if (hiwater_rss < mm->hiwater_rss) hiwater_rss = mm->hiwater_rss; - data = mm->total_vm - mm->shared_vm - mm->stack_vm; text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10; lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text; swap = get_mm_counter(mm, MM_SWAPENTS); @@ -76,7 +75,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) anon << (PAGE_SHIFT-10), file << (PAGE_SHIFT-10), shmem << (PAGE_SHIFT-10), - data << (PAGE_SHIFT-10), + mm->data_vm << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, ptes >> 10, pmds >> 10, @@ -97,7 +96,7 @@ unsigned long task_statm(struct mm_struct *mm, get_mm_counter(mm, MM_SHMEMPAGES); *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> PAGE_SHIFT; - *data = mm->total_vm - mm->shared_vm; + *data = mm->data_vm + mm->stack_vm; *resident = *shared + get_mm_counter(mm, MM_ANONPAGES); return mm->total_vm; } diff --git a/include/linux/mm.h b/include/linux/mm.h index ec9d4559514d..839d9e9a1c38 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1929,7 +1929,9 @@ extern void mm_drop_all_locks(struct mm_struct *mm); extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file); extern struct file *get_mm_exe_file(struct mm_struct *mm); -extern int may_expand_vm(struct mm_struct *mm, unsigned long npages); +extern bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long npages); +extern void vm_stat_account(struct mm_struct *, vm_flags_t, long npages); + extern struct vm_area_struct *_install_special_mapping(struct mm_struct *mm, unsigned long addr, unsigned long len, unsigned long flags, @@ -2147,15 +2149,6 @@ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, extern int apply_to_page_range(struct mm_struct *mm, unsigned long address, unsigned long size, pte_fn_t fn, void *data); -#ifdef CONFIG_PROC_FS -void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long); -#else -static inline void vm_stat_account(struct mm_struct *mm, - unsigned long flags, struct file *file, long pages) -{ - mm->total_vm += pages; -} -#endif /* CONFIG_PROC_FS */ #ifdef CONFIG_DEBUG_PAGEALLOC extern bool _debug_pagealloc_enabled; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 207890be93c8..6bc9a0ce2253 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -427,7 +427,7 @@ struct mm_struct { unsigned long total_vm; /* Total pages mapped */ unsigned long locked_vm; /* Pages that have PG_mlocked set */ unsigned long pinned_vm; /* Refcount permanently increased */ - unsigned long shared_vm; /* Shared pages (files) */ + unsigned long data_vm; /* VM_WRITE & ~VM_SHARED/GROWSDOWN */ unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE */ unsigned long stack_vm; /* VM_GROWSUP/DOWN */ unsigned long def_flags; diff --git a/kernel/fork.c b/kernel/fork.c index 51915842f1c0..2e391c754ae7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -414,7 +414,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm)); mm->total_vm = oldmm->total_vm; - mm->shared_vm = oldmm->shared_vm; + mm->data_vm = oldmm->data_vm; mm->exec_vm = oldmm->exec_vm; mm->stack_vm = oldmm->stack_vm; @@ -433,8 +433,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) struct file *file; if (mpnt->vm_flags & VM_DONTCOPY) { - vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file, - -vma_pages(mpnt)); + vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); continue; } charge = 0; diff --git a/mm/debug.c b/mm/debug.c index 668aa35191ca..5d2072ed8d5e 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -175,7 +175,7 @@ void dump_mm(const struct mm_struct *mm) "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n" "pgd %p mm_users %d mm_count %d nr_ptes %lu nr_pmds %lu map_count %d\n" "hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n" - "pinned_vm %lx shared_vm %lx exec_vm %lx stack_vm %lx\n" + "pinned_vm %lx data_vm %lx exec_vm %lx stack_vm %lx\n" "start_code %lx end_code %lx start_data %lx end_data %lx\n" "start_brk %lx brk %lx start_stack %lx\n" "arg_start %lx arg_end %lx env_start %lx env_end %lx\n" @@ -209,7 +209,7 @@ void dump_mm(const struct mm_struct *mm) mm_nr_pmds((struct mm_struct *)mm), mm->map_count, mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm, - mm->pinned_vm, mm->shared_vm, mm->exec_vm, mm->stack_vm, + mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm, mm->start_code, mm->end_code, mm->start_data, mm->end_data, mm->start_brk, mm->brk, mm->start_stack, mm->arg_start, mm->arg_end, mm->env_start, mm->env_end, diff --git a/mm/mmap.c b/mm/mmap.c index f32b84ad621a..b3f00b616b81 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1220,24 +1220,6 @@ none: return NULL; } -#ifdef CONFIG_PROC_FS -void vm_stat_account(struct mm_struct *mm, unsigned long flags, - struct file *file, long pages) -{ - const unsigned long stack_flags - = VM_STACK_FLAGS & (VM_GROWSUP|VM_GROWSDOWN); - - mm->total_vm += pages; - - if (file) { - mm->shared_vm += pages; - if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC) - mm->exec_vm += pages; - } else if (flags & stack_flags) - mm->stack_vm += pages; -} -#endif /* CONFIG_PROC_FS */ - /* * If a hint addr is less than mmap_min_addr change hint to be as * low as possible but still greater than mmap_min_addr @@ -1556,7 +1538,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long charged = 0; /* Check against address space limit. */ - if (!may_expand_vm(mm, len >> PAGE_SHIFT)) { + if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) { unsigned long nr_pages; /* @@ -1565,7 +1547,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, */ nr_pages = count_vma_pages_range(mm, addr, addr + len); - if (!may_expand_vm(mm, (len >> PAGE_SHIFT) - nr_pages)) + if (!may_expand_vm(mm, vm_flags, + (len >> PAGE_SHIFT) - nr_pages)) return -ENOMEM; } @@ -1664,7 +1647,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, out: perf_event_mmap(vma); - vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT); + vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT); if (vm_flags & VM_LOCKED) { if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))) @@ -2111,7 +2094,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns unsigned long new_start, actual_size; /* address space limit tests */ - if (!may_expand_vm(mm, grow)) + if (!may_expand_vm(mm, vma->vm_flags, grow)) return -ENOMEM; /* Stack limit test */ @@ -2208,8 +2191,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) spin_lock(&mm->page_table_lock); if (vma->vm_flags & VM_LOCKED) mm->locked_vm += grow; - vm_stat_account(mm, vma->vm_flags, - vma->vm_file, grow); + vm_stat_account(mm, vma->vm_flags, grow); anon_vma_interval_tree_pre_update_vma(vma); vma->vm_end = address; anon_vma_interval_tree_post_update_vma(vma); @@ -2284,8 +2266,7 @@ int expand_downwards(struct vm_area_struct *vma, spin_lock(&mm->page_table_lock); if (vma->vm_flags & VM_LOCKED) mm->locked_vm += grow; - vm_stat_account(mm, vma->vm_flags, - vma->vm_file, grow); + vm_stat_account(mm, vma->vm_flags, grow); anon_vma_interval_tree_pre_update_vma(vma); vma->vm_start = address; vma->vm_pgoff -= grow; @@ -2399,7 +2380,7 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma) if (vma->vm_flags & VM_ACCOUNT) nr_accounted += nrpages; - vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages); + vm_stat_account(mm, vma->vm_flags, -nrpages); vma = remove_vma(vma); } while (vma); vm_unacct_memory(nr_accounted); @@ -2769,7 +2750,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len) } /* Check against address space limits *after* clearing old maps... */ - if (!may_expand_vm(mm, len >> PAGE_SHIFT)) + if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT)) return -ENOMEM; if (mm->map_count > sysctl_max_map_count) @@ -2804,6 +2785,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len) out: perf_event_mmap(vma); mm->total_vm += len >> PAGE_SHIFT; + mm->data_vm += len >> PAGE_SHIFT; if (flags & VM_LOCKED) mm->locked_vm += (len >> PAGE_SHIFT); vma->vm_flags |= VM_SOFTDIRTY; @@ -2995,9 +2977,28 @@ out: * Return true if the calling process may expand its vm space by the passed * number of pages */ -int may_expand_vm(struct mm_struct *mm, unsigned long npages) +bool may_expand_vm(struct mm_struct *mm, vm_flags_t flags, unsigned long npages) { - return mm->total_vm + npages <= rlimit(RLIMIT_AS) >> PAGE_SHIFT; + if (mm->total_vm + npages > rlimit(RLIMIT_AS) >> PAGE_SHIFT) + return false; + + if ((flags & (VM_WRITE | VM_SHARED | (VM_STACK_FLAGS & + (VM_GROWSUP | VM_GROWSDOWN)))) == VM_WRITE) + return mm->data_vm + npages <= rlimit(RLIMIT_DATA); + + return true; +} + +void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages) +{ + mm->total_vm += npages; + + if ((flags & (VM_EXEC | VM_WRITE)) == VM_EXEC) + mm->exec_vm += npages; + else if (flags & (VM_STACK_FLAGS & (VM_GROWSUP | VM_GROWSDOWN))) + mm->stack_vm += npages; + else if ((flags & (VM_WRITE | VM_SHARED)) == VM_WRITE) + mm->data_vm += npages; } static int special_mapping_fault(struct vm_area_struct *vma, @@ -3079,7 +3080,7 @@ static struct vm_area_struct *__install_special_mapping( if (ret) goto out; - mm->total_vm += len >> PAGE_SHIFT; + vm_stat_account(mm, vma->vm_flags, len >> PAGE_SHIFT); perf_event_mmap(vma); diff --git a/mm/mprotect.c b/mm/mprotect.c index ef5be8eaab00..c764402c464f 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -278,6 +278,10 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, * even if read-only so there is no need to account for them here */ if (newflags & VM_WRITE) { + /* Check space limits when area turns into data. */ + if (!may_expand_vm(mm, newflags, nrpages) && + may_expand_vm(mm, oldflags, nrpages)) + return -ENOMEM; if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB| VM_SHARED|VM_NORESERVE))) { charged = nrpages; @@ -334,8 +338,8 @@ success: populate_vma_page_range(vma, start, end, NULL); } - vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); - vm_stat_account(mm, newflags, vma->vm_file, nrpages); + vm_stat_account(mm, oldflags, -nrpages); + vm_stat_account(mm, newflags, nrpages); perf_event_mmap(vma); return 0; diff --git a/mm/mremap.c b/mm/mremap.c index de824e72c3e8..e55b157865d5 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -317,7 +317,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, * If this were a serious issue, we'd add a flag to do_munmap(). */ hiwater_vm = mm->hiwater_vm; - vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT); + vm_stat_account(mm, vma->vm_flags, new_len >> PAGE_SHIFT); /* Tell pfnmap has moved from this vma */ if (unlikely(vma->vm_flags & VM_PFNMAP)) @@ -383,7 +383,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr, return ERR_PTR(-EAGAIN); } - if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT)) + if (!may_expand_vm(mm, vma->vm_flags, + (new_len - old_len) >> PAGE_SHIFT)) return ERR_PTR(-ENOMEM); if (vma->vm_flags & VM_ACCOUNT) { @@ -545,7 +546,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, goto out; } - vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages); + vm_stat_account(mm, vma->vm_flags, pages); if (vma->vm_flags & VM_LOCKED) { mm->locked_vm += pages; locked = true; -- cgit v1.2.3 From ddc58f27f9eee9117219936f77e90ad5b2e00e96 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 15 Jan 2016 16:52:56 -0800 Subject: mm: drop tail page refcounting Tail page refcounting is utterly complicated and painful to support. It uses ->_mapcount on tail pages to store how many times this page is pinned. get_page() bumps ->_mapcount on tail page in addition to ->_count on head. This information is required by split_huge_page() to be able to distribute pins from head of compound page to tails during the split. We will need ->_mapcount to account PTE mappings of subpages of the compound page. We eliminate need in current meaning of ->_mapcount in tail pages by forbidding split entirely if the page is pinned. The only user of tail page refcounting is THP which is marked BROKEN for now. Let's drop all this mess. It makes get_page() and put_page() much simpler. Signed-off-by: Kirill A. Shutemov Tested-by: Sasha Levin Tested-by: Aneesh Kumar K.V Acked-by: Vlastimil Babka Acked-by: Jerome Marchand Cc: Andrea Arcangeli Cc: Hugh Dickins Cc: Dave Hansen Cc: Mel Gorman Cc: Rik van Riel Cc: Naoya Horiguchi Cc: Steve Capper Cc: Johannes Weiner Cc: Michal Hocko Cc: Christoph Lameter Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/mips/mm/gup.c | 4 - arch/powerpc/mm/hugetlbpage.c | 13 +- arch/s390/mm/gup.c | 13 +- arch/sparc/mm/gup.c | 14 +-- arch/x86/mm/gup.c | 4 - include/linux/mm.h | 47 ++------ include/linux/mm_types.h | 17 +-- mm/gup.c | 34 +----- mm/huge_memory.c | 41 +------ mm/hugetlb.c | 2 +- mm/internal.h | 44 ------- mm/swap.c | 273 +++--------------------------------------- 12 files changed, 40 insertions(+), 466 deletions(-) (limited to 'include/linux/mm.h') diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c index 349995d19c7f..36a35115dc2e 100644 --- a/arch/mips/mm/gup.c +++ b/arch/mips/mm/gup.c @@ -87,8 +87,6 @@ static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end, do { VM_BUG_ON(compound_head(page) != head); pages[*nr] = page; - if (PageTail(page)) - get_huge_page_tail(page); (*nr)++; page++; refs++; @@ -153,8 +151,6 @@ static int gup_huge_pud(pud_t pud, unsigned long addr, unsigned long end, do { VM_BUG_ON(compound_head(page) != head); pages[*nr] = page; - if (PageTail(page)) - get_huge_page_tail(page); (*nr)++; page++; refs++; diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 61b8b7ccea4f..6bd3afa775d3 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -999,7 +999,7 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, { unsigned long mask; unsigned long pte_end; - struct page *head, *page, *tail; + struct page *head, *page; pte_t pte; int refs; @@ -1022,7 +1022,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, head = pte_page(pte); page = head + ((addr & (sz-1)) >> PAGE_SHIFT); - tail = page; do { VM_BUG_ON(compound_head(page) != head); pages[*nr] = page; @@ -1044,15 +1043,5 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, return 0; } - /* - * Any tail page need their mapcount reference taken before we - * return. - */ - while (refs--) { - if (PageTail(tail)) - get_huge_page_tail(tail); - tail++; - } - return 1; } diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c index 21c74a71e2ab..1a2cad52babf 100644 --- a/arch/s390/mm/gup.c +++ b/arch/s390/mm/gup.c @@ -55,7 +55,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { unsigned long mask, result; - struct page *head, *page, *tail; + struct page *head, *page; int refs; result = write ? 0 : _SEGMENT_ENTRY_PROTECT; @@ -67,7 +67,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, refs = 0; head = pmd_page(pmd); page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - tail = page; do { VM_BUG_ON(compound_head(page) != head); pages[*nr] = page; @@ -88,16 +87,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, return 0; } - /* - * Any tail page need their mapcount reference taken before we - * return. - */ - while (refs--) { - if (PageTail(tail)) - get_huge_page_tail(tail); - tail++; - } - return 1; } diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c index 2e5c4fc2daa9..9091c5daa2e1 100644 --- a/arch/sparc/mm/gup.c +++ b/arch/sparc/mm/gup.c @@ -56,8 +56,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, put_page(head); return 0; } - if (head != page) - get_huge_page_tail(page); pages[*nr] = page; (*nr)++; @@ -70,7 +68,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { - struct page *head, *page, *tail; + struct page *head, *page; int refs; if (!(pmd_val(pmd) & _PAGE_VALID)) @@ -82,7 +80,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, refs = 0; head = pmd_page(pmd); page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - tail = page; do { VM_BUG_ON(compound_head(page) != head); pages[*nr] = page; @@ -103,15 +100,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, return 0; } - /* Any tail page need their mapcount reference taken before we - * return. - */ - while (refs--) { - if (PageTail(tail)) - get_huge_page_tail(tail); - tail++; - } - return 1; } diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index ae9a37bf1371..65b0261d4bf3 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -136,8 +136,6 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, do { VM_BUG_ON_PAGE(compound_head(page) != head, page); pages[*nr] = page; - if (PageTail(page)) - get_huge_page_tail(page); (*nr)++; page++; refs++; @@ -212,8 +210,6 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr, do { VM_BUG_ON_PAGE(compound_head(page) != head, page); pages[*nr] = page; - if (PageTail(page)) - get_huge_page_tail(page); (*nr)++; page++; refs++; diff --git a/include/linux/mm.h b/include/linux/mm.h index 839d9e9a1c38..34387351930c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -466,44 +466,9 @@ static inline int page_count(struct page *page) return atomic_read(&compound_head(page)->_count); } -static inline bool __compound_tail_refcounted(struct page *page) -{ - return PageAnon(page) && !PageSlab(page) && !PageHeadHuge(page); -} - -/* - * This takes a head page as parameter and tells if the - * tail page reference counting can be skipped. - * - * For this to be safe, PageSlab and PageHeadHuge must remain true on - * any given page where they return true here, until all tail pins - * have been released. - */ -static inline bool compound_tail_refcounted(struct page *page) -{ - VM_BUG_ON_PAGE(!PageHead(page), page); - return __compound_tail_refcounted(page); -} - -static inline void get_huge_page_tail(struct page *page) -{ - /* - * __split_huge_page_refcount() cannot run from under us. - */ - VM_BUG_ON_PAGE(!PageTail(page), page); - VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); - VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page); - if (compound_tail_refcounted(compound_head(page))) - atomic_inc(&page->_mapcount); -} - -extern bool __get_page_tail(struct page *page); - static inline void get_page(struct page *page) { - if (unlikely(PageTail(page))) - if (likely(__get_page_tail(page))) - return; + page = compound_head(page); /* * Getting a normal page or the head of a compound page * requires to already have an elevated page->_count. @@ -528,7 +493,15 @@ static inline void init_page_count(struct page *page) atomic_set(&page->_count, 1); } -void put_page(struct page *page); +void __put_page(struct page *page); + +static inline void put_page(struct page *page) +{ + page = compound_head(page); + if (put_page_testzero(page)) + __put_page(page); +} + void put_pages_list(struct list_head *pages); void split_page(struct page *page, unsigned int order); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 6bc9a0ce2253..faf6fe88d6b3 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -81,20 +81,9 @@ struct page { union { /* - * Count of ptes mapped in - * mms, to show when page is - * mapped & limit reverse map - * searches. - * - * Used also for tail pages - * refcounting instead of - * _count. Tail pages cannot - * be mapped and keeping the - * tail page _count zero at - * all times guarantees - * get_page_unless_zero() will - * never succeed on tail - * pages. + * Count of ptes mapped in mms, to show + * when page is mapped & limit reverse + * map searches. */ atomic_t _mapcount; diff --git a/mm/gup.c b/mm/gup.c index 1c50b506888d..7017abea9fd6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -130,7 +130,7 @@ retry: } if (flags & FOLL_GET) - get_page_foll(page); + get_page(page); if (flags & FOLL_TOUCH) { if ((flags & FOLL_WRITE) && !pte_dirty(pte) && !PageDirty(page)) @@ -1153,7 +1153,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { - struct page *head, *page, *tail; + struct page *head, *page; int refs; if (write && !pmd_write(orig)) @@ -1162,7 +1162,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, refs = 0; head = pmd_page(orig); page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - tail = page; do { VM_BUG_ON_PAGE(compound_head(page) != head, page); pages[*nr] = page; @@ -1183,24 +1182,13 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, return 0; } - /* - * Any tail pages need their mapcount reference taken before we - * return. (This allows the THP code to bump their ref count when - * they are split into base pages). - */ - while (refs--) { - if (PageTail(tail)) - get_huge_page_tail(tail); - tail++; - } - return 1; } static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { - struct page *head, *page, *tail; + struct page *head, *page; int refs; if (write && !pud_write(orig)) @@ -1209,7 +1197,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, refs = 0; head = pud_page(orig); page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT); - tail = page; do { VM_BUG_ON_PAGE(compound_head(page) != head, page); pages[*nr] = page; @@ -1230,12 +1217,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, return 0; } - while (refs--) { - if (PageTail(tail)) - get_huge_page_tail(tail); - tail++; - } - return 1; } @@ -1244,7 +1225,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, struct page **pages, int *nr) { int refs; - struct page *head, *page, *tail; + struct page *head, *page; if (write && !pgd_write(orig)) return 0; @@ -1252,7 +1233,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, refs = 0; head = pgd_page(orig); page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT); - tail = page; do { VM_BUG_ON_PAGE(compound_head(page) != head, page); pages[*nr] = page; @@ -1273,12 +1253,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, return 0; } - while (refs--) { - if (PageTail(tail)) - get_huge_page_tail(tail); - tail++; - } - return 1; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 016b70ab5ed4..72fd53fe2b61 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1038,37 +1038,6 @@ unlock: spin_unlock(ptl); } -/* - * Save CONFIG_DEBUG_PAGEALLOC from faulting falsely on tail pages - * during copy_user_huge_page()'s copy_page_rep(): in the case when - * the source page gets split and a tail freed before copy completes. - * Called under pmd_lock of checked pmd, so safe from splitting itself. - */ -static void get_user_huge_page(struct page *page) -{ - if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) { - struct page *endpage = page + HPAGE_PMD_NR; - - atomic_add(HPAGE_PMD_NR, &page->_count); - while (++page < endpage) - get_huge_page_tail(page); - } else { - get_page(page); - } -} - -static void put_user_huge_page(struct page *page) -{ - if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) { - struct page *endpage = page + HPAGE_PMD_NR; - - while (page < endpage) - put_page(page++); - } else { - put_page(page); - } -} - static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, @@ -1221,7 +1190,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, ret |= VM_FAULT_WRITE; goto out_unlock; } - get_user_huge_page(page); + get_page(page); spin_unlock(ptl); alloc: if (transparent_hugepage_enabled(vma) && @@ -1242,7 +1211,7 @@ alloc: split_huge_pmd(vma, pmd, address); ret |= VM_FAULT_FALLBACK; } - put_user_huge_page(page); + put_page(page); } count_vm_event(THP_FAULT_FALLBACK); goto out; @@ -1253,7 +1222,7 @@ alloc: put_page(new_page); if (page) { split_huge_pmd(vma, pmd, address); - put_user_huge_page(page); + put_page(page); } else split_huge_pmd(vma, pmd, address); ret |= VM_FAULT_FALLBACK; @@ -1275,7 +1244,7 @@ alloc: spin_lock(ptl); if (page) - put_user_huge_page(page); + put_page(page); if (unlikely(!pmd_same(*pmd, orig_pmd))) { spin_unlock(ptl); mem_cgroup_cancel_charge(new_page, memcg, true); @@ -1360,7 +1329,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT; VM_BUG_ON_PAGE(!PageCompound(page), page); if (flags & FOLL_GET) - get_page_foll(page); + get_page(page); out: return page; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index e924529f7b38..84af842e828d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3865,7 +3865,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, same_page: if (pages) { pages[i] = mem_map_offset(page, pfn_offset); - get_page_foll(pages[i]); + get_page(pages[i]); } if (vmas) diff --git a/mm/internal.h b/mm/internal.h index 38e24b89e4c4..569facd1f6da 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -66,50 +66,6 @@ static inline void set_page_refcounted(struct page *page) set_page_count(page, 1); } -static inline void __get_page_tail_foll(struct page *page, - bool get_page_head) -{ - /* - * If we're getting a tail page, the elevated page->_count is - * required only in the head page and we will elevate the head - * page->_count and tail page->_mapcount. - * - * We elevate page_tail->_mapcount for tail pages to force - * page_tail->_count to be zero at all times to avoid getting - * false positives from get_page_unless_zero() with - * speculative page access (like in - * page_cache_get_speculative()) on tail pages. - */ - VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page); - if (get_page_head) - atomic_inc(&compound_head(page)->_count); - get_huge_page_tail(page); -} - -/* - * This is meant to be called as the FOLL_GET operation of - * follow_page() and it must be called while holding the proper PT - * lock while the pte (or pmd_trans_huge) is still mapping the page. - */ -static inline void get_page_foll(struct page *page) -{ - if (unlikely(PageTail(page))) - /* - * This is safe only because - * __split_huge_page_refcount() can't run under - * get_page_foll() because we hold the proper PT lock. - */ - __get_page_tail_foll(page, true); - else { - /* - * Getting a normal page or the head of a compound page - * requires to already have an elevated page->_count. - */ - VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page); - atomic_inc(&page->_count); - } -} - extern unsigned long highest_memmap_pfn; /* diff --git a/mm/swap.c b/mm/swap.c index 39395fb549c0..3d65480422e8 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -89,260 +89,14 @@ static void __put_compound_page(struct page *page) (*dtor)(page); } -/** - * Two special cases here: we could avoid taking compound_lock_irqsave - * and could skip the tail refcounting(in _mapcount). - * - * 1. Hugetlbfs page: - * - * PageHeadHuge will remain true until the compound page - * is released and enters the buddy allocator, and it could - * not be split by __split_huge_page_refcount(). - * - * So if we see PageHeadHuge set, and we have the tail page pin, - * then we could safely put head page. - * - * 2. Slab THP page: - * - * PG_slab is cleared before the slab frees the head page, and - * tail pin cannot be the last reference left on the head page, - * because the slab code is free to reuse the compound page - * after a kfree/kmem_cache_free without having to check if - * there's any tail pin left. In turn all tail pinsmust be always - * released while the head is still pinned by the slab code - * and so we know PG_slab will be still set too. - * - * So if we see PageSlab set, and we have the tail page pin, - * then we could safely put head page. - */ -static __always_inline -void put_unrefcounted_compound_page(struct page *page_head, struct page *page) -{ - /* - * If @page is a THP tail, we must read the tail page - * flags after the head page flags. The - * __split_huge_page_refcount side enforces write memory barriers - * between clearing PageTail and before the head page - * can be freed and reallocated. - */ - smp_rmb(); - if (likely(PageTail(page))) { - /* - * __split_huge_page_refcount cannot race - * here, see the comment above this function. - */ - VM_BUG_ON_PAGE(!PageHead(page_head), page_head); - if (put_page_testzero(page_head)) { - /* - * If this is the tail of a slab THP page, - * the tail pin must not be the last reference - * held on the page, because the PG_slab cannot - * be cleared before all tail pins (which skips - * the _mapcount tail refcounting) have been - * released. - * - * If this is the tail of a hugetlbfs page, - * the tail pin may be the last reference on - * the page instead, because PageHeadHuge will - * not go away until the compound page enters - * the buddy allocator. - */ - VM_BUG_ON_PAGE(PageSlab(page_head), page_head); - __put_compound_page(page_head); - } - } else - /* - * __split_huge_page_refcount run before us, - * @page was a THP tail. The split @page_head - * has been freed and reallocated as slab or - * hugetlbfs page of smaller order (only - * possible if reallocated as slab on x86). - */ - if (put_page_testzero(page)) - __put_single_page(page); -} - -static __always_inline -void put_refcounted_compound_page(struct page *page_head, struct page *page) -{ - if (likely(page != page_head && get_page_unless_zero(page_head))) { - unsigned long flags; - - /* - * @page_head wasn't a dangling pointer but it may not - * be a head page anymore by the time we obtain the - * lock. That is ok as long as it can't be freed from - * under us. - */ - flags = compound_lock_irqsave(page_head); - if (unlikely(!PageTail(page))) { - /* __split_huge_page_refcount run before us */ - compound_unlock_irqrestore(page_head, flags); - if (put_page_testzero(page_head)) { - /* - * The @page_head may have been freed - * and reallocated as a compound page - * of smaller order and then freed - * again. All we know is that it - * cannot have become: a THP page, a - * compound page of higher order, a - * tail page. That is because we - * still hold the refcount of the - * split THP tail and page_head was - * the THP head before the split. - */ - if (PageHead(page_head)) - __put_compound_page(page_head); - else - __put_single_page(page_head); - } -out_put_single: - if (put_page_testzero(page)) - __put_single_page(page); - return; - } - VM_BUG_ON_PAGE(page_head != compound_head(page), page); - /* - * We can release the refcount taken by - * get_page_unless_zero() now that - * __split_huge_page_refcount() is blocked on the - * compound_lock. - */ - if (put_page_testzero(page_head)) - VM_BUG_ON_PAGE(1, page_head); - /* __split_huge_page_refcount will wait now */ - VM_BUG_ON_PAGE(page_mapcount(page) <= 0, page); - atomic_dec(&page->_mapcount); - VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page_head); - VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page); - compound_unlock_irqrestore(page_head, flags); - - if (put_page_testzero(page_head)) { - if (PageHead(page_head)) - __put_compound_page(page_head); - else - __put_single_page(page_head); - } - } else { - /* @page_head is a dangling pointer */ - VM_BUG_ON_PAGE(PageTail(page), page); - goto out_put_single; - } -} - -static void put_compound_page(struct page *page) -{ - struct page *page_head; - - /* - * We see the PageCompound set and PageTail not set, so @page maybe: - * 1. hugetlbfs head page, or - * 2. THP head page. - */ - if (likely(!PageTail(page))) { - if (put_page_testzero(page)) { - /* - * By the time all refcounts have been released - * split_huge_page cannot run anymore from under us. - */ - if (PageHead(page)) - __put_compound_page(page); - else - __put_single_page(page); - } - return; - } - - /* - * We see the PageCompound set and PageTail set, so @page maybe: - * 1. a tail hugetlbfs page, or - * 2. a tail THP page, or - * 3. a split THP page. - * - * Case 3 is possible, as we may race with - * __split_huge_page_refcount tearing down a THP page. - */ - page_head = compound_head(page); - if (!__compound_tail_refcounted(page_head)) - put_unrefcounted_compound_page(page_head, page); - else - put_refcounted_compound_page(page_head, page); -} - -void put_page(struct page *page) +void __put_page(struct page *page) { if (unlikely(PageCompound(page))) - put_compound_page(page); - else if (put_page_testzero(page)) + __put_compound_page(page); + else __put_single_page(page); } -EXPORT_SYMBOL(put_page); - -/* - * This function is exported but must not be called by anything other - * than get_page(). It implements the slow path of get_page(). - */ -bool __get_page_tail(struct page *page) -{ - /* - * This takes care of get_page() if run on a tail page - * returned by one of the get_user_pages/follow_page variants. - * get_user_pages/follow_page itself doesn't need the compound - * lock because it runs __get_page_tail_foll() under the - * proper PT lock that already serializes against - * split_huge_page(). - */ - unsigned long flags; - bool got; - struct page *page_head = compound_head(page); - - /* Ref to put_compound_page() comment. */ - if (!__compound_tail_refcounted(page_head)) { - smp_rmb(); - if (likely(PageTail(page))) { - /* - * This is a hugetlbfs page or a slab - * page. __split_huge_page_refcount - * cannot race here. - */ - VM_BUG_ON_PAGE(!PageHead(page_head), page_head); - __get_page_tail_foll(page, true); - return true; - } else { - /* - * __split_huge_page_refcount run - * before us, "page" was a THP - * tail. The split page_head has been - * freed and reallocated as slab or - * hugetlbfs page of smaller order - * (only possible if reallocated as - * slab on x86). - */ - return false; - } - } - - got = false; - if (likely(page != page_head && get_page_unless_zero(page_head))) { - /* - * page_head wasn't a dangling pointer but it - * may not be a head page anymore by the time - * we obtain the lock. That is ok as long as it - * can't be freed from under us. - */ - flags = compound_lock_irqsave(page_head); - /* here __split_huge_page_refcount won't run anymore */ - if (likely(PageTail(page))) { - __get_page_tail_foll(page, false); - got = true; - } - compound_unlock_irqrestore(page_head, flags); - if (unlikely(!got)) - put_page(page_head); - } - return got; -} -EXPORT_SYMBOL(__get_page_tail); +EXPORT_SYMBOL(__put_page); /** * put_pages_list() - release a list of pages @@ -918,15 +672,6 @@ void release_pages(struct page **pages, int nr, bool cold) for (i = 0; i < nr; i++) { struct page *page = pages[i]; - if (unlikely(PageCompound(page))) { - if (zone) { - spin_unlock_irqrestore(&zone->lru_lock, flags); - zone = NULL; - } - put_compound_page(page); - continue; - } - /* * Make sure the IRQ-safe lock-holding time does not get * excessive with a continuous string of pages from the @@ -937,9 +682,19 @@ void release_pages(struct page **pages, int nr, bool cold) zone = NULL; } + page = compound_head(page); if (!put_page_testzero(page)) continue; + if (PageCompound(page)) { + if (zone) { + spin_unlock_irqrestore(&zone->lru_lock, flags); + zone = NULL; + } + __put_compound_page(page); + continue; + } + if (PageLRU(page)) { struct zone *pagezone = page_zone(page); -- cgit v1.2.3 From 3ac808fdd2b835547af81de75c813cf7ba2ab58f Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 15 Jan 2016 16:53:07 -0800 Subject: mm, thp: remove compound_lock() We are going to use migration entries to stabilize page counts. It means we don't need compound_lock() for that. Signed-off-by: Kirill A. Shutemov Tested-by: Sasha Levin Tested-by: Aneesh Kumar K.V Acked-by: Vlastimil Babka Acked-by: Jerome Marchand Cc: Andrea Arcangeli Cc: Hugh Dickins Cc: Dave Hansen Cc: Mel Gorman Cc: Rik van Riel Cc: Naoya Horiguchi Cc: Steve Capper Cc: Johannes Weiner Cc: Michal Hocko Cc: Christoph Lameter Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 35 ----------------------------------- include/linux/page-flags.h | 12 +----------- mm/debug.c | 3 --- mm/memcontrol.c | 11 +++-------- 4 files changed, 4 insertions(+), 57 deletions(-) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 34387351930c..70f59de2e288 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -410,41 +410,6 @@ static inline int is_vmalloc_or_module_addr(const void *x) extern void kvfree(const void *addr); -static inline void compound_lock(struct page *page) -{ -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - VM_BUG_ON_PAGE(PageSlab(page), page); - bit_spin_lock(PG_compound_lock, &page->flags); -#endif -} - -static inline void compound_unlock(struct page *page) -{ -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - VM_BUG_ON_PAGE(PageSlab(page), page); - bit_spin_unlock(PG_compound_lock, &page->flags); -#endif -} - -static inline unsigned long compound_lock_irqsave(struct page *page) -{ - unsigned long uninitialized_var(flags); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - local_irq_save(flags); - compound_lock(page); -#endif - return flags; -} - -static inline void compound_unlock_irqrestore(struct page *page, - unsigned long flags) -{ -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - compound_unlock(page); - local_irq_restore(flags); -#endif -} - /* * The atomic page->_mapcount, starts from -1: so that transitions * both from it and to it can be tracked, using atomic_inc_and_test diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 7bc7fd9c4c5c..0c42acca0338 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -101,9 +101,6 @@ enum pageflags { #ifdef CONFIG_MEMORY_FAILURE PG_hwpoison, /* hardware poisoned page. Don't touch */ #endif -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - PG_compound_lock, -#endif #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT) PG_young, PG_idle, @@ -613,12 +610,6 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) #define __PG_MLOCKED 0 #endif -#ifdef CONFIG_TRANSPARENT_HUGEPAGE -#define __PG_COMPOUND_LOCK (1 << PG_compound_lock) -#else -#define __PG_COMPOUND_LOCK 0 -#endif - /* * Flags checked when a page is freed. Pages being freed should not have * these flags set. It they are, there is a problem. @@ -628,8 +619,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) 1 << PG_private | 1 << PG_private_2 | \ 1 << PG_writeback | 1 << PG_reserved | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ - 1 << PG_unevictable | __PG_MLOCKED | \ - __PG_COMPOUND_LOCK) + 1 << PG_unevictable | __PG_MLOCKED) /* * Flags checked when a page is prepped for return by the page allocator. diff --git a/mm/debug.c b/mm/debug.c index 5d2072ed8d5e..03c9a13f9f6a 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -40,9 +40,6 @@ static const struct trace_print_flags pageflag_names[] = { #ifdef CONFIG_MEMORY_FAILURE {1UL << PG_hwpoison, "hwpoison" }, #endif -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - {1UL << PG_compound_lock, "compound_lock" }, -#endif #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT) {1UL << PG_young, "young" }, {1UL << PG_idle, "idle" }, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 311fd2b71bae..db2a5d9ad886 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2433,9 +2433,7 @@ void __memcg_kmem_uncharge(struct page *page, int order) /* * Because tail pages are not marked as "used", set it. We're under - * zone->lru_lock, 'splitting on pmd' and compound_lock. - * charge/uncharge will be never happen and move_account() is done under - * compound_lock(), so we don't have to take care of races. + * zone->lru_lock and migration entries setup in all page mappings. */ void mem_cgroup_split_huge_fixup(struct page *head) { @@ -4507,9 +4505,7 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma, * @from: mem_cgroup which the page is moved from. * @to: mem_cgroup which the page is moved to. @from != @to. * - * The caller must confirm following. - * - page is not on LRU (isolate_page() is useful.) - * - compound_lock is held when nr_pages > 1 + * The caller must make sure the page is not on LRU (isolate_page() is useful.) * * This function doesn't do "charge" to new cgroup and doesn't do "uncharge" * from old cgroup. @@ -4868,8 +4864,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, struct page *page; /* - * We don't take compound_lock() here but no race with splitting thp - * happens because: + * No race with splitting thp happens because: * - if pmd_trans_huge_lock() returns 1, the relevant thp is not * under splitting, which means there's no concurrent thp split, * - if another thread runs into split_huge_page() just after we -- cgit v1.2.3 From 53f9263baba69fc1630e3c780c4d11b72643f962 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 15 Jan 2016 16:53:42 -0800 Subject: mm: rework mapcount accounting to enable 4k mapping of THPs We're going to allow mapping of individual 4k pages of THP compound. It means we need to track mapcount on per small page basis. Straight-forward approach is to use ->_mapcount in all subpages to track how many time this subpage is mapped with PMDs or PTEs combined. But this is rather expensive: mapping or unmapping of a THP page with PMD would require HPAGE_PMD_NR atomic operations instead of single we have now. The idea is to store separately how many times the page was mapped as whole -- compound_mapcount. This frees up ->_mapcount in subpages to track PTE mapcount. We use the same approach as with compound page destructor and compound order to store compound_mapcount: use space in first tail page, ->mapping this time. Any time we map/unmap whole compound page (THP or hugetlb) -- we increment/decrement compound_mapcount. When we map part of compound page with PTE we operate on ->_mapcount of the subpage. page_mapcount() counts both: PTE and PMD mappings of the page. Basically, we have mapcount for a subpage spread over two counters. It makes tricky to detect when last mapcount for a page goes away. We introduced PageDoubleMap() for this. When we split THP PMD for the first time and there's other PMD mapping left we offset up ->_mapcount in all subpages by one and set PG_double_map on the compound page. These additional references go away with last compound_mapcount. This approach provides a way to detect when last mapcount goes away on per small page basis without introducing new overhead for most common cases. [akpm@linux-foundation.org: fix typo in comment] [mhocko@suse.com: ignore partial THP when moving task] Signed-off-by: Kirill A. Shutemov Tested-by: Aneesh Kumar K.V Acked-by: Jerome Marchand Cc: Sasha Levin Cc: Aneesh Kumar K.V Cc: Jerome Marchand Cc: Vlastimil Babka Cc: Andrea Arcangeli Cc: Hugh Dickins Cc: Dave Hansen Cc: Mel Gorman Cc: Rik van Riel Cc: Naoya Horiguchi Cc: Steve Capper Cc: Johannes Weiner Cc: Christoph Lameter Cc: David Rientjes Signed-off-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 26 +++++++++++- include/linux/mm_types.h | 1 + include/linux/page-flags.h | 36 +++++++++++++++++ include/linux/rmap.h | 4 +- mm/debug.c | 5 ++- mm/huge_memory.c | 2 +- mm/hugetlb.c | 4 +- mm/memcontrol.c | 8 ++++ mm/memory.c | 2 +- mm/migrate.c | 2 +- mm/page_alloc.c | 13 ++++-- mm/rmap.c | 98 +++++++++++++++++++++++++++++++++++----------- 12 files changed, 166 insertions(+), 35 deletions(-) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 70f59de2e288..67e0fab225e8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -410,6 +410,19 @@ static inline int is_vmalloc_or_module_addr(const void *x) extern void kvfree(const void *addr); +static inline atomic_t *compound_mapcount_ptr(struct page *page) +{ + return &page[1].compound_mapcount; +} + +static inline int compound_mapcount(struct page *page) +{ + if (!PageCompound(page)) + return 0; + page = compound_head(page); + return atomic_read(compound_mapcount_ptr(page)) + 1; +} + /* * The atomic page->_mapcount, starts from -1: so that transitions * both from it and to it can be tracked, using atomic_inc_and_test @@ -422,8 +435,17 @@ static inline void page_mapcount_reset(struct page *page) static inline int page_mapcount(struct page *page) { + int ret; VM_BUG_ON_PAGE(PageSlab(page), page); - return atomic_read(&page->_mapcount) + 1; + + ret = atomic_read(&page->_mapcount) + 1; + if (PageCompound(page)) { + page = compound_head(page); + ret += atomic_read(compound_mapcount_ptr(page)) + 1; + if (PageDoubleMap(page)) + ret--; + } + return ret; } static inline int page_count(struct page *page) @@ -934,7 +956,7 @@ static inline pgoff_t page_file_index(struct page *page) */ static inline int page_mapped(struct page *page) { - return atomic_read(&(page)->_mapcount) >= 0; + return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0; } /* diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index faf6fe88d6b3..809defe0597d 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -54,6 +54,7 @@ struct page { * see PAGE_MAPPING_ANON below. */ void *s_mem; /* slab first object */ + atomic_t compound_mapcount; /* first tail page */ }; /* Second double word */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 0c42acca0338..19724e6ebd26 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -126,6 +126,9 @@ enum pageflags { /* SLOB */ PG_slob_free = PG_private, + + /* Compound pages. Stored in first tail page's flags */ + PG_double_map = PG_private_2, }; #ifndef __GENERATING_BOUNDS_H @@ -523,10 +526,43 @@ static inline int PageTransTail(struct page *page) return PageTail(page); } +/* + * PageDoubleMap indicates that the compound page is mapped with PTEs as well + * as PMDs. + * + * This is required for optimization of rmap operations for THP: we can postpone + * per small page mapcount accounting (and its overhead from atomic operations) + * until the first PMD split. + * + * For the page PageDoubleMap means ->_mapcount in all sub-pages is offset up + * by one. This reference will go away with last compound_mapcount. + * + * See also __split_huge_pmd_locked() and page_remove_anon_compound_rmap(). + */ +static inline int PageDoubleMap(struct page *page) +{ + return PageHead(page) && test_bit(PG_double_map, &page[1].flags); +} + +static inline int TestSetPageDoubleMap(struct page *page) +{ + VM_BUG_ON_PAGE(!PageHead(page), page); + return test_and_set_bit(PG_double_map, &page[1].flags); +} + +static inline int TestClearPageDoubleMap(struct page *page) +{ + VM_BUG_ON_PAGE(!PageHead(page), page); + return test_and_clear_bit(PG_double_map, &page[1].flags); +} + #else TESTPAGEFLAG_FALSE(TransHuge) TESTPAGEFLAG_FALSE(TransCompound) TESTPAGEFLAG_FALSE(TransTail) +TESTPAGEFLAG_FALSE(DoubleMap) + TESTSETFLAG_FALSE(DoubleMap) + TESTCLEARFLAG_FALSE(DoubleMap) #endif /* diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 038b6e704d9b..ebf3750e42b2 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -183,9 +183,9 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *, void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long); -static inline void page_dup_rmap(struct page *page) +static inline void page_dup_rmap(struct page *page, bool compound) { - atomic_inc(&page->_mapcount); + atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount); } /* diff --git a/mm/debug.c b/mm/debug.c index 03c9a13f9f6a..f05b2d5d6481 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -79,9 +79,12 @@ static void dump_flags(unsigned long flags, void dump_page_badflags(struct page *page, const char *reason, unsigned long badflags) { - pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx\n", + pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx", page, atomic_read(&page->_count), page_mapcount(page), page->mapping, page->index); + if (PageCompound(page)) + pr_cont(" compound_mapcount: %d", compound_mapcount(page)); + pr_cont("\n"); BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS); dump_flags(page->flags, pageflag_names, ARRAY_SIZE(pageflag_names)); if (reason) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1de7ab5d1004..1588f688b75d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -989,7 +989,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, src_page = pmd_page(pmd); VM_BUG_ON_PAGE(!PageHead(src_page), src_page); get_page(src_page); - page_dup_rmap(src_page); + page_dup_rmap(src_page, true); add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); pmdp_set_wrprotect(src_mm, addr, src_pmd); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 84af842e828d..12908dcf5831 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3102,7 +3102,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, entry = huge_ptep_get(src_pte); ptepage = pte_page(entry); get_page(ptepage); - page_dup_rmap(ptepage); + page_dup_rmap(ptepage, true); set_huge_pte_at(dst, addr, dst_pte, entry); hugetlb_count_add(pages_per_huge_page(h), dst); } @@ -3585,7 +3585,7 @@ retry: ClearPagePrivate(page); hugepage_add_new_anon_rmap(page, vma, address); } else - page_dup_rmap(page); + page_dup_rmap(page, true); new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE) && (vma->vm_flags & VM_SHARED))); set_huge_pte_at(mm, address, ptep, new_pte); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3b8c845e0c2e..bee6b1c9fdce 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4899,6 +4899,14 @@ retry: switch (get_mctgt_type(vma, addr, ptent, &target)) { case MC_TARGET_PAGE: page = target.page; + /* + * We can have a part of the split pmd here. Moving it + * can be done but it would be too convoluted so simply + * ignore such a partial THP and keep it in original + * memcg. There should be somebody mapping the head. + */ + if (PageTransCompound(page)) + goto put; if (isolate_lru_page(page)) goto put; if (!mem_cgroup_move_account(page, false, diff --git a/mm/memory.c b/mm/memory.c index 3b656d1a8e07..9b0dbc2f0b9a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -864,7 +864,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, page = vm_normal_page(vma, addr, pte); if (page) { get_page(page); - page_dup_rmap(page); + page_dup_rmap(page, false); rss[mm_counter(page)]++; } diff --git a/mm/migrate.c b/mm/migrate.c index 3921f20f8de4..91545da23fd1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -165,7 +165,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma, if (PageAnon(new)) hugepage_add_anon_rmap(new, vma, addr); else - page_dup_rmap(new); + page_dup_rmap(new, true); } else if (PageAnon(new)) page_add_anon_rmap(new, vma, addr, false); else diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d02d6436add0..3221091da513 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -469,6 +469,7 @@ void prep_compound_page(struct page *page, unsigned int order) p->mapping = TAIL_MAPPING; set_compound_head(p, page); } + atomic_set(compound_mapcount_ptr(page), -1); } #ifdef CONFIG_DEBUG_PAGEALLOC @@ -733,7 +734,7 @@ static inline int free_pages_check(struct page *page) const char *bad_reason = NULL; unsigned long bad_flags = 0; - if (unlikely(page_mapcount(page))) + if (unlikely(atomic_read(&page->_mapcount) != -1)) bad_reason = "nonzero mapcount"; if (unlikely(page->mapping != NULL)) bad_reason = "non-NULL mapping"; @@ -857,7 +858,13 @@ static int free_tail_pages_check(struct page *head_page, struct page *page) ret = 0; goto out; } - if (page->mapping != TAIL_MAPPING) { + /* mapping in first tail page is used for compound_mapcount() */ + if (page - head_page == 1) { + if (unlikely(compound_mapcount(page))) { + bad_page(page, "nonzero compound_mapcount", 0); + goto out; + } + } else if (page->mapping != TAIL_MAPPING) { bad_page(page, "corrupted mapping in tail page", 0); goto out; } @@ -1335,7 +1342,7 @@ static inline int check_new_page(struct page *page) const char *bad_reason = NULL; unsigned long bad_flags = 0; - if (unlikely(page_mapcount(page))) + if (unlikely(atomic_read(&page->_mapcount) != -1)) bad_reason = "nonzero mapcount"; if (unlikely(page->mapping != NULL)) bad_reason = "non-NULL mapping"; diff --git a/mm/rmap.c b/mm/rmap.c index aa68a4089a53..2e6257165527 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1122,7 +1122,7 @@ static void __page_check_anon_rmap(struct page *page, * over the call to page_add_new_anon_rmap. */ BUG_ON(page_anon_vma(page)->root != vma->anon_vma->root); - BUG_ON(page->index != linear_page_index(vma, address)); + BUG_ON(page_to_pgoff(page) != linear_page_index(vma, address)); #endif } @@ -1152,9 +1152,28 @@ void page_add_anon_rmap(struct page *page, void do_page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address, int flags) { - int first = atomic_inc_and_test(&page->_mapcount); + bool compound = flags & RMAP_COMPOUND; + bool first; + + if (PageTransCompound(page)) { + VM_BUG_ON_PAGE(!PageLocked(page), page); + if (compound) { + atomic_t *mapcount; + + VM_BUG_ON_PAGE(!PageTransHuge(page), page); + mapcount = compound_mapcount_ptr(page); + first = atomic_inc_and_test(mapcount); + } else { + /* Anon THP always mapped first with PMD */ + first = 0; + VM_BUG_ON_PAGE(!page_mapcount(page), page); + atomic_inc(&page->_mapcount); + } + } else { + first = atomic_inc_and_test(&page->_mapcount); + } + if (first) { - bool compound = flags & RMAP_COMPOUND; int nr = compound ? hpage_nr_pages(page) : 1; /* * We use the irq-unsafe __{inc|mod}_zone_page_stat because @@ -1173,6 +1192,7 @@ void do_page_add_anon_rmap(struct page *page, return; VM_BUG_ON_PAGE(!PageLocked(page), page); + /* address might be in next vma when migration races vma_adjust */ if (first) __page_set_anon_rmap(page, vma, address, @@ -1199,10 +1219,16 @@ void page_add_new_anon_rmap(struct page *page, VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); SetPageSwapBacked(page); - atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */ if (compound) { VM_BUG_ON_PAGE(!PageTransHuge(page), page); + /* increment count (starts at -1) */ + atomic_set(compound_mapcount_ptr(page), 0); __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES); + } else { + /* Anon THP always mapped first with PMD */ + VM_BUG_ON_PAGE(PageTransCompound(page), page); + /* increment count (starts at -1) */ + atomic_set(&page->_mapcount, 0); } __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr); __page_set_anon_rmap(page, vma, address, 1); @@ -1232,12 +1258,15 @@ static void page_remove_file_rmap(struct page *page) memcg = mem_cgroup_begin_page_stat(page); - /* page still mapped by someone else? */ - if (!atomic_add_negative(-1, &page->_mapcount)) + /* Hugepages are not counted in NR_FILE_MAPPED for now. */ + if (unlikely(PageHuge(page))) { + /* hugetlb pages are always mapped with pmds */ + atomic_dec(compound_mapcount_ptr(page)); goto out; + } - /* Hugepages are not counted in NR_FILE_MAPPED for now. */ - if (unlikely(PageHuge(page))) + /* page still mapped by someone else? */ + if (!atomic_add_negative(-1, &page->_mapcount)) goto out; /* @@ -1254,6 +1283,39 @@ out: mem_cgroup_end_page_stat(memcg); } +static void page_remove_anon_compound_rmap(struct page *page) +{ + int i, nr; + + if (!atomic_add_negative(-1, compound_mapcount_ptr(page))) + return; + + /* Hugepages are not counted in NR_ANON_PAGES for now. */ + if (unlikely(PageHuge(page))) + return; + + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) + return; + + __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES); + + if (TestClearPageDoubleMap(page)) { + /* + * Subpages can be mapped with PTEs too. Check how many of + * themi are still mapped. + */ + for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) { + if (atomic_add_negative(-1, &page[i]._mapcount)) + nr++; + } + } else { + nr = HPAGE_PMD_NR; + } + + if (nr) + __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr); +} + /** * page_remove_rmap - take down pte mapping from a page * @page: page to remove mapping from @@ -1263,33 +1325,25 @@ out: */ void page_remove_rmap(struct page *page, bool compound) { - int nr = compound ? hpage_nr_pages(page) : 1; - if (!PageAnon(page)) { VM_BUG_ON_PAGE(compound && !PageHuge(page), page); page_remove_file_rmap(page); return; } + if (compound) + return page_remove_anon_compound_rmap(page); + /* page still mapped by someone else? */ if (!atomic_add_negative(-1, &page->_mapcount)) return; - /* Hugepages are not counted in NR_ANON_PAGES for now. */ - if (unlikely(PageHuge(page))) - return; - /* * We use the irq-unsafe __{inc|mod}_zone_page_stat because * these counters are not modified in interrupt context, and * pte lock(a spinlock) is held, which implies preemption disabled. */ - if (compound) { - VM_BUG_ON_PAGE(!PageTransHuge(page), page); - __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES); - } - - __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr); + __dec_zone_page_state(page, NR_ANON_PAGES); if (unlikely(PageMlocked(page))) clear_page_mlock(page); @@ -1710,7 +1764,7 @@ void hugepage_add_anon_rmap(struct page *page, BUG_ON(!PageLocked(page)); BUG_ON(!anon_vma); /* address might be in next vma when migration races vma_adjust */ - first = atomic_inc_and_test(&page->_mapcount); + first = atomic_inc_and_test(compound_mapcount_ptr(page)); if (first) __hugepage_set_anon_rmap(page, vma, address, 0); } @@ -1719,7 +1773,7 @@ void hugepage_add_new_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) { BUG_ON(address < vma->vm_start || address >= vma->vm_end); - atomic_set(&page->_mapcount, 0); + atomic_set(compound_mapcount_ptr(page), 0); __hugepage_set_anon_rmap(page, vma, address, 1); } #endif /* CONFIG_HUGETLB_PAGE */ -- cgit v1.2.3 From e1534ae95004d6a307839a44eed40389d608c935 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 15 Jan 2016 16:53:46 -0800 Subject: mm: differentiate page_mapped() from page_mapcount() for compound pages Let's define page_mapped() to be true for compound pages if any sub-pages of the compound page is mapped (with PMD or PTE). On other hand page_mapcount() return mapcount for this particular small page. This will make cases like page_get_anon_vma() behave correctly once we allow huge pages to be mapped with PTE. Most users outside core-mm should use page_mapcount() instead of page_mapped(). Signed-off-by: Kirill A. Shutemov Tested-by: Sasha Levin Tested-by: Aneesh Kumar K.V Acked-by: Jerome Marchand Cc: Vlastimil Babka Cc: Andrea Arcangeli Cc: Hugh Dickins Cc: Dave Hansen Cc: Mel Gorman Cc: Rik van Riel Cc: Naoya Horiguchi Cc: Steve Capper Cc: Johannes Weiner Cc: Michal Hocko Cc: Christoph Lameter Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arc/mm/cache.c | 4 ++-- arch/arm/mm/flush.c | 2 +- arch/mips/mm/c-r4k.c | 3 ++- arch/mips/mm/cache.c | 2 +- arch/mips/mm/init.c | 6 +++--- arch/sh/mm/cache-sh4.c | 2 +- arch/sh/mm/cache.c | 8 ++++---- arch/xtensa/mm/tlb.c | 2 +- fs/proc/page.c | 4 ++-- include/linux/mm.h | 15 +++++++++++++-- mm/filemap.c | 2 +- 11 files changed, 31 insertions(+), 19 deletions(-) (limited to 'include/linux/mm.h') diff --git a/arch/arc/mm/cache.c b/arch/arc/mm/cache.c index ff7ff6cbb811..b65f797e9ad6 100644 --- a/arch/arc/mm/cache.c +++ b/arch/arc/mm/cache.c @@ -617,7 +617,7 @@ void flush_dcache_page(struct page *page) */ if (!mapping_mapped(mapping)) { clear_bit(PG_dc_clean, &page->flags); - } else if (page_mapped(page)) { + } else if (page_mapcount(page)) { /* kernel reading from page with U-mapping */ phys_addr_t paddr = (unsigned long)page_address(page); @@ -857,7 +857,7 @@ void copy_user_highpage(struct page *to, struct page *from, * For !VIPT cache, all of this gets compiled out as * addr_not_cache_congruent() is 0 */ - if (page_mapped(from) && addr_not_cache_congruent(kfrom, u_vaddr)) { + if (page_mapcount(from) && addr_not_cache_congruent(kfrom, u_vaddr)) { __flush_dcache_page((unsigned long)kfrom, u_vaddr); clean_src_k_mappings = 1; } diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c index c2acebb7eedc..d0ba3551d49a 100644 --- a/arch/arm/mm/flush.c +++ b/arch/arm/mm/flush.c @@ -330,7 +330,7 @@ void flush_dcache_page(struct page *page) mapping = page_mapping(page); if (!cache_ops_need_broadcast() && - mapping && !page_mapped(page)) + mapping && !page_mapcount(page)) clear_bit(PG_dcache_clean, &page->flags); else { __flush_dcache_page(mapping, page); diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c index 5d3a25e1cfae..caac3d747a90 100644 --- a/arch/mips/mm/c-r4k.c +++ b/arch/mips/mm/c-r4k.c @@ -587,7 +587,8 @@ static inline void local_r4k_flush_cache_page(void *args) * another ASID than the current one. */ map_coherent = (cpu_has_dc_aliases && - page_mapped(page) && !Page_dcache_dirty(page)); + page_mapcount(page) && + !Page_dcache_dirty(page)); if (map_coherent) vaddr = kmap_coherent(page, addr); else diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c index aab218c36e0d..3f159caf6dbc 100644 --- a/arch/mips/mm/cache.c +++ b/arch/mips/mm/cache.c @@ -106,7 +106,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr) unsigned long addr = (unsigned long) page_address(page); if (pages_do_alias(addr, vmaddr)) { - if (page_mapped(page) && !Page_dcache_dirty(page)) { + if (page_mapcount(page) && !Page_dcache_dirty(page)) { void *kaddr; kaddr = kmap_coherent(page, vmaddr); diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c index 8770e619185e..7e5fa0938c21 100644 --- a/arch/mips/mm/init.c +++ b/arch/mips/mm/init.c @@ -165,7 +165,7 @@ void copy_user_highpage(struct page *to, struct page *from, vto = kmap_atomic(to); if (cpu_has_dc_aliases && - page_mapped(from) && !Page_dcache_dirty(from)) { + page_mapcount(from) && !Page_dcache_dirty(from)) { vfrom = kmap_coherent(from, vaddr); copy_page(vto, vfrom); kunmap_coherent(); @@ -187,7 +187,7 @@ void copy_to_user_page(struct vm_area_struct *vma, unsigned long len) { if (cpu_has_dc_aliases && - page_mapped(page) && !Page_dcache_dirty(page)) { + page_mapcount(page) && !Page_dcache_dirty(page)) { void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK); memcpy(vto, src, len); kunmap_coherent(); @@ -205,7 +205,7 @@ void copy_from_user_page(struct vm_area_struct *vma, unsigned long len) { if (cpu_has_dc_aliases && - page_mapped(page) && !Page_dcache_dirty(page)) { + page_mapcount(page) && !Page_dcache_dirty(page)) { void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK); memcpy(dst, vfrom, len); kunmap_coherent(); diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c index 51d8f7f31d1d..58aaa4f33b81 100644 --- a/arch/sh/mm/cache-sh4.c +++ b/arch/sh/mm/cache-sh4.c @@ -241,7 +241,7 @@ static void sh4_flush_cache_page(void *args) */ map_coherent = (current_cpu_data.dcache.n_aliases && test_bit(PG_dcache_clean, &page->flags) && - page_mapped(page)); + page_mapcount(page)); if (map_coherent) vaddr = kmap_coherent(page, address); else diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c index f770e3992620..e58cfbf45150 100644 --- a/arch/sh/mm/cache.c +++ b/arch/sh/mm/cache.c @@ -59,7 +59,7 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long vaddr, void *dst, const void *src, unsigned long len) { - if (boot_cpu_data.dcache.n_aliases && page_mapped(page) && + if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) && test_bit(PG_dcache_clean, &page->flags)) { void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK); memcpy(vto, src, len); @@ -78,7 +78,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct page *page, unsigned long vaddr, void *dst, const void *src, unsigned long len) { - if (boot_cpu_data.dcache.n_aliases && page_mapped(page) && + if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) && test_bit(PG_dcache_clean, &page->flags)) { void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK); memcpy(dst, vfrom, len); @@ -97,7 +97,7 @@ void copy_user_highpage(struct page *to, struct page *from, vto = kmap_atomic(to); - if (boot_cpu_data.dcache.n_aliases && page_mapped(from) && + if (boot_cpu_data.dcache.n_aliases && page_mapcount(from) && test_bit(PG_dcache_clean, &from->flags)) { vfrom = kmap_coherent(from, vaddr); copy_page(vto, vfrom); @@ -153,7 +153,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr) unsigned long addr = (unsigned long) page_address(page); if (pages_do_alias(addr, vmaddr)) { - if (boot_cpu_data.dcache.n_aliases && page_mapped(page) && + if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) && test_bit(PG_dcache_clean, &page->flags)) { void *kaddr; diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c index 5ece856c5725..35c822286bbe 100644 --- a/arch/xtensa/mm/tlb.c +++ b/arch/xtensa/mm/tlb.c @@ -245,7 +245,7 @@ static int check_tlb_entry(unsigned w, unsigned e, bool dtlb) page_mapcount(p)); if (!page_count(p)) rc |= TLB_INSANE; - else if (page_mapped(p)) + else if (page_mapcount(p)) rc |= TLB_SUSPICIOUS; } else { rc |= TLB_INSANE; diff --git a/fs/proc/page.c b/fs/proc/page.c index 93484034a03d..b2855eea5405 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -103,9 +103,9 @@ u64 stable_page_flags(struct page *page) * pseudo flags for the well known (anonymous) memory mapped pages * * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the - * simple test in page_mapped() is not enough. + * simple test in page_mapcount() is not enough. */ - if (!PageSlab(page) && page_mapped(page)) + if (!PageSlab(page) && page_mapcount(page)) u |= 1 << KPF_MMAP; if (PageAnon(page)) u |= 1 << KPF_ANON; diff --git a/include/linux/mm.h b/include/linux/mm.h index 67e0fab225e8..6b56cfd9fd09 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -953,10 +953,21 @@ static inline pgoff_t page_file_index(struct page *page) /* * Return true if this page is mapped into pagetables. + * For compound page it returns true if any subpage of compound page is mapped. */ -static inline int page_mapped(struct page *page) +static inline bool page_mapped(struct page *page) { - return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0; + int i; + if (likely(!PageCompound(page))) + return atomic_read(&page->_mapcount) >= 0; + page = compound_head(page); + if (atomic_read(compound_mapcount_ptr(page)) >= 0) + return true; + for (i = 0; i < hpage_nr_pages(page); i++) { + if (atomic_read(&page[i]._mapcount) >= 0) + return true; + } + return false; } /* diff --git a/mm/filemap.c b/mm/filemap.c index a729345ed6ec..847ee43c2806 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -204,7 +204,7 @@ void __delete_from_page_cache(struct page *page, void *shadow, __dec_zone_page_state(page, NR_FILE_PAGES); if (PageSwapBacked(page)) __dec_zone_page_state(page, NR_SHMEM); - BUG_ON(page_mapped(page)); + VM_BUG_ON_PAGE(page_mapped(page), page); /* * At this point page must be either written or cleaned by truncate. -- cgit v1.2.3 From 4e41a30c6d506c884d3da9aeb316352e70679d4b Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Fri, 15 Jan 2016 16:54:07 -0800 Subject: mm: hwpoison: adjust for new thp refcounting Some mm-related BUG_ON()s could trigger from hwpoison code due to recent changes in thp refcounting rule. This patch fixes them up. In the new refcounting, we no longer use tail->_mapcount to keep tail's refcount, and thereby we can simplify get/put_hwpoison_page(). And another change is that tail's refcount is not transferred to the raw page during thp split (more precisely, in new rule we don't take refcount on tail page any more.) So when we need thp split, we have to transfer the refcount properly to the 4kB soft-offlined page before migration. thp split code goes into core code only when precheck (total_mapcount(head) == page_count(head) - 1) passes to avoid useless split, where we assume that one refcount is held by the caller of thp split and the others are taken via mapping. To meet this assumption, this patch moves thp split part in soft_offline_page() after get_any_page(). [akpm@linux-foundation.org: remove unneeded #define, per Kirill] Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 +- mm/memory-failure.c | 73 +++++++++++++++-------------------------------------- 2 files changed, 22 insertions(+), 53 deletions(-) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 6b56cfd9fd09..e4397f640e86 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2217,7 +2217,7 @@ extern int memory_failure(unsigned long pfn, int trapno, int flags); extern void memory_failure_queue(unsigned long pfn, int trapno, int flags); extern int unpoison_memory(unsigned long pfn); extern int get_hwpoison_page(struct page *page); -extern void put_hwpoison_page(struct page *page); +#define put_hwpoison_page(page) put_page(page) extern int sysctl_memory_failure_early_kill; extern int sysctl_memory_failure_recovery; extern void shake_page(struct page *p, int access); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 05e079bf9425..1b99403d76f2 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -882,15 +882,7 @@ int get_hwpoison_page(struct page *page) { struct page *head = compound_head(page); - if (PageHuge(head)) - return get_page_unless_zero(head); - - /* - * Thp tail page has special refcounting rule (refcount of tail pages - * is stored in ->_mapcount,) so we can't call get_page_unless_zero() - * directly for tail pages. - */ - if (PageTransHuge(head)) { + if (!PageHuge(head) && PageTransHuge(head)) { /* * Non anonymous thp exists only in allocation/free time. We * can't handle such a case correctly, so let's give it up. @@ -902,41 +894,12 @@ int get_hwpoison_page(struct page *page) page_to_pfn(page)); return 0; } - - if (get_page_unless_zero(head)) { - if (PageTail(page)) - get_page(page); - return 1; - } else { - return 0; - } } - return get_page_unless_zero(page); + return get_page_unless_zero(head); } EXPORT_SYMBOL_GPL(get_hwpoison_page); -/** - * put_hwpoison_page() - Put refcount for memory error handling: - * @page: raw error page (hit by memory error) - */ -void put_hwpoison_page(struct page *page) -{ - struct page *head = compound_head(page); - - if (PageHuge(head)) { - put_page(head); - return; - } - - if (PageTransHuge(head)) - if (page != head) - put_page(head); - - put_page(page); -} -EXPORT_SYMBOL_GPL(put_hwpoison_page); - /* * Do all that is necessary to remove user space mappings. Unmap * the pages and send SIGBUS to the processes if the data was dirty. @@ -1162,6 +1125,8 @@ int memory_failure(unsigned long pfn, int trapno, int flags) return -EBUSY; } unlock_page(hpage); + get_hwpoison_page(p); + put_hwpoison_page(hpage); VM_BUG_ON_PAGE(!page_count(p), p); hpage = compound_head(p); } @@ -1753,24 +1718,28 @@ int soft_offline_page(struct page *page, int flags) put_hwpoison_page(page); return -EBUSY; } - if (!PageHuge(page) && PageTransHuge(hpage)) { - lock_page(page); - ret = split_huge_page(hpage); - unlock_page(page); - if (unlikely(ret)) { - pr_info("soft offline: %#lx: failed to split THP\n", - pfn); - if (flags & MF_COUNT_INCREASED) - put_hwpoison_page(page); - return -EBUSY; - } - } get_online_mems(); - ret = get_any_page(page, pfn, flags); put_online_mems(); + if (ret > 0) { /* for in-use pages */ + if (!PageHuge(page) && PageTransHuge(hpage)) { + lock_page(hpage); + ret = split_huge_page(hpage); + unlock_page(hpage); + if (unlikely(ret || PageTransCompound(page) || + !PageAnon(page))) { + pr_info("soft offline: %#lx: failed to split THP\n", + pfn); + if (flags & MF_COUNT_INCREASED) + put_hwpoison_page(hpage); + return -EBUSY; + } + get_hwpoison_page(page); + put_hwpoison_page(hpage); + } + if (PageHuge(page)) ret = soft_offline_huge_page(page, flags); else -- cgit v1.2.3 From 9a982250f773cc8c76f1eee68a770b7cbf2faf78 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 15 Jan 2016 16:54:17 -0800 Subject: thp: introduce deferred_split_huge_page() Currently we don't split huge page on partial unmap. It's not an ideal situation. It can lead to memory overhead. Furtunately, we can detect partial unmap on page_remove_rmap(). But we cannot call split_huge_page() from there due to locking context. It's also counterproductive to do directly from munmap() codepath: in many cases we will hit this from exit(2) and splitting the huge page just to free it up in small pages is not what we really want. The patch introduce deferred_split_huge_page() which put the huge page into queue for splitting. The splitting itself will happen when we get memory pressure via shrinker interface. The page will be dropped from list on freeing through compound page destructor. Signed-off-by: Kirill A. Shutemov Tested-by: Sasha Levin Tested-by: Aneesh Kumar K.V Acked-by: Vlastimil Babka Acked-by: Jerome Marchand Cc: Andrea Arcangeli Cc: Hugh Dickins Cc: Dave Hansen Cc: Mel Gorman Cc: Rik van Riel Cc: Naoya Horiguchi Cc: Steve Capper Cc: Johannes Weiner Cc: Michal Hocko Cc: Christoph Lameter Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 5 ++ include/linux/mm.h | 5 ++ include/linux/mm_types.h | 2 + mm/huge_memory.c | 139 +++++++++++++++++++++++++++++++++++++++++++++-- mm/migrate.c | 1 + mm/page_alloc.c | 27 ++++++--- mm/rmap.c | 7 ++- 7 files changed, 174 insertions(+), 12 deletions(-) (limited to 'include/linux/mm.h') diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 90e11e6a37ab..7aec5ee9cfdf 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -90,11 +90,15 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma); extern unsigned long transparent_hugepage_flags; +extern void prep_transhuge_page(struct page *page); +extern void free_transhuge_page(struct page *page); + int split_huge_page_to_list(struct page *page, struct list_head *list); static inline int split_huge_page(struct page *page) { return split_huge_page_to_list(page, NULL); } +void deferred_split_huge_page(struct page *page); void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long address); @@ -170,6 +174,7 @@ static inline int split_huge_page(struct page *page) { return 0; } +static inline void deferred_split_huge_page(struct page *page) {} #define split_huge_pmd(__vma, __pmd, __address) \ do { } while (0) static inline int hugepage_madvise(struct vm_area_struct *vma, diff --git a/include/linux/mm.h b/include/linux/mm.h index e4397f640e86..aa8ae8330a75 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -507,6 +507,9 @@ enum compound_dtor_id { COMPOUND_PAGE_DTOR, #ifdef CONFIG_HUGETLB_PAGE HUGETLB_PAGE_DTOR, +#endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + TRANSHUGE_PAGE_DTOR, #endif NR_COMPOUND_DTORS, }; @@ -537,6 +540,8 @@ static inline void set_compound_order(struct page *page, unsigned int order) page[1].compound_order = order; } +void free_compound_page(struct page *page); + #ifdef CONFIG_MMU /* * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 809defe0597d..2dd9c313a8c0 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -55,6 +55,7 @@ struct page { */ void *s_mem; /* slab first object */ atomic_t compound_mapcount; /* first tail page */ + /* page_deferred_list().next -- second tail page */ }; /* Second double word */ @@ -62,6 +63,7 @@ struct page { union { pgoff_t index; /* Our offset within mapping. */ void *freelist; /* sl[aou]b first free object */ + /* page_deferred_list().prev -- second tail page */ }; union { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b6ac6c43d6a4..4acf55b31f7c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -135,6 +135,10 @@ static struct khugepaged_scan khugepaged_scan = { .mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head), }; +static DEFINE_SPINLOCK(split_queue_lock); +static LIST_HEAD(split_queue); +static unsigned long split_queue_len; +static struct shrinker deferred_split_shrinker; static void set_recommended_min_free_kbytes(void) { @@ -667,6 +671,9 @@ static int __init hugepage_init(void) err = register_shrinker(&huge_zero_page_shrinker); if (err) goto err_hzp_shrinker; + err = register_shrinker(&deferred_split_shrinker); + if (err) + goto err_split_shrinker; /* * By default disable transparent hugepages on smaller systems, @@ -684,6 +691,8 @@ static int __init hugepage_init(void) return 0; err_khugepaged: + unregister_shrinker(&deferred_split_shrinker); +err_split_shrinker: unregister_shrinker(&huge_zero_page_shrinker); err_hzp_shrinker: khugepaged_slab_exit(); @@ -740,6 +749,27 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot) return entry; } +static inline struct list_head *page_deferred_list(struct page *page) +{ + /* + * ->lru in the tail pages is occupied by compound_head. + * Let's use ->mapping + ->index in the second tail page as list_head. + */ + return (struct list_head *)&page[2].mapping; +} + +void prep_transhuge_page(struct page *page) +{ + /* + * we use page->mapping and page->indexlru in second tail page + * as list_head: assuming THP order >= 2 + */ + BUILD_BUG_ON(HPAGE_PMD_ORDER < 2); + + INIT_LIST_HEAD(page_deferred_list(page)); + set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR); +} + static int __do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, @@ -896,6 +926,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, count_vm_event(THP_FAULT_FALLBACK); return VM_FAULT_FALLBACK; } + prep_transhuge_page(page); return __do_huge_pmd_anonymous_page(mm, vma, address, pmd, page, gfp, flags); } @@ -1192,7 +1223,9 @@ alloc: } else new_page = NULL; - if (unlikely(!new_page)) { + if (likely(new_page)) { + prep_transhuge_page(new_page); + } else { if (!page) { split_huge_pmd(vma, pmd, address); ret |= VM_FAULT_FALLBACK; @@ -2109,6 +2142,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm, return NULL; } + prep_transhuge_page(*hpage); count_vm_event(THP_COLLAPSE_ALLOC); return *hpage; } @@ -2120,8 +2154,12 @@ static int khugepaged_find_target_node(void) static inline struct page *alloc_hugepage(int defrag) { - return alloc_pages(alloc_hugepage_gfpmask(defrag, 0), - HPAGE_PMD_ORDER); + struct page *page; + + page = alloc_pages(alloc_hugepage_gfpmask(defrag, 0), HPAGE_PMD_ORDER); + if (page) + prep_transhuge_page(page); + return page; } static struct page *khugepaged_alloc_hugepage(bool *wait) @@ -3098,7 +3136,7 @@ static int __split_huge_page_tail(struct page *head, int tail, set_page_idle(page_tail); /* ->mapping in first tail page is compound_mapcount */ - VM_BUG_ON_PAGE(tail != 1 && page_tail->mapping != TAIL_MAPPING, + VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, page_tail); page_tail->mapping = head->mapping; @@ -3207,12 +3245,20 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) freeze_page(anon_vma, head); VM_BUG_ON_PAGE(compound_mapcount(head), head); + /* Prevent deferred_split_scan() touching ->_count */ + spin_lock(&split_queue_lock); count = page_count(head); mapcount = total_mapcount(head); if (mapcount == count - 1) { + if (!list_empty(page_deferred_list(head))) { + split_queue_len--; + list_del(page_deferred_list(head)); + } + spin_unlock(&split_queue_lock); __split_huge_page(page, list); ret = 0; } else if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount > count - 1) { + spin_unlock(&split_queue_lock); pr_alert("total_mapcount: %u, page_count(): %u\n", mapcount, count); if (PageTail(page)) @@ -3220,6 +3266,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) dump_page(page, "total_mapcount(head) > page_count(head) - 1"); BUG(); } else { + spin_unlock(&split_queue_lock); unfreeze_page(anon_vma, head); ret = -EBUSY; } @@ -3231,3 +3278,87 @@ out: count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED); return ret; } + +void free_transhuge_page(struct page *page) +{ + unsigned long flags; + + spin_lock_irqsave(&split_queue_lock, flags); + if (!list_empty(page_deferred_list(page))) { + split_queue_len--; + list_del(page_deferred_list(page)); + } + spin_unlock_irqrestore(&split_queue_lock, flags); + free_compound_page(page); +} + +void deferred_split_huge_page(struct page *page) +{ + unsigned long flags; + + VM_BUG_ON_PAGE(!PageTransHuge(page), page); + + spin_lock_irqsave(&split_queue_lock, flags); + if (list_empty(page_deferred_list(page))) { + list_add_tail(page_deferred_list(page), &split_queue); + split_queue_len++; + } + spin_unlock_irqrestore(&split_queue_lock, flags); +} + +static unsigned long deferred_split_count(struct shrinker *shrink, + struct shrink_control *sc) +{ + /* + * Split a page from split_queue will free up at least one page, + * at most HPAGE_PMD_NR - 1. We don't track exact number. + * Let's use HPAGE_PMD_NR / 2 as ballpark. + */ + return ACCESS_ONCE(split_queue_len) * HPAGE_PMD_NR / 2; +} + +static unsigned long deferred_split_scan(struct shrinker *shrink, + struct shrink_control *sc) +{ + unsigned long flags; + LIST_HEAD(list), *pos, *next; + struct page *page; + int split = 0; + + spin_lock_irqsave(&split_queue_lock, flags); + list_splice_init(&split_queue, &list); + + /* Take pin on all head pages to avoid freeing them under us */ + list_for_each_safe(pos, next, &list) { + page = list_entry((void *)pos, struct page, mapping); + page = compound_head(page); + /* race with put_compound_page() */ + if (!get_page_unless_zero(page)) { + list_del_init(page_deferred_list(page)); + split_queue_len--; + } + } + spin_unlock_irqrestore(&split_queue_lock, flags); + + list_for_each_safe(pos, next, &list) { + page = list_entry((void *)pos, struct page, mapping); + lock_page(page); + /* split_huge_page() removes page from list on success */ + if (!split_huge_page(page)) + split++; + unlock_page(page); + put_page(page); + } + + spin_lock_irqsave(&split_queue_lock, flags); + list_splice_tail(&list, &split_queue); + spin_unlock_irqrestore(&split_queue_lock, flags); + + return split * HPAGE_PMD_NR / 2; +} + +static struct shrinker deferred_split_shrinker = { + .count_objects = deferred_split_count, + .scan_objects = deferred_split_scan, + .seeks = DEFAULT_SEEKS, +}; diff --git a/mm/migrate.c b/mm/migrate.c index dec81a9e2fd6..b1034f9c77e7 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1760,6 +1760,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, HPAGE_PMD_ORDER); if (!new_page) goto out_fail; + prep_transhuge_page(new_page); isolated = numamigrate_isolate_page(pgdat, page); if (!isolated) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3221091da513..25409714160e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -222,13 +222,15 @@ static char * const zone_names[MAX_NR_ZONES] = { #endif }; -static void free_compound_page(struct page *page); compound_page_dtor * const compound_page_dtors[] = { NULL, free_compound_page, #ifdef CONFIG_HUGETLB_PAGE free_huge_page, #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + free_transhuge_page, +#endif }; int min_free_kbytes = 1024; @@ -450,7 +452,7 @@ out: * This usage means that zero-order pages may not be compound. */ -static void free_compound_page(struct page *page) +void free_compound_page(struct page *page) { __free_pages_ok(page, compound_order(page)); } @@ -858,15 +860,26 @@ static int free_tail_pages_check(struct page *head_page, struct page *page) ret = 0; goto out; } - /* mapping in first tail page is used for compound_mapcount() */ - if (page - head_page == 1) { + switch (page - head_page) { + case 1: + /* the first tail page: ->mapping is compound_mapcount() */ if (unlikely(compound_mapcount(page))) { bad_page(page, "nonzero compound_mapcount", 0); goto out; } - } else if (page->mapping != TAIL_MAPPING) { - bad_page(page, "corrupted mapping in tail page", 0); - goto out; + break; + case 2: + /* + * the second tail page: ->mapping is + * page_deferred_list().next -- ignore value. + */ + break; + default: + if (page->mapping != TAIL_MAPPING) { + bad_page(page, "corrupted mapping in tail page", 0); + goto out; + } + break; } if (unlikely(!PageTail(page))) { bad_page(page, "PageTail not set", 0); diff --git a/mm/rmap.c b/mm/rmap.c index fc707df92ede..84271cc39d1e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1282,8 +1282,10 @@ static void page_remove_anon_compound_rmap(struct page *page) nr = HPAGE_PMD_NR; } - if (nr) + if (nr) { __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr); + deferred_split_huge_page(page); + } } /** @@ -1318,6 +1320,9 @@ void page_remove_rmap(struct page *page, bool compound) if (unlikely(PageMlocked(page))) clear_page_mlock(page); + if (PageTransCompound(page)) + deferred_split_huge_page(compound_head(page)); + /* * It would be tidy to reset the PageAnon mapping here, * but that might overwrite a racing page_add_anon_rmap -- cgit v1.2.3 From b20ce5e03b936be077463015661dcf52be274e5b Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 15 Jan 2016 16:54:37 -0800 Subject: mm: prepare page_referenced() and page_idle to new THP refcounting Both page_referenced() and page_idle_clear_pte_refs_one() assume that THP can only be mapped with PMD, so there's no reason to look on PTEs for PageTransHuge() pages. That's no true anymore: THP can be mapped with PTEs too. The patch removes PageTransHuge() test from the functions and opencode page table check. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Kirill A. Shutemov Cc: Vladimir Davydov Cc: Andrea Arcangeli Cc: Hugh Dickins Cc: Naoya Horiguchi Cc: Sasha Levin Cc: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 5 --- include/linux/mm.h | 23 ++++++---- mm/huge_memory.c | 73 ++++++++---------------------- mm/page_idle.c | 65 +++++++++++++++++++++++---- mm/rmap.c | 117 +++++++++++++++++++++++++++++++++--------------- mm/util.c | 14 ++++++ 6 files changed, 185 insertions(+), 112 deletions(-) (limited to 'include/linux/mm.h') diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 7aec5ee9cfdf..72cd942edb22 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -48,11 +48,6 @@ enum transparent_hugepage_flag { #endif }; -extern pmd_t *page_check_address_pmd(struct page *page, - struct mm_struct *mm, - unsigned long address, - spinlock_t **ptl); - #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) #define HPAGE_PMD_NR (1<_mapcount, -1); } +int __page_mapcount(struct page *page); + static inline int page_mapcount(struct page *page) { - int ret; VM_BUG_ON_PAGE(PageSlab(page), page); - ret = atomic_read(&page->_mapcount) + 1; - if (PageCompound(page)) { - page = compound_head(page); - ret += atomic_read(compound_mapcount_ptr(page)) + 1; - if (PageDoubleMap(page)) - ret--; - } - return ret; + if (unlikely(PageCompound(page))) + return __page_mapcount(page); + return atomic_read(&page->_mapcount) + 1; +} + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +int total_mapcount(struct page *page); +#else +static inline int total_mapcount(struct page *page) +{ + return page_mapcount(page); } +#endif static inline int page_count(struct page *page) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f283cb7c480e..ab544b145b52 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1649,46 +1649,6 @@ bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma, return false; } -/* - * This function returns whether a given @page is mapped onto the @address - * in the virtual space of @mm. - * - * When it's true, this function returns *pmd with holding the page table lock - * and passing it back to the caller via @ptl. - * If it's false, returns NULL without holding the page table lock. - */ -pmd_t *page_check_address_pmd(struct page *page, - struct mm_struct *mm, - unsigned long address, - spinlock_t **ptl) -{ - pgd_t *pgd; - pud_t *pud; - pmd_t *pmd; - - if (address & ~HPAGE_PMD_MASK) - return NULL; - - pgd = pgd_offset(mm, address); - if (!pgd_present(*pgd)) - return NULL; - pud = pud_offset(pgd, address); - if (!pud_present(*pud)) - return NULL; - pmd = pmd_offset(pud, address); - - *ptl = pmd_lock(mm, pmd); - if (!pmd_present(*pmd)) - goto unlock; - if (pmd_page(*pmd) != page) - goto unlock; - if (pmd_trans_huge(*pmd)) - return pmd; -unlock: - spin_unlock(*ptl); - return NULL; -} - #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE) int hugepage_madvise(struct vm_area_struct *vma, @@ -3097,20 +3057,6 @@ static void unfreeze_page(struct anon_vma *anon_vma, struct page *page) } } -static int total_mapcount(struct page *page) -{ - int i, ret; - - ret = compound_mapcount(page); - for (i = 0; i < HPAGE_PMD_NR; i++) - ret += atomic_read(&page[i]._mapcount) + 1; - - if (PageDoubleMap(page)) - ret -= HPAGE_PMD_NR; - - return ret; -} - static int __split_huge_page_tail(struct page *head, int tail, struct lruvec *lruvec, struct list_head *list) { @@ -3211,6 +3157,25 @@ static void __split_huge_page(struct page *page, struct list_head *list) } } +int total_mapcount(struct page *page) +{ + int i, ret; + + VM_BUG_ON_PAGE(PageTail(page), page); + + if (likely(!PageCompound(page))) + return atomic_read(&page->_mapcount) + 1; + + ret = compound_mapcount(page); + if (PageHuge(page)) + return ret; + for (i = 0; i < HPAGE_PMD_NR; i++) + ret += atomic_read(&page[i]._mapcount) + 1; + if (PageDoubleMap(page)) + ret -= HPAGE_PMD_NR; + return ret; +} + /* * This function splits huge page into normal pages. @page can point to any * subpage of huge page to split. Split doesn't change the position of @page. diff --git a/mm/page_idle.c b/mm/page_idle.c index 1c245d9027e3..2c553ba969f8 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -56,23 +56,70 @@ static int page_idle_clear_pte_refs_one(struct page *page, { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; + pgd_t *pgd; + pud_t *pud; pmd_t *pmd; pte_t *pte; bool referenced = false; - if (unlikely(PageTransHuge(page))) { - pmd = page_check_address_pmd(page, mm, addr, &ptl); - if (pmd) { - referenced = pmdp_clear_young_notify(vma, addr, pmd); + pgd = pgd_offset(mm, addr); + if (!pgd_present(*pgd)) + return SWAP_AGAIN; + pud = pud_offset(pgd, addr); + if (!pud_present(*pud)) + return SWAP_AGAIN; + pmd = pmd_offset(pud, addr); + + if (pmd_trans_huge(*pmd)) { + ptl = pmd_lock(mm, pmd); + if (!pmd_present(*pmd)) + goto unlock_pmd; + if (unlikely(!pmd_trans_huge(*pmd))) { spin_unlock(ptl); + goto map_pte; } + + if (pmd_page(*pmd) != page) + goto unlock_pmd; + + referenced = pmdp_clear_young_notify(vma, addr, pmd); + spin_unlock(ptl); + goto found; +unlock_pmd: + spin_unlock(ptl); + return SWAP_AGAIN; } else { - pte = page_check_address(page, mm, addr, &ptl, 0); - if (pte) { - referenced = ptep_clear_young_notify(vma, addr, pte); - pte_unmap_unlock(pte, ptl); - } + pmd_t pmde = *pmd; + + barrier(); + if (!pmd_present(pmde) || pmd_trans_huge(pmde)) + return SWAP_AGAIN; + + } +map_pte: + pte = pte_offset_map(pmd, addr); + if (!pte_present(*pte)) { + pte_unmap(pte); + return SWAP_AGAIN; } + + ptl = pte_lockptr(mm, pmd); + spin_lock(ptl); + + if (!pte_present(*pte)) { + pte_unmap_unlock(pte, ptl); + return SWAP_AGAIN; + } + + /* THP can be referenced by any subpage */ + if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) { + pte_unmap_unlock(pte, ptl); + return SWAP_AGAIN; + } + + referenced = ptep_clear_young_notify(vma, addr, pte); + pte_unmap_unlock(pte, ptl); +found: if (referenced) { clear_page_idle(page); /* diff --git a/mm/rmap.c b/mm/rmap.c index 31d8866fb562..6127c00b2262 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -814,58 +814,105 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma, spinlock_t *ptl; int referenced = 0; struct page_referenced_arg *pra = arg; + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; - if (unlikely(PageTransHuge(page))) { - pmd_t *pmd; - - /* - * rmap might return false positives; we must filter - * these out using page_check_address_pmd(). - */ - pmd = page_check_address_pmd(page, mm, address, &ptl); - if (!pmd) + if (unlikely(PageHuge(page))) { + /* when pud is not present, pte will be NULL */ + pte = huge_pte_offset(mm, address); + if (!pte) return SWAP_AGAIN; - if (vma->vm_flags & VM_LOCKED) { + ptl = huge_pte_lockptr(page_hstate(page), mm, pte); + goto check_pte; + } + + pgd = pgd_offset(mm, address); + if (!pgd_present(*pgd)) + return SWAP_AGAIN; + pud = pud_offset(pgd, address); + if (!pud_present(*pud)) + return SWAP_AGAIN; + pmd = pmd_offset(pud, address); + + if (pmd_trans_huge(*pmd)) { + int ret = SWAP_AGAIN; + + ptl = pmd_lock(mm, pmd); + if (!pmd_present(*pmd)) + goto unlock_pmd; + if (unlikely(!pmd_trans_huge(*pmd))) { spin_unlock(ptl); + goto map_pte; + } + + if (pmd_page(*pmd) != page) + goto unlock_pmd; + + if (vma->vm_flags & VM_LOCKED) { pra->vm_flags |= VM_LOCKED; - return SWAP_FAIL; /* To break the loop */ + ret = SWAP_FAIL; /* To break the loop */ + goto unlock_pmd; } if (pmdp_clear_flush_young_notify(vma, address, pmd)) referenced++; spin_unlock(ptl); + goto found; +unlock_pmd: + spin_unlock(ptl); + return ret; } else { - pte_t *pte; + pmd_t pmde = *pmd; - /* - * rmap might return false positives; we must filter - * these out using page_check_address(). - */ - pte = page_check_address(page, mm, address, &ptl, 0); - if (!pte) + barrier(); + if (!pmd_present(pmde) || pmd_trans_huge(pmde)) return SWAP_AGAIN; + } +map_pte: + pte = pte_offset_map(pmd, address); + if (!pte_present(*pte)) { + pte_unmap(pte); + return SWAP_AGAIN; + } - if (vma->vm_flags & VM_LOCKED) { - pte_unmap_unlock(pte, ptl); - pra->vm_flags |= VM_LOCKED; - return SWAP_FAIL; /* To break the loop */ - } + ptl = pte_lockptr(mm, pmd); +check_pte: + spin_lock(ptl); - if (ptep_clear_flush_young_notify(vma, address, pte)) { - /* - * Don't treat a reference through a sequentially read - * mapping as such. If the page has been used in - * another mapping, we will catch it; if this other - * mapping is already gone, the unmap path will have - * set PG_referenced or activated the page. - */ - if (likely(!(vma->vm_flags & VM_SEQ_READ))) - referenced++; - } + if (!pte_present(*pte)) { + pte_unmap_unlock(pte, ptl); + return SWAP_AGAIN; + } + + /* THP can be referenced by any subpage */ + if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) { + pte_unmap_unlock(pte, ptl); + return SWAP_AGAIN; + } + + if (vma->vm_flags & VM_LOCKED) { pte_unmap_unlock(pte, ptl); + pra->vm_flags |= VM_LOCKED; + return SWAP_FAIL; /* To break the loop */ } + if (ptep_clear_flush_young_notify(vma, address, pte)) { + /* + * Don't treat a reference through a sequentially read + * mapping as such. If the page has been used in + * another mapping, we will catch it; if this other + * mapping is already gone, the unmap path will have + * set PG_referenced or activated the page. + */ + if (likely(!(vma->vm_flags & VM_SEQ_READ))) + referenced++; + } + pte_unmap_unlock(pte, ptl); + +found: if (referenced) clear_page_idle(page); if (test_and_clear_page_young(page)) @@ -912,7 +959,7 @@ int page_referenced(struct page *page, int ret; int we_locked = 0; struct page_referenced_arg pra = { - .mapcount = page_mapcount(page), + .mapcount = total_mapcount(page), .memcg = memcg, }; struct rmap_walk_control rwc = { diff --git a/mm/util.c b/mm/util.c index 8acb936a52c8..6d1f9200f74e 100644 --- a/mm/util.c +++ b/mm/util.c @@ -407,6 +407,20 @@ struct address_space *page_mapping(struct page *page) return mapping; } +/* Slow path of page_mapcount() for compound pages */ +int __page_mapcount(struct page *page) +{ + int ret; + + ret = atomic_read(&page->_mapcount) + 1; + page = compound_head(page); + ret += atomic_read(compound_mapcount_ptr(page)) + 1; + if (PageDoubleMap(page)) + ret--; + return ret; +} +EXPORT_SYMBOL_GPL(__page_mapcount); + int overcommit_ratio_handler(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) -- cgit v1.2.3 From 260ae3f7db614a5c4aa4b773599f99adc1d9859e Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 15 Jan 2016 16:56:17 -0800 Subject: mm: skip memory block registration for ZONE_DEVICE Prevent userspace from trying and failing to online ZONE_DEVICE pages which are meant to never be onlined. For example on platforms with a udev rule like the following: SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online" ...will generate futile attempts to online the ZONE_DEVICE sections. Example kernel messages: Built 1 zonelists in Node order, mobility grouping on. Total pages: 1004747 Policy zone: Normal online_pages [mem 0x248000000-0x24fffffff] failed Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 13 +++++++++++++ include/linux/mm.h | 12 ++++++++++++ 2 files changed, 25 insertions(+) (limited to 'include/linux/mm.h') diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 619fe584a44c..213456c2b123 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -647,6 +647,13 @@ static int add_memory_block(int base_section_nr) return 0; } +static bool is_zone_device_section(struct mem_section *ms) +{ + struct page *page; + + page = sparse_decode_mem_map(ms->section_mem_map, __section_nr(ms)); + return is_zone_device_page(page); +} /* * need an interface for the VM to add new memory regions, @@ -657,6 +664,9 @@ int register_new_memory(int nid, struct mem_section *section) int ret = 0; struct memory_block *mem; + if (is_zone_device_section(section)) + return 0; + mutex_lock(&mem_sysfs_mutex); mem = find_memory_block(section); @@ -693,6 +703,9 @@ static int remove_memory_section(unsigned long node_id, { struct memory_block *mem; + if (is_zone_device_section(section)) + return 0; + mutex_lock(&mem_sysfs_mutex); mem = find_memory_block(section); unregister_mem_sect_under_nodes(mem, __section_nr(section)); diff --git a/include/linux/mm.h b/include/linux/mm.h index 0ef5f21735af..d9fe12d45c21 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -674,6 +674,18 @@ static inline enum zone_type page_zonenum(const struct page *page) return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK; } +#ifdef CONFIG_ZONE_DEVICE +static inline bool is_zone_device_page(const struct page *page) +{ + return page_zonenum(page) == ZONE_DEVICE; +} +#else +static inline bool is_zone_device_page(const struct page *page) +{ + return false; +} +#endif + #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define SECTION_IN_PAGE_FLAGS #endif -- cgit v1.2.3 From 4b94ffdc4163bae1ec73b6e977ffb7a7da3d06d3 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 15 Jan 2016 16:56:22 -0800 Subject: x86, mm: introduce vmem_altmap to augment vmemmap_populate() In support of providing struct page for large persistent memory capacities, use struct vmem_altmap to change the default policy for allocating memory for the memmap array. The default vmemmap_populate() allocates page table storage area from the page allocator. Given persistent memory capacities relative to DRAM it may not be feasible to store the memmap in 'System Memory'. Instead vmem_altmap represents pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf() requests. Signed-off-by: Dan Williams Reported-by: kbuild test robot Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/mm/init_64.c | 33 ++++++++++++++---- drivers/nvdimm/pmem.c | 6 ++-- include/linux/memory_hotplug.h | 3 +- include/linux/memremap.h | 39 +++++++++++++++++++--- include/linux/mm.h | 9 ++++- kernel/memremap.c | 72 +++++++++++++++++++++++++++++++++++++-- mm/memory_hotplug.c | 67 ++++++++++++++++++++++++++----------- mm/page_alloc.c | 11 +++++- mm/sparse-vmemmap.c | 76 ++++++++++++++++++++++++++++++++++++++++-- mm/sparse.c | 8 +++-- 10 files changed, 282 insertions(+), 42 deletions(-) (limited to 'include/linux/mm.h') diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 8829482d69ec..5488d21123bd 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -714,6 +715,12 @@ static void __meminit free_pagetable(struct page *page, int order) { unsigned long magic; unsigned int nr_pages = 1 << order; + struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page); + + if (altmap) { + vmem_altmap_free(altmap, nr_pages); + return; + } /* bootmem page has reserved flag */ if (PageReserved(page)) { @@ -1017,13 +1024,19 @@ int __ref arch_remove_memory(u64 start, u64 size) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; + struct page *page = pfn_to_page(start_pfn); + struct vmem_altmap *altmap; struct zone *zone; int ret; - zone = page_zone(pfn_to_page(start_pfn)); - kernel_physical_mapping_remove(start, start + size); + /* With altmap the first mapped page is offset from @start */ + altmap = to_vmem_altmap((unsigned long) page); + if (altmap) + page += vmem_altmap_offset(altmap); + zone = page_zone(page); ret = __remove_pages(zone, start_pfn, nr_pages); WARN_ON_ONCE(ret); + kernel_physical_mapping_remove(start, start + size); return ret; } @@ -1235,7 +1248,7 @@ static void __meminitdata *p_start, *p_end; static int __meminitdata node_start; static int __meminit vmemmap_populate_hugepages(unsigned long start, - unsigned long end, int node) + unsigned long end, int node, struct vmem_altmap *altmap) { unsigned long addr; unsigned long next; @@ -1258,7 +1271,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start, if (pmd_none(*pmd)) { void *p; - p = vmemmap_alloc_block_buf(PMD_SIZE, node); + p = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap); if (p) { pte_t entry; @@ -1279,7 +1292,8 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start, addr_end = addr + PMD_SIZE; p_end = p + PMD_SIZE; continue; - } + } else if (altmap) + return -ENOMEM; /* no fallback */ } else if (pmd_large(*pmd)) { vmemmap_verify((pte_t *)pmd, node, addr, next); continue; @@ -1293,11 +1307,16 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start, int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node) { + struct vmem_altmap *altmap = to_vmem_altmap(start); int err; if (cpu_has_pse) - err = vmemmap_populate_hugepages(start, end, node); - else + err = vmemmap_populate_hugepages(start, end, node, altmap); + else if (altmap) { + pr_err_once("%s: no cpu support for altmap allocations\n", + __func__); + err = -ENOMEM; + } else err = vmemmap_populate_basepages(start, end, node); if (!err) sync_global_pgds(start, end - 1, 0); diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 904629b97c4f..be3f8547b702 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -178,7 +178,8 @@ static struct pmem_device *pmem_alloc(struct device *dev, pmem->pfn_flags = PFN_DEV; if (pmem_should_map_pages(dev)) { - pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res); + pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res, + NULL); pmem->pfn_flags |= PFN_MAP; } else pmem->virt_addr = (void __pmem *) devm_memremap(dev, @@ -387,7 +388,8 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns) /* establish pfn range for lookup, and switch to direct map */ pmem = dev_get_drvdata(dev); devm_memunmap(dev, (void __force *) pmem->virt_addr); - pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res); + pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res, + NULL); pmem->pfn_flags |= PFN_MAP; if (IS_ERR(pmem->virt_addr)) { rc = PTR_ERR(pmem->virt_addr); diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 2ea574ff9714..43405992d027 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -275,7 +275,8 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); extern bool is_memblock_offlined(struct memory_block *mem); extern void remove_memory(int nid, u64 start, u64 size); extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn); -extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms); +extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms, + unsigned long map_offset); extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pnum); diff --git a/include/linux/memremap.h b/include/linux/memremap.h index d90721c178bb..aa3e82a80d7b 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -4,21 +4,53 @@ struct resource; struct device; + +/** + * struct vmem_altmap - pre-allocated storage for vmemmap_populate + * @base_pfn: base of the entire dev_pagemap mapping + * @reserve: pages mapped, but reserved for driver use (relative to @base) + * @free: free pages set aside in the mapping for memmap storage + * @align: pages reserved to meet allocation alignments + * @alloc: track pages consumed, private to vmemmap_populate() + */ +struct vmem_altmap { + const unsigned long base_pfn; + const unsigned long reserve; + unsigned long free; + unsigned long align; + unsigned long alloc; +}; + +unsigned long vmem_altmap_offset(struct vmem_altmap *altmap); +void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns); + +#if defined(CONFIG_SPARSEMEM_VMEMMAP) && defined(CONFIG_ZONE_DEVICE) +struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start); +#else +static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start) +{ + return NULL; +} +#endif + /** * struct dev_pagemap - metadata for ZONE_DEVICE mappings + * @altmap: pre-allocated/reserved memory for vmemmap allocations * @dev: host device of the mapping for debug */ struct dev_pagemap { - /* TODO: vmem_altmap and percpu_ref count */ + struct vmem_altmap *altmap; + const struct resource *res; struct device *dev; }; #ifdef CONFIG_ZONE_DEVICE -void *devm_memremap_pages(struct device *dev, struct resource *res); +void *devm_memremap_pages(struct device *dev, struct resource *res, + struct vmem_altmap *altmap); struct dev_pagemap *find_dev_pagemap(resource_size_t phys); #else static inline void *devm_memremap_pages(struct device *dev, - struct resource *res) + struct resource *res, struct vmem_altmap *altmap) { /* * Fail attempts to call devm_memremap_pages() without @@ -34,5 +66,4 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys) return NULL; } #endif - #endif /* _LINUX_MEMREMAP_H_ */ diff --git a/include/linux/mm.h b/include/linux/mm.h index d9fe12d45c21..8bb0907a3603 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2217,7 +2217,14 @@ pud_t *vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node); pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node); pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node); void *vmemmap_alloc_block(unsigned long size, int node); -void *vmemmap_alloc_block_buf(unsigned long size, int node); +struct vmem_altmap; +void *__vmemmap_alloc_block_buf(unsigned long size, int node, + struct vmem_altmap *altmap); +static inline void *vmemmap_alloc_block_buf(unsigned long size, int node) +{ + return __vmemmap_alloc_block_buf(size, node, NULL); +} + void vmemmap_verify(pte_t *, int, unsigned long, unsigned long); int vmemmap_populate_basepages(unsigned long start, unsigned long end, int node); diff --git a/kernel/memremap.c b/kernel/memremap.c index 61cfbf4d3054..562f6471fe90 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -166,6 +166,7 @@ struct page_map { struct resource res; struct percpu_ref *ref; struct dev_pagemap pgmap; + struct vmem_altmap altmap; }; static void pgmap_radix_release(struct resource *res) @@ -183,6 +184,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data) struct page_map *page_map = data; struct resource *res = &page_map->res; resource_size_t align_start, align_size; + struct dev_pagemap *pgmap = &page_map->pgmap; pgmap_radix_release(res); @@ -190,6 +192,8 @@ static void devm_memremap_pages_release(struct device *dev, void *data) align_start = res->start & ~(SECTION_SIZE - 1); align_size = ALIGN(resource_size(res), SECTION_SIZE); arch_remove_memory(align_start, align_size); + dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc, + "%s: failed to free all reserved pages\n", __func__); } /* assumes rcu_read_lock() held at entry */ @@ -203,11 +207,23 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys) return page_map ? &page_map->pgmap : NULL; } -void *devm_memremap_pages(struct device *dev, struct resource *res) +/** + * devm_memremap_pages - remap and provide memmap backing for the given resource + * @dev: hosting device for @res + * @res: "host memory" address range + * @altmap: optional descriptor for allocating the memmap from @res + * + * Note, the expectation is that @res is a host memory range that could + * feasibly be treated as a "System RAM" range, i.e. not a device mmio + * range, but this is not enforced. + */ +void *devm_memremap_pages(struct device *dev, struct resource *res, + struct vmem_altmap *altmap) { int is_ram = region_intersects(res->start, resource_size(res), "System RAM"); resource_size_t key, align_start, align_size; + struct dev_pagemap *pgmap; struct page_map *page_map; int error, nid; @@ -220,14 +236,27 @@ void *devm_memremap_pages(struct device *dev, struct resource *res) if (is_ram == REGION_INTERSECTS) return __va(res->start); + if (altmap && !IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) { + dev_err(dev, "%s: altmap requires CONFIG_SPARSEMEM_VMEMMAP=y\n", + __func__); + return ERR_PTR(-ENXIO); + } + page_map = devres_alloc_node(devm_memremap_pages_release, sizeof(*page_map), GFP_KERNEL, dev_to_node(dev)); if (!page_map) return ERR_PTR(-ENOMEM); + pgmap = &page_map->pgmap; memcpy(&page_map->res, res, sizeof(*res)); - page_map->pgmap.dev = dev; + pgmap->dev = dev; + if (altmap) { + memcpy(&page_map->altmap, altmap, sizeof(*altmap)); + pgmap->altmap = &page_map->altmap; + } + pgmap->res = &page_map->res; + mutex_lock(&pgmap_lock); error = 0; for (key = res->start; key <= res->end; key += SECTION_SIZE) { @@ -273,4 +302,43 @@ void *devm_memremap_pages(struct device *dev, struct resource *res) return ERR_PTR(error); } EXPORT_SYMBOL(devm_memremap_pages); + +unsigned long vmem_altmap_offset(struct vmem_altmap *altmap) +{ + /* number of pfns from base where pfn_to_page() is valid */ + return altmap->reserve + altmap->free; +} + +void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns) +{ + altmap->alloc -= nr_pfns; +} + +#ifdef CONFIG_SPARSEMEM_VMEMMAP +struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start) +{ + /* + * 'memmap_start' is the virtual address for the first "struct + * page" in this range of the vmemmap array. In the case of + * CONFIG_SPARSE_VMEMMAP a page_to_pfn conversion is simple + * pointer arithmetic, so we can perform this to_vmem_altmap() + * conversion without concern for the initialization state of + * the struct page fields. + */ + struct page *page = (struct page *) memmap_start; + struct dev_pagemap *pgmap; + + /* + * Uncoditionally retrieve a dev_pagemap associated with the + * given physical address, this is only for use in the + * arch_{add|remove}_memory() for setting up and tearing down + * the memmap. + */ + rcu_read_lock(); + pgmap = find_dev_pagemap(__pfn_to_phys(page_to_pfn(page))); + rcu_read_unlock(); + + return pgmap ? pgmap->altmap : NULL; +} +#endif /* CONFIG_SPARSEMEM_VMEMMAP */ #endif /* CONFIG_ZONE_DEVICE */ diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 92f95952692b..4af58a3a8ffa 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -506,10 +507,25 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn, unsigned long i; int err = 0; int start_sec, end_sec; + struct vmem_altmap *altmap; + /* during initialize mem_map, align hot-added range to section */ start_sec = pfn_to_section_nr(phys_start_pfn); end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1); + altmap = to_vmem_altmap((unsigned long) pfn_to_page(phys_start_pfn)); + if (altmap) { + /* + * Validate altmap is within bounds of the total request + */ + if (altmap->base_pfn != phys_start_pfn + || vmem_altmap_offset(altmap) > nr_pages) { + pr_warn_once("memory add fail, invalid altmap\n"); + return -EINVAL; + } + altmap->alloc = 0; + } + for (i = start_sec; i <= end_sec; i++) { err = __add_section(nid, zone, section_nr_to_pfn(i)); @@ -731,7 +747,8 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn) pgdat_resize_unlock(zone->zone_pgdat, &flags); } -static int __remove_section(struct zone *zone, struct mem_section *ms) +static int __remove_section(struct zone *zone, struct mem_section *ms, + unsigned long map_offset) { unsigned long start_pfn; int scn_nr; @@ -748,7 +765,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms) start_pfn = section_nr_to_pfn(scn_nr); __remove_zone(zone, start_pfn); - sparse_remove_one_section(zone, ms); + sparse_remove_one_section(zone, ms, map_offset); return 0; } @@ -767,9 +784,32 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn, unsigned long nr_pages) { unsigned long i; - int sections_to_remove; - resource_size_t start, size; - int ret = 0; + unsigned long map_offset = 0; + int sections_to_remove, ret = 0; + + /* In the ZONE_DEVICE case device driver owns the memory region */ + if (is_dev_zone(zone)) { + struct page *page = pfn_to_page(phys_start_pfn); + struct vmem_altmap *altmap; + + altmap = to_vmem_altmap((unsigned long) page); + if (altmap) + map_offset = vmem_altmap_offset(altmap); + } else { + resource_size_t start, size; + + start = phys_start_pfn << PAGE_SHIFT; + size = nr_pages * PAGE_SIZE; + + ret = release_mem_region_adjustable(&iomem_resource, start, + size); + if (ret) { + resource_size_t endres = start + size - 1; + + pr_warn("Unable to release resource <%pa-%pa> (%d)\n", + &start, &endres, ret); + } + } /* * We can only remove entire sections @@ -777,23 +817,12 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn, BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK); BUG_ON(nr_pages % PAGES_PER_SECTION); - start = phys_start_pfn << PAGE_SHIFT; - size = nr_pages * PAGE_SIZE; - - /* in the ZONE_DEVICE case device driver owns the memory region */ - if (!is_dev_zone(zone)) - ret = release_mem_region_adjustable(&iomem_resource, start, size); - if (ret) { - resource_size_t endres = start + size - 1; - - pr_warn("Unable to release resource <%pa-%pa> (%d)\n", - &start, &endres, ret); - } - sections_to_remove = nr_pages / PAGES_PER_SECTION; for (i = 0; i < sections_to_remove; i++) { unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION; - ret = __remove_section(zone, __pfn_to_section(pfn)); + + ret = __remove_section(zone, __pfn_to_section(pfn), map_offset); + map_offset = 0; if (ret) break; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 25409714160e..7ca3fe6ef92d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -43,6 +43,7 @@ #include #include #include +#include #include #include #include @@ -4485,8 +4486,9 @@ static inline unsigned long wait_table_bits(unsigned long size) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { - pg_data_t *pgdat = NODE_DATA(nid); + struct vmem_altmap *altmap = to_vmem_altmap(__pfn_to_phys(start_pfn)); unsigned long end_pfn = start_pfn + size; + pg_data_t *pgdat = NODE_DATA(nid); unsigned long pfn; struct zone *z; unsigned long nr_initialised = 0; @@ -4494,6 +4496,13 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (highest_memmap_pfn < end_pfn - 1) highest_memmap_pfn = end_pfn - 1; + /* + * Honor reservation requested by the driver for this ZONE_DEVICE + * memory + */ + if (altmap && start_pfn == altmap->base_pfn) + start_pfn += altmap->reserve; + z = &pgdat->node_zones[zone]; for (pfn = start_pfn; pfn < end_pfn; pfn++) { /* diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index 4cba9c2783a1..b60802b3e5ea 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -70,7 +71,7 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node) } /* need to make sure size is all the same during early stage */ -void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node) +static void * __meminit alloc_block_buf(unsigned long size, int node) { void *ptr; @@ -87,6 +88,77 @@ void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node) return ptr; } +static unsigned long __meminit vmem_altmap_next_pfn(struct vmem_altmap *altmap) +{ + return altmap->base_pfn + altmap->reserve + altmap->alloc + + altmap->align; +} + +static unsigned long __meminit vmem_altmap_nr_free(struct vmem_altmap *altmap) +{ + unsigned long allocated = altmap->alloc + altmap->align; + + if (altmap->free > allocated) + return altmap->free - allocated; + return 0; +} + +/** + * vmem_altmap_alloc - allocate pages from the vmem_altmap reservation + * @altmap - reserved page pool for the allocation + * @nr_pfns - size (in pages) of the allocation + * + * Allocations are aligned to the size of the request + */ +static unsigned long __meminit vmem_altmap_alloc(struct vmem_altmap *altmap, + unsigned long nr_pfns) +{ + unsigned long pfn = vmem_altmap_next_pfn(altmap); + unsigned long nr_align; + + nr_align = 1UL << find_first_bit(&nr_pfns, BITS_PER_LONG); + nr_align = ALIGN(pfn, nr_align) - pfn; + + if (nr_pfns + nr_align > vmem_altmap_nr_free(altmap)) + return ULONG_MAX; + altmap->alloc += nr_pfns; + altmap->align += nr_align; + return pfn + nr_align; +} + +static void * __meminit altmap_alloc_block_buf(unsigned long size, + struct vmem_altmap *altmap) +{ + unsigned long pfn, nr_pfns; + void *ptr; + + if (size & ~PAGE_MASK) { + pr_warn_once("%s: allocations must be multiple of PAGE_SIZE (%ld)\n", + __func__, size); + return NULL; + } + + nr_pfns = size >> PAGE_SHIFT; + pfn = vmem_altmap_alloc(altmap, nr_pfns); + if (pfn < ULONG_MAX) + ptr = __va(__pfn_to_phys(pfn)); + else + ptr = NULL; + pr_debug("%s: pfn: %#lx alloc: %ld align: %ld nr: %#lx\n", + __func__, pfn, altmap->alloc, altmap->align, nr_pfns); + + return ptr; +} + +/* need to make sure size is all the same during early stage */ +void * __meminit __vmemmap_alloc_block_buf(unsigned long size, int node, + struct vmem_altmap *altmap) +{ + if (altmap) + return altmap_alloc_block_buf(size, altmap); + return alloc_block_buf(size, node); +} + void __meminit vmemmap_verify(pte_t *pte, int node, unsigned long start, unsigned long end) { @@ -103,7 +175,7 @@ pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node) pte_t *pte = pte_offset_kernel(pmd, addr); if (pte_none(*pte)) { pte_t entry; - void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node); + void *p = alloc_block_buf(PAGE_SIZE, node); if (!p) return NULL; entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL); diff --git a/mm/sparse.c b/mm/sparse.c index d1b48b691ac8..3717ceed4177 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -748,7 +748,7 @@ static void clear_hwpoisoned_pages(struct page *memmap, int nr_pages) if (!memmap) return; - for (i = 0; i < PAGES_PER_SECTION; i++) { + for (i = 0; i < nr_pages; i++) { if (PageHWPoison(&memmap[i])) { atomic_long_sub(1, &num_poisoned_pages); ClearPageHWPoison(&memmap[i]); @@ -788,7 +788,8 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap) free_map_bootmem(memmap); } -void sparse_remove_one_section(struct zone *zone, struct mem_section *ms) +void sparse_remove_one_section(struct zone *zone, struct mem_section *ms, + unsigned long map_offset) { struct page *memmap = NULL; unsigned long *usemap = NULL, flags; @@ -804,7 +805,8 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms) } pgdat_resize_unlock(pgdat, &flags); - clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION); + clear_hwpoisoned_pages(memmap + map_offset, + PAGES_PER_SECTION - map_offset); free_section_usemap(memmap, usemap); } #endif /* CONFIG_MEMORY_HOTREMOVE */ -- cgit v1.2.3 From 01c8f1c44b83a0825b573e7c723b033cece37b86 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 15 Jan 2016 16:56:40 -0800 Subject: mm, dax, gpu: convert vm_insert_mixed to pfn_t Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of evaluating the PFN_MAP and PFN_DEV flags. When both are set it triggers _PAGE_DEVMAP to be set in the resulting pte. There are no functional changes to the gpu drivers as a result of this conversion. Signed-off-by: Dan Williams Cc: Dave Hansen Cc: David Airlie Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/include/asm/pgtable.h | 5 +++++ drivers/gpu/drm/exynos/exynos_drm_gem.c | 4 +++- drivers/gpu/drm/gma500/framebuffer.c | 4 +++- drivers/gpu/drm/msm/msm_gem.c | 4 +++- drivers/gpu/drm/omapdrm/omap_gem.c | 7 +++++-- drivers/gpu/drm/ttm/ttm_bo_vm.c | 4 +++- fs/dax.c | 2 +- include/linux/mm.h | 2 +- include/linux/pfn_t.h | 27 +++++++++++++++++++++++++++ mm/memory.c | 16 ++++++++++------ 10 files changed, 61 insertions(+), 14 deletions(-) (limited to 'include/linux/mm.h') diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 4973cc9eacce..4c668f15a532 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -247,6 +247,11 @@ static inline pte_t pte_mkspecial(pte_t pte) return pte_set_flags(pte, _PAGE_SPECIAL); } +static inline pte_t pte_mkdevmap(pte_t pte) +{ + return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP); +} + static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set) { pmdval_t v = native_pmd_val(pmd); diff --git a/drivers/gpu/drm/exynos/exynos_drm_gem.c b/drivers/gpu/drm/exynos/exynos_drm_gem.c index 252eb301470c..32358c5e3db4 100644 --- a/drivers/gpu/drm/exynos/exynos_drm_gem.c +++ b/drivers/gpu/drm/exynos/exynos_drm_gem.c @@ -14,6 +14,7 @@ #include #include +#include #include #include "exynos_drm_drv.h" @@ -490,7 +491,8 @@ int exynos_drm_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) } pfn = page_to_pfn(exynos_gem->pages[page_offset]); - ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn); + ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, + __pfn_to_pfn_t(pfn, PFN_DEV)); out: switch (ret) { diff --git a/drivers/gpu/drm/gma500/framebuffer.c b/drivers/gpu/drm/gma500/framebuffer.c index 2eaf1b31c7bd..72bc979fa0dc 100644 --- a/drivers/gpu/drm/gma500/framebuffer.c +++ b/drivers/gpu/drm/gma500/framebuffer.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -132,7 +133,8 @@ static int psbfb_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf) for (i = 0; i < page_num; i++) { pfn = (phys_addr >> PAGE_SHIFT); - ret = vm_insert_mixed(vma, address, pfn); + ret = vm_insert_mixed(vma, address, + __pfn_to_pfn_t(pfn, PFN_DEV)); if (unlikely((ret == -EBUSY) || (ret != 0 && i > 0))) break; else if (unlikely(ret != 0)) { diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c index c76cc853b08a..3cedb8d5c855 100644 --- a/drivers/gpu/drm/msm/msm_gem.c +++ b/drivers/gpu/drm/msm/msm_gem.c @@ -18,6 +18,7 @@ #include #include #include +#include #include "msm_drv.h" #include "msm_gem.h" @@ -222,7 +223,8 @@ int msm_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) VERB("Inserting %p pfn %lx, pa %lx", vmf->virtual_address, pfn, pfn << PAGE_SHIFT); - ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn); + ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, + __pfn_to_pfn_t(pfn, PFN_DEV)); out_unlock: mutex_unlock(&dev->struct_mutex); diff --git a/drivers/gpu/drm/omapdrm/omap_gem.c b/drivers/gpu/drm/omapdrm/omap_gem.c index 7ed08fdc4c42..ceba5459ceb7 100644 --- a/drivers/gpu/drm/omapdrm/omap_gem.c +++ b/drivers/gpu/drm/omapdrm/omap_gem.c @@ -19,6 +19,7 @@ #include #include +#include #include @@ -385,7 +386,8 @@ static int fault_1d(struct drm_gem_object *obj, VERB("Inserting %p pfn %lx, pa %lx", vmf->virtual_address, pfn, pfn << PAGE_SHIFT); - return vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn); + return vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, + __pfn_to_pfn_t(pfn, PFN_DEV)); } /* Special handling for the case of faulting in 2d tiled buffers */ @@ -478,7 +480,8 @@ static int fault_2d(struct drm_gem_object *obj, pfn, pfn << PAGE_SHIFT); for (i = n; i > 0; i--) { - vm_insert_mixed(vma, (unsigned long)vaddr, pfn); + vm_insert_mixed(vma, (unsigned long)vaddr, + __pfn_to_pfn_t(pfn, PFN_DEV)); pfn += usergart[fmt].stride_pfn; vaddr += PAGE_SIZE * m; } diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c index 8fb7213277cc..06d26dc438b2 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -229,7 +230,8 @@ static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf) } if (vma->vm_flags & VM_MIXEDMAP) - ret = vm_insert_mixed(&cvma, address, pfn); + ret = vm_insert_mixed(&cvma, address, + __pfn_to_pfn_t(pfn, PFN_DEV)); else ret = vm_insert_pfn(&cvma, address, pfn); diff --git a/fs/dax.c b/fs/dax.c index 6b13d6cd9a9a..574763eed8a3 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -363,7 +363,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, } dax_unmap_atomic(bdev, &dax); - error = vm_insert_mixed(vma, vaddr, pfn_t_to_pfn(dax.pfn)); + error = vm_insert_mixed(vma, vaddr, dax.pfn); out: i_mmap_unlock_read(mapping); diff --git a/include/linux/mm.h b/include/linux/mm.h index 8bb0907a3603..a9902152449f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2107,7 +2107,7 @@ int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn); + pfn_t pfn); int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len); diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h index c557a0e0b20c..bdaa275d7623 100644 --- a/include/linux/pfn_t.h +++ b/include/linux/pfn_t.h @@ -64,4 +64,31 @@ static inline pfn_t page_to_pfn_t(struct page *page) { return pfn_to_pfn_t(page_to_pfn(page)); } + +static inline int pfn_t_valid(pfn_t pfn) +{ + return pfn_valid(pfn_t_to_pfn(pfn)); +} + +#ifdef CONFIG_MMU +static inline pte_t pfn_t_pte(pfn_t pfn, pgprot_t pgprot) +{ + return pfn_pte(pfn_t_to_pfn(pfn), pgprot); +} +#endif + +#ifdef __HAVE_ARCH_PTE_DEVMAP +static inline bool pfn_t_devmap(pfn_t pfn) +{ + const unsigned long flags = PFN_DEV|PFN_MAP; + + return (pfn.val & flags) == flags; +} +#else +static inline bool pfn_t_devmap(pfn_t pfn) +{ + return false; +} +pte_t pte_mkdevmap(pte_t pte); +#endif #endif /* _LINUX_PFN_T_H_ */ diff --git a/mm/memory.c b/mm/memory.c index 5a73c6ed8e5c..7f03652723ea 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -50,6 +50,7 @@ #include #include #include +#include #include #include #include @@ -1500,7 +1501,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, EXPORT_SYMBOL(vm_insert_page); static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn, pgprot_t prot) + pfn_t pfn, pgprot_t prot) { struct mm_struct *mm = vma->vm_mm; int retval; @@ -1516,7 +1517,10 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, goto out_unlock; /* Ok, finally just insert the thing.. */ - entry = pte_mkspecial(pfn_pte(pfn, prot)); + if (pfn_t_devmap(pfn)) + entry = pte_mkdevmap(pfn_t_pte(pfn, prot)); + else + entry = pte_mkspecial(pfn_t_pte(pfn, prot)); set_pte_at(mm, addr, pte, entry); update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */ @@ -1566,14 +1570,14 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, if (track_pfn_insert(vma, &pgprot, pfn)) return -EINVAL; - ret = insert_pfn(vma, addr, pfn, pgprot); + ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot); return ret; } EXPORT_SYMBOL(vm_insert_pfn); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn) + pfn_t pfn) { BUG_ON(!(vma->vm_flags & VM_MIXEDMAP)); @@ -1587,10 +1591,10 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP * without pte special, it would there be refcounted as a normal page. */ - if (!HAVE_PTE_SPECIAL && pfn_valid(pfn)) { + if (!HAVE_PTE_SPECIAL && pfn_t_valid(pfn)) { struct page *page; - page = pfn_to_page(pfn); + page = pfn_t_to_page(pfn); return insert_page(vma, addr, page, vma->vm_page_prot); } return insert_pfn(vma, addr, pfn, vma->vm_page_prot); -- cgit v1.2.3 From 5c7fb56e5e3f7035dd798a8e1adee639f87043e5 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 15 Jan 2016 16:56:52 -0800 Subject: mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd A dax-huge-page mapping while it uses some thp helpers is ultimately not a transparent huge page. The distinction is especially important in the get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds from pmd_huge() and pmd_trans_huge() which have slightly different semantics. Explicitly mark the pmd_trans_huge() helpers that dax needs by adding pmd_devmap() checks. [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()] Signed-off-by: Dan Williams Cc: Dave Hansen Cc: Mel Gorman Cc: Peter Zijlstra Cc: Andrea Arcangeli Cc: Matthew Wilcox Signed-off-by: Kirill A. Shutemov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/include/asm/pgtable.h | 9 ++++++++- include/linux/huge_mm.h | 5 +++-- include/linux/mm.h | 7 +++++++ mm/huge_memory.c | 38 +++++++++++++++++++++----------------- mm/memory.c | 8 ++++---- mm/mprotect.c | 5 +++-- mm/pgtable-generic.c | 2 +- 7 files changed, 47 insertions(+), 27 deletions(-) (limited to 'include/linux/mm.h') diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 6585a8b10fea..6a0ad82c8d0f 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -164,13 +164,20 @@ static inline int pmd_large(pmd_t pte) #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline int pmd_trans_huge(pmd_t pmd) { - return pmd_val(pmd) & _PAGE_PSE; + return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE; } static inline int has_transparent_hugepage(void) { return cpu_has_pse; } + +#ifdef __HAVE_ARCH_PTE_DEVMAP +static inline int pmd_devmap(pmd_t pmd) +{ + return !!(pmd_val(pmd) & _PAGE_DEVMAP); +} +#endif #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static inline pte_t pte_set_flags(pte_t pte, pteval_t set) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 8ca35a131904..d39fa60bd6bf 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -104,7 +104,8 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, #define split_huge_pmd(__vma, __pmd, __address) \ do { \ pmd_t *____pmd = (__pmd); \ - if (pmd_trans_huge(*____pmd)) \ + if (pmd_trans_huge(*____pmd) \ + || pmd_devmap(*____pmd)) \ __split_huge_pmd(__vma, __pmd, __address); \ } while (0) @@ -124,7 +125,7 @@ static inline bool pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma, spinlock_t **ptl) { VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma); - if (pmd_trans_huge(*pmd)) + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) return __pmd_trans_huge_lock(pmd, vma, ptl); else return false; diff --git a/include/linux/mm.h b/include/linux/mm.h index a9902152449f..cd123272d28d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -329,6 +329,13 @@ struct inode; #define page_private(page) ((page)->private) #define set_page_private(page, v) ((page)->private = (v)) +#if !defined(__HAVE_ARCH_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE) +static inline int pmd_devmap(pmd_t pmd) +{ + return 0; +} +#endif + /* * FIXME: take this include out, include page-flags.h in * files which need it (119 of them) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d93706013a55..82bed2bec3ed 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -995,7 +995,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, ret = -EAGAIN; pmd = *src_pmd; - if (unlikely(!pmd_trans_huge(pmd))) { + if (unlikely(!pmd_trans_huge(pmd) && !pmd_devmap(pmd))) { pte_free(dst_mm, pgtable); goto out_unlock; } @@ -1018,17 +1018,20 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, goto out_unlock; } - src_page = pmd_page(pmd); - VM_BUG_ON_PAGE(!PageHead(src_page), src_page); - get_page(src_page); - page_dup_rmap(src_page, true); - add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); + if (pmd_trans_huge(pmd)) { + /* thp accounting separate from pmd_devmap accounting */ + src_page = pmd_page(pmd); + VM_BUG_ON_PAGE(!PageHead(src_page), src_page); + get_page(src_page); + page_dup_rmap(src_page, true); + add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); + atomic_long_inc(&dst_mm->nr_ptes); + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); + } pmdp_set_wrprotect(src_mm, addr, src_pmd); pmd = pmd_mkold(pmd_wrprotect(pmd)); - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); set_pmd_at(dst_mm, addr, dst_pmd, pmd); - atomic_long_inc(&dst_mm->nr_ptes); ret = 0; out_unlock: @@ -1716,7 +1719,7 @@ bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma, spinlock_t **ptl) { *ptl = pmd_lock(vma->vm_mm, pmd); - if (likely(pmd_trans_huge(*pmd))) + if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd))) return true; spin_unlock(*ptl); return false; @@ -2788,7 +2791,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, VM_BUG_ON(haddr & ~HPAGE_PMD_MASK); VM_BUG_ON_VMA(vma->vm_start > haddr, vma); VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma); - VM_BUG_ON(!pmd_trans_huge(*pmd)); + VM_BUG_ON(!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)); count_vm_event(THP_SPLIT_PMD); @@ -2901,14 +2904,15 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE); ptl = pmd_lock(mm, pmd); - if (unlikely(!pmd_trans_huge(*pmd))) + if (pmd_trans_huge(*pmd)) { + page = pmd_page(*pmd); + if (PageMlocked(page)) + get_page(page); + else + page = NULL; + } else if (!pmd_devmap(*pmd)) goto out; - page = pmd_page(*pmd); __split_huge_pmd_locked(vma, pmd, haddr, false); - if (PageMlocked(page)) - get_page(page); - else - page = NULL; out: spin_unlock(ptl); mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE); @@ -2938,7 +2942,7 @@ static void split_huge_pmd_address(struct vm_area_struct *vma, return; pmd = pmd_offset(pud, address); - if (!pmd_present(*pmd) || !pmd_trans_huge(*pmd)) + if (!pmd_present(*pmd) || (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd))) return; /* * Caller holds the mmap_sem write mode, so a huge pmd cannot diff --git a/mm/memory.c b/mm/memory.c index 552ae3d69435..ff17850a52d9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -950,7 +950,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src src_pmd = pmd_offset(src_pud, addr); do { next = pmd_addr_end(addr, end); - if (pmd_trans_huge(*src_pmd)) { + if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) { int err; VM_BUG_ON(next-addr != HPAGE_PMD_SIZE); err = copy_huge_pmd(dst_mm, src_mm, @@ -1177,7 +1177,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); - if (pmd_trans_huge(*pmd)) { + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { if (next - addr != HPAGE_PMD_SIZE) { #ifdef CONFIG_DEBUG_VM if (!rwsem_is_locked(&tlb->mm->mmap_sem)) { @@ -3375,7 +3375,7 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, int ret; barrier(); - if (pmd_trans_huge(orig_pmd)) { + if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) { unsigned int dirty = flags & FAULT_FLAG_WRITE; if (pmd_protnone(orig_pmd)) @@ -3404,7 +3404,7 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, unlikely(__pte_alloc(mm, vma, pmd, address))) return VM_FAULT_OOM; /* if an huge pmd materialized from under us just retry later */ - if (unlikely(pmd_trans_huge(*pmd))) + if (unlikely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd))) return 0; /* * A regular pmd is established and it can't morph into a huge pmd diff --git a/mm/mprotect.c b/mm/mprotect.c index 6047707085c1..8eb7bb40dc40 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -149,7 +149,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, unsigned long this_pages; next = pmd_addr_end(addr, end); - if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd)) + if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd) + && pmd_none_or_clear_bad(pmd)) continue; /* invoke the mmu notifier if the pmd is populated */ @@ -158,7 +159,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, mmu_notifier_invalidate_range_start(mm, mni_start, end); } - if (pmd_trans_huge(*pmd)) { + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { if (next - addr != HPAGE_PMD_SIZE) split_huge_pmd(vma, pmd, addr); else { diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index c311a2ec6fea..9d4767698a1c 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -132,7 +132,7 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address, { pmd_t pmd; VM_BUG_ON(address & ~HPAGE_PMD_MASK); - VM_BUG_ON(!pmd_trans_huge(*pmdp)); + VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp)); pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp); flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE); return pmd; -- cgit v1.2.3 From 3565fce3a6597e91b8dee3e8e36ebf70f8b7ef9b Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 15 Jan 2016 16:56:55 -0800 Subject: mm, x86: get_user_pages() for dax mappings A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver has established a devm_memremap_pages() mapping, i.e. when the pfn_t return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when encountering _PAGE_DEVMAP during a page table walk we lookup and pin a struct dev_pagemap instance to keep the result of pfn_to_page() valid until put_page(). Signed-off-by: Dan Williams Tested-by: Logan Gunthorpe Cc: Dave Hansen Cc: Mel Gorman Cc: Peter Zijlstra Cc: Andrea Arcangeli Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/include/asm/pgtable.h | 7 ++++ arch/x86/mm/gup.c | 57 ++++++++++++++++++++++++++++++-- include/linux/huge_mm.h | 10 +++++- include/linux/mm.h | 59 +++++++++++++++++++++++---------- kernel/memremap.c | 12 +++++++ mm/gup.c | 30 +++++++++++++++-- mm/huge_memory.c | 75 +++++++++++++++++++++++++++++++++--------- mm/swap.c | 1 + 8 files changed, 212 insertions(+), 39 deletions(-) (limited to 'include/linux/mm.h') diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 6a0ad82c8d0f..0687c4748b8f 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -479,6 +479,13 @@ static inline int pte_present(pte_t a) return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE); } +#ifdef __HAVE_ARCH_PTE_DEVMAP +static inline int pte_devmap(pte_t a) +{ + return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP; +} +#endif + #define pte_accessible pte_accessible static inline bool pte_accessible(struct mm_struct *mm, pte_t a) { diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index f8cb3e8ac250..6d5eb5900372 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -9,6 +9,7 @@ #include #include #include +#include #include @@ -63,6 +64,16 @@ retry: #endif } +static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages) +{ + while ((*nr) - nr_start) { + struct page *page = pages[--(*nr)]; + + ClearPageReferenced(page); + put_page(page); + } +} + /* * The performance critical leaf functions are made noinline otherwise gcc * inlines everything into a single function which results in too much @@ -71,7 +82,9 @@ retry: static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { + struct dev_pagemap *pgmap = NULL; unsigned long mask; + int nr_start = *nr; pte_t *ptep; mask = _PAGE_PRESENT|_PAGE_USER; @@ -89,13 +102,21 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, return 0; } - if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) { + page = pte_page(pte); + if (pte_devmap(pte)) { + pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); + if (unlikely(!pgmap)) { + undo_dev_pagemap(nr, nr_start, pages); + pte_unmap(ptep); + return 0; + } + } else if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) { pte_unmap(ptep); return 0; } VM_BUG_ON(!pfn_valid(pte_pfn(pte))); - page = pte_page(pte); get_page(page); + put_dev_pagemap(pgmap); SetPageReferenced(page); pages[*nr] = page; (*nr)++; @@ -114,6 +135,32 @@ static inline void get_head_page_multiple(struct page *page, int nr) SetPageReferenced(page); } +static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr, + unsigned long end, struct page **pages, int *nr) +{ + int nr_start = *nr; + unsigned long pfn = pmd_pfn(pmd); + struct dev_pagemap *pgmap = NULL; + + pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT; + do { + struct page *page = pfn_to_page(pfn); + + pgmap = get_dev_pagemap(pfn, pgmap); + if (unlikely(!pgmap)) { + undo_dev_pagemap(nr, nr_start, pages); + return 0; + } + SetPageReferenced(page); + pages[*nr] = page; + get_page(page); + put_dev_pagemap(pgmap); + (*nr)++; + pfn++; + } while (addr += PAGE_SIZE, addr != end); + return 1; +} + static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { @@ -126,9 +173,13 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, mask |= _PAGE_RW; if ((pmd_flags(pmd) & mask) != mask) return 0; + + VM_BUG_ON(!pfn_valid(pmd_pfn(pmd))); + if (pmd_devmap(pmd)) + return __gup_device_huge_pmd(pmd, addr, end, pages, nr); + /* hugepages are never "special" */ VM_BUG_ON(pmd_flags(pmd) & _PAGE_SPECIAL); - VM_BUG_ON(!pfn_valid(pmd_pfn(pmd))); refs = 0; head = pmd_page(pmd); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index d39fa60bd6bf..cfe81e10bd54 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -38,7 +38,6 @@ extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, int prot_numa); int vmf_insert_pfn_pmd(struct vm_area_struct *, unsigned long addr, pmd_t *, pfn_t pfn, bool write); - enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, @@ -55,6 +54,9 @@ enum transparent_hugepage_flag { #define HPAGE_PMD_NR (1< #include #include +#include #include #include #include @@ -465,17 +466,6 @@ static inline int page_count(struct page *page) return atomic_read(&compound_head(page)->_count); } -static inline void get_page(struct page *page) -{ - page = compound_head(page); - /* - * Getting a normal page or the head of a compound page - * requires to already have an elevated page->_count. - */ - VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page); - atomic_inc(&page->_count); -} - static inline struct page *virt_to_head_page(const void *x) { struct page *page = virt_to_page(x); @@ -494,13 +484,6 @@ static inline void init_page_count(struct page *page) void __put_page(struct page *page); -static inline void put_page(struct page *page) -{ - page = compound_head(page); - if (put_page_testzero(page)) - __put_page(page); -} - void put_pages_list(struct list_head *pages); void split_page(struct page *page, unsigned int order); @@ -682,17 +665,50 @@ static inline enum zone_type page_zonenum(const struct page *page) } #ifdef CONFIG_ZONE_DEVICE +void get_zone_device_page(struct page *page); +void put_zone_device_page(struct page *page); static inline bool is_zone_device_page(const struct page *page) { return page_zonenum(page) == ZONE_DEVICE; } #else +static inline void get_zone_device_page(struct page *page) +{ +} +static inline void put_zone_device_page(struct page *page) +{ +} static inline bool is_zone_device_page(const struct page *page) { return false; } #endif +static inline void get_page(struct page *page) +{ + page = compound_head(page); + /* + * Getting a normal page or the head of a compound page + * requires to already have an elevated page->_count. + */ + VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page); + atomic_inc(&page->_count); + + if (unlikely(is_zone_device_page(page))) + get_zone_device_page(page); +} + +static inline void put_page(struct page *page) +{ + page = compound_head(page); + + if (put_page_testzero(page)) + __put_page(page); + + if (unlikely(is_zone_device_page(page))) + put_zone_device_page(page); +} + #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define SECTION_IN_PAGE_FLAGS #endif @@ -1444,6 +1460,13 @@ static inline void sync_mm_rss(struct mm_struct *mm) } #endif +#ifndef __HAVE_ARCH_PTE_DEVMAP +static inline int pte_devmap(pte_t pte) +{ + return 0; +} +#endif + int vma_wants_writenotify(struct vm_area_struct *vma); extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr, diff --git a/kernel/memremap.c b/kernel/memremap.c index 3eb8944265d5..e517a16cb426 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -169,6 +169,18 @@ struct page_map { struct vmem_altmap altmap; }; +void get_zone_device_page(struct page *page) +{ + percpu_ref_get(page->pgmap->ref); +} +EXPORT_SYMBOL(get_zone_device_page); + +void put_zone_device_page(struct page *page) +{ + put_dev_pagemap(page->pgmap); +} +EXPORT_SYMBOL(put_zone_device_page); + static void pgmap_radix_release(struct resource *res) { resource_size_t key; diff --git a/mm/gup.c b/mm/gup.c index e95b0cb6ed81..aa21c4b865a5 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -4,6 +4,7 @@ #include #include +#include #include #include #include @@ -62,6 +63,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, unsigned int flags) { struct mm_struct *mm = vma->vm_mm; + struct dev_pagemap *pgmap = NULL; struct page *page; spinlock_t *ptl; pte_t *ptep, pte; @@ -98,7 +100,17 @@ retry: } page = vm_normal_page(vma, address, pte); - if (unlikely(!page)) { + if (!page && pte_devmap(pte) && (flags & FOLL_GET)) { + /* + * Only return device mapping pages in the FOLL_GET case since + * they are only valid while holding the pgmap reference. + */ + pgmap = get_dev_pagemap(pte_pfn(pte), NULL); + if (pgmap) + page = pte_page(pte); + else + goto no_page; + } else if (unlikely(!page)) { if (flags & FOLL_DUMP) { /* Avoid special (like zero) pages in core dumps */ page = ERR_PTR(-EFAULT); @@ -129,8 +141,15 @@ retry: goto retry; } - if (flags & FOLL_GET) + if (flags & FOLL_GET) { get_page(page); + + /* drop the pgmap reference now that we hold the page */ + if (pgmap) { + put_dev_pagemap(pgmap); + pgmap = NULL; + } + } if (flags & FOLL_TOUCH) { if ((flags & FOLL_WRITE) && !pte_dirty(pte) && !PageDirty(page)) @@ -237,6 +256,13 @@ struct page *follow_page_mask(struct vm_area_struct *vma, } if ((flags & FOLL_NUMA) && pmd_protnone(*pmd)) return no_page_table(vma, flags); + if (pmd_devmap(*pmd)) { + ptl = pmd_lock(mm, pmd); + page = follow_devmap_pmd(vma, address, pmd, flags); + spin_unlock(ptl); + if (page) + return page; + } if (likely(!pmd_trans_huge(*pmd))) return follow_page_pte(vma, address, pmd, flags); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 82bed2bec3ed..b2db98136af9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -974,6 +975,63 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, return VM_FAULT_NOPAGE; } +static void touch_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t *pmd) +{ + pmd_t _pmd; + + /* + * We should set the dirty bit only for FOLL_WRITE but for now + * the dirty bit in the pmd is meaningless. And if the dirty + * bit will become meaningful and we'll only set it with + * FOLL_WRITE, an atomic set_bit will be required on the pmd to + * set the young bit, instead of the current set_pmd_at. + */ + _pmd = pmd_mkyoung(pmd_mkdirty(*pmd)); + if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, + pmd, _pmd, 1)) + update_mmu_cache_pmd(vma, addr, pmd); +} + +struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t *pmd, int flags) +{ + unsigned long pfn = pmd_pfn(*pmd); + struct mm_struct *mm = vma->vm_mm; + struct dev_pagemap *pgmap; + struct page *page; + + assert_spin_locked(pmd_lockptr(mm, pmd)); + + if (flags & FOLL_WRITE && !pmd_write(*pmd)) + return NULL; + + if (pmd_present(*pmd) && pmd_devmap(*pmd)) + /* pass */; + else + return NULL; + + if (flags & FOLL_TOUCH) + touch_pmd(vma, addr, pmd); + + /* + * device mapped pages can only be returned if the + * caller will manage the page reference count. + */ + if (!(flags & FOLL_GET)) + return ERR_PTR(-EEXIST); + + pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT; + pgmap = get_dev_pagemap(pfn, NULL); + if (!pgmap) + return ERR_PTR(-EFAULT); + page = pfn_to_page(pfn); + get_page(page); + put_dev_pagemap(pgmap); + + return page; +} + int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *vma) @@ -1331,21 +1389,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, page = pmd_page(*pmd); VM_BUG_ON_PAGE(!PageHead(page), page); - if (flags & FOLL_TOUCH) { - pmd_t _pmd; - /* - * We should set the dirty bit only for FOLL_WRITE but - * for now the dirty bit in the pmd is meaningless. - * And if the dirty bit will become meaningful and - * we'll only set it with FOLL_WRITE, an atomic - * set_bit will be required on the pmd to set the - * young bit, instead of the current set_pmd_at. - */ - _pmd = pmd_mkyoung(pmd_mkdirty(*pmd)); - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, - pmd, _pmd, 1)) - update_mmu_cache_pmd(vma, addr, pmd); - } + if (flags & FOLL_TOUCH) + touch_pmd(vma, addr, pmd); if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { /* * We don't mlock() pte-mapped THPs. This way we can avoid diff --git a/mm/swap.c b/mm/swap.c index 674e2c93da4e..09fe5e97714a 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include -- cgit v1.2.3 From 4a9e1cda274893eca7d178d7dc265503ccb9d87a Mon Sep 17 00:00:00 2001 From: Dominik Dingel Date: Fri, 15 Jan 2016 16:57:04 -0800 Subject: mm: bring in additional flag for fixup_user_fault to signal unlock During Jason's work with postcopy migration support for s390 a problem regarding gmap faults was discovered. The gmap code will call fixup_user_fault which will end up always in handle_mm_fault. Till now we never cared about retries, but as the userfaultfd code kind of relies on it. this needs some fix. This patchset does not take care of the futex code. I will now look closer at this. This patch (of 2): With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the faulting we ever unlocked mmap_sem. This patch brings in the logic to handle retries as well as it cleans up the current documentation. fixup_user_fault was not having the same semantics as filemap_fault. It never indicated if a retry happened and so a caller wasn't able to handle that case. So we now changed the behaviour to always retry a locked mmap_sem. Signed-off-by: Dominik Dingel Reviewed-by: Andrea Arcangeli Cc: "Kirill A. Shutemov" Cc: Martin Schwidefsky Cc: Christian Borntraeger Cc: "Jason J. Herne" Cc: David Rientjes Cc: Eric B Munson Cc: Naoya Horiguchi Cc: Mel Gorman Cc: Heiko Carstens Cc: Dominik Dingel Cc: Paolo Bonzini Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/s390/mm/pgtable.c | 8 +++++--- include/linux/mm.h | 5 +++-- kernel/futex.c | 2 +- mm/gup.c | 30 +++++++++++++++++++++++++----- 4 files changed, 34 insertions(+), 11 deletions(-) (limited to 'include/linux/mm.h') diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index 4e54492f463a..84bddda8d412 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -585,7 +585,7 @@ int gmap_fault(struct gmap *gmap, unsigned long gaddr, rc = vmaddr; goto out_up; } - if (fixup_user_fault(current, gmap->mm, vmaddr, fault_flags)) { + if (fixup_user_fault(current, gmap->mm, vmaddr, fault_flags, NULL)) { rc = -EFAULT; goto out_up; } @@ -727,7 +727,8 @@ int gmap_ipte_notify(struct gmap *gmap, unsigned long gaddr, unsigned long len) break; } /* Get the page mapped */ - if (fixup_user_fault(current, gmap->mm, addr, FAULT_FLAG_WRITE)) { + if (fixup_user_fault(current, gmap->mm, addr, FAULT_FLAG_WRITE, + NULL)) { rc = -EFAULT; break; } @@ -802,7 +803,8 @@ retry: if (!(pte_val(*ptep) & _PAGE_INVALID) && (pte_val(*ptep) & _PAGE_PROTECT)) { pte_unmap_unlock(ptep, ptl); - if (fixup_user_fault(current, mm, addr, FAULT_FLAG_WRITE)) { + if (fixup_user_fault(current, mm, addr, FAULT_FLAG_WRITE, + NULL)) { up_read(&mm->mmap_sem); return -EFAULT; } diff --git a/include/linux/mm.h b/include/linux/mm.h index 792f2469c142..1d6ec55d8b25 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1194,7 +1194,8 @@ int invalidate_inode_page(struct page *page); extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, unsigned int flags); extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, - unsigned long address, unsigned int fault_flags); + unsigned long address, unsigned int fault_flags, + bool *unlocked); #else static inline int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, @@ -1206,7 +1207,7 @@ static inline int handle_mm_fault(struct mm_struct *mm, } static inline int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, unsigned long address, - unsigned int fault_flags) + unsigned int fault_flags, bool *unlocked) { /* should never happen if there's no MMU */ BUG(); diff --git a/kernel/futex.c b/kernel/futex.c index eed92a8a4c49..c6f514573b28 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -604,7 +604,7 @@ static int fault_in_user_writeable(u32 __user *uaddr) down_read(&mm->mmap_sem); ret = fixup_user_fault(current, mm, (unsigned long)uaddr, - FAULT_FLAG_WRITE); + FAULT_FLAG_WRITE, NULL); up_read(&mm->mmap_sem); return ret < 0 ? ret : 0; diff --git a/mm/gup.c b/mm/gup.c index aa21c4b865a5..b64a36175884 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -618,6 +618,8 @@ EXPORT_SYMBOL(__get_user_pages); * @mm: mm_struct of target mm * @address: user address * @fault_flags:flags to pass down to handle_mm_fault() + * @unlocked: did we unlock the mmap_sem while retrying, maybe NULL if caller + * does not allow retry * * This is meant to be called in the specific scenario where for locking reasons * we try to access user memory in atomic context (within a pagefault_disable() @@ -629,22 +631,28 @@ EXPORT_SYMBOL(__get_user_pages); * The main difference with get_user_pages() is that this function will * unconditionally call handle_mm_fault() which will in turn perform all the * necessary SW fixup of the dirty and young bits in the PTE, while - * handle_mm_fault() only guarantees to update these in the struct page. + * get_user_pages() only guarantees to update these in the struct page. * * This is important for some architectures where those bits also gate the * access permission to the page because they are maintained in software. On * such architectures, gup() will not be enough to make a subsequent access * succeed. * - * This has the same semantics wrt the @mm->mmap_sem as does filemap_fault(). + * This function will not return with an unlocked mmap_sem. So it has not the + * same semantics wrt the @mm->mmap_sem as does filemap_fault(). */ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, - unsigned long address, unsigned int fault_flags) + unsigned long address, unsigned int fault_flags, + bool *unlocked) { struct vm_area_struct *vma; vm_flags_t vm_flags; - int ret; + int ret, major = 0; + + if (unlocked) + fault_flags |= FAULT_FLAG_ALLOW_RETRY; +retry: vma = find_extend_vma(mm, address); if (!vma || address < vma->vm_start) return -EFAULT; @@ -654,6 +662,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, return -EFAULT; ret = handle_mm_fault(mm, vma, address, fault_flags); + major |= ret & VM_FAULT_MAJOR; if (ret & VM_FAULT_ERROR) { if (ret & VM_FAULT_OOM) return -ENOMEM; @@ -663,8 +672,19 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, return -EFAULT; BUG(); } + + if (ret & VM_FAULT_RETRY) { + down_read(&mm->mmap_sem); + if (!(fault_flags & FAULT_FLAG_TRIED)) { + *unlocked = true; + fault_flags &= ~FAULT_FLAG_ALLOW_RETRY; + fault_flags |= FAULT_FLAG_TRIED; + goto retry; + } + } + if (tsk) { - if (ret & VM_FAULT_MAJOR) + if (major) tsk->maj_flt++; else tsk->min_flt++; -- cgit v1.2.3 From 7f43add451d2a0d235074b72d254ae266a6a023f Mon Sep 17 00:00:00 2001 From: Wang Xiaoqiang Date: Fri, 15 Jan 2016 16:57:22 -0800 Subject: mm/mlock.c: change can_do_mlock return value type to boolean Since can_do_mlock only return 1 or 0, so make it boolean. No functional change. [akpm@linux-foundation.org: update declaration in mm.h] Signed-off-by: Wang Xiaoqiang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 +- mm/mlock.c | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 1d6ec55d8b25..f1cd22f2df1a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1100,7 +1100,7 @@ static inline bool shmem_mapping(struct address_space *mapping) } #endif -extern int can_do_mlock(void); +extern bool can_do_mlock(void); extern int user_shm_lock(size_t, struct user_struct *); extern void user_shm_unlock(size_t, struct user_struct *); diff --git a/mm/mlock.c b/mm/mlock.c index 9197b6721a1e..e1e2b1207bf2 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -24,13 +24,13 @@ #include "internal.h" -int can_do_mlock(void) +bool can_do_mlock(void) { if (rlimit(RLIMIT_MEMLOCK) != 0) - return 1; + return true; if (capable(CAP_IPC_LOCK)) - return 1; - return 0; + return true; + return false; } EXPORT_SYMBOL(can_do_mlock); -- cgit v1.2.3