summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2019-09-25z3fold: fix memory leak in kmem cacheVitaly Wool1-6/+9
Currently there is a leak in init_z3fold_page() -- it allocates handles from kmem cache even for headless pages, but then they are never used and never freed, so eventually kmem cache may get exhausted. This patch provides a fix for that. Link: http://lkml.kernel.org/r/20190917185352.44cf285d3ebd9e64548de5de@gmail.com Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Reported-by: Markus Linnala <markus.linnala@gmail.com> Tested-by: Markus Linnala <markus.linnala@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: silence -Woverride-init/initializer-overridesQian Cai1-0/+3
When compiling a kernel with W=1, there are several of those warnings due to arm64 overriding a field on purpose. Just disable those warnings for both GCC and Clang of this file, so it will help dig "gems" hidden in the W=1 warnings by reducing some noises. mm/init-mm.c:39:2: warning: initializer overrides prior initialization of this subobject [-Winitializer-overrides] INIT_MM_CONTEXT(init_mm) ^~~~~~~~~~~~~~~~~~~~~~~~ ./arch/arm64/include/asm/mmu.h:133:9: note: expanded from macro 'INIT_MM_CONTEXT' .pgd = init_pg_dir, ^~~~~~~~~~~ mm/init-mm.c:30:10: note: previous initialization is here .pgd = swapper_pg_dir, ^~~~~~~~~~~~~~ Note: there is a side project trying to support explicitly allowing specific initializer overrides in Clang, but there is no guarantee it will happen or not. https://github.com/ClangBuiltLinux/linux/issues/639 Link: http://lkml.kernel.org/r/1566920867-27453-1-git-send-email-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: use CPU_BITS_NONE to initialize init_mm.cpu_bitmaskMike Rapoport1-1/+1
Replace open-coded bitmap array initialization of init_mm.cpu_bitmask with neat CPU_BITS_NONE macro. And, since init_mm.cpu_bitmask is statically set to zero, there is no way to clear it again in start_kernel(). Link: http://lkml.kernel.org/r/1565703815-8584-1-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/vmalloc.c: move 'area->pages' after if statementAustin Kim1-3/+5
If !area->pages statement is true where memory allocation fails, area is freed. In this case 'area->pages = pages' should not executed. So move 'area->pages = pages' after if statement. [akpm@linux-foundation.org: give area->pages the same treatment] Link: http://lkml.kernel.org/r/20190830035716.GA190684@LGEARND20B15 Signed-off-by: Austin Kim <austindh.kim@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/vmalloc: modify struct vmap_area to reduce its sizePengfei Li1-14/+10
Objective --------- The current implementation of struct vmap_area wasted space. After applying this commit, sizeof(struct vmap_area) has been reduced from 11 words to 8 words. Description ----------- 1) Pack "subtree_max_size", "vm" and "purge_list". This is no problem because A) "subtree_max_size" is only used when vmap_area is in "free" tree B) "vm" is only used when vmap_area is in "busy" tree C) "purge_list" is only used when vmap_area is in vmap_purge_list 2) Eliminate "flags". ;Since only one flag VM_VM_AREA is being used, and the same thing can be done by judging whether "vm" is NULL, then the "flags" can be eliminated. Link: http://lkml.kernel.org/r/20190716152656.12255-3-lpf.vector@gmail.com Signed-off-by: Pengfei Li <lpf.vector@gmail.com> Suggested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Roman Gushchin <guro@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/vmalloc: do not keep unpurged areas in the busy treeUladzislau Rezki (Sony)1-8/+44
The busy tree can be quite big, even though the area is freed or unmapped it still stays there until "purge" logic removes it. 1) Optimize and reduce the size of "busy" tree by removing a node from it right away as soon as user triggers free paths. It is possible to do so, because the allocation is done using another augmented tree. The vmalloc test driver shows the difference, for example the "fix_size_alloc_test" is ~11% better comparing with default configuration: sudo ./test_vmalloc.sh performance <default> Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec <default> <this patch> Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec <this patch> 2) Since the busy tree now contains allocated areas only and does not interfere with lazily free nodes, introduce the new function show_purge_info() that dumps "unpurged" areas that is propagated through "/proc/vmallocinfo". 3) Eliminate VM_LAZY_FREE flag. Link: http://lkml.kernel.org/r/20190716152656.12255-2-lpf.vector@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Pengfei Li <lpf.vector@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/sparse.c: remove NULL check in clear_hwpoisoned_pages()Alastair D'Silva1-3/+0
There is no possibility for memmap to be NULL in the current codebase. This check was added in commit 95a4774d055c ("memory-hotplug: update mce_bad_pages when removing the memory") where memmap was originally inited to NULL, and only conditionally given a value. The code that could have passed a NULL has been removed by commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"), so there is no longer a possibility that memmap can be NULL. Link: http://lkml.kernel.org/r/20190829035151.20975-1-alastair@d-silva.org Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Qian Cai <cai@lca.pw> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Baoquan He <bhe@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/sparse.c: don't manually decrement num_poisoned_pagesAlastair D'Silva1-1/+3
Use the function written to do it instead. Link: http://lkml.kernel.org/r/20190827053656.32191-2-alastair@au1.ibm.com Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/sparse.c: use __nr_to_section(section_nr) to get mem_sectionWei Yang1-1/+1
__pfn_to_section is defined as __nr_to_section(pfn_to_section_nr(pfn)). Since we already get section_nr, it is not necessary to get mem_section from start_pfn. By doing so, we reduce one redundant operation. Link: http://lkml.kernel.org/r/20190809010242.29797-1-richardw.yang@linux.intel.com Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/sparse.c: fix ALIGN() without power of 2 in sparse_buffer_alloc()Lecopzer Chen1-1/+1
The size argument passed into sparse_buffer_alloc() has already been aligned with PAGE_SIZE or PMD_SIZE. If the size after aligned is not power of 2 (e.g. 0x480000), the PTR_ALIGN() will return wrong value. Use roundup to round sparsemap_buf up to next multiple of size. Link: http://lkml.kernel.org/r/20190705114826.28586-1-lecopzer.chen@mediatek.com Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Signed-off-by: Mark-PK Tsai <Mark-PK.Tsai@mediatek.com> Cc: YJ Chiang <yj.chiang@mediatek.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/sparse.c: fix memory leak of sparsemap_buf in aligned memoryLecopzer Chen1-2/+12
sparse_buffer_alloc(xsize) gets the size of memory from sparsemap_buf after being aligned with the size. However, the size is at least PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION) and usually larger than PAGE_SIZE. Also, sparse_buffer_fini() only frees memory between sparsemap_buf and sparsemap_buf_end, since sparsemap_buf may be changed by PTR_ALIGN() first, the aligned space before sparsemap_buf is wasted and no one will touch it. In our ARM32 platform (without SPARSEMEM_VMEMMAP) Sparse_buffer_init Reserve d359c000 - d3e9c000 (9M) Sparse_buffer_alloc Alloc d3a00000 - d3E80000 (4.5M) Sparse_buffer_fini Free d3e80000 - d3e9c000 (~=100k) The reserved memory between d359c000 - d3a00000 (~=4.4M) is unfreed. In ARM64 platform (with SPARSEMEM_VMEMMAP) sparse_buffer_init Reserve ffffffc07d623000 - ffffffc07f623000 (32M) Sparse_buffer_alloc Alloc ffffffc07d800000 - ffffffc07f600000 (30M) Sparse_buffer_fini Free ffffffc07f600000 - ffffffc07f623000 (140K) The reserved memory between ffffffc07d623000 - ffffffc07d800000 (~=1.9M) is unfreed. Let's explicit free redundant aligned memory. [arnd@arndb.de: mark sparse_buffer_free as __meminit] Link: http://lkml.kernel.org/r/20190709185528.3251709-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20190705114730.28534-1-lecopzer.chen@mediatek.com Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Signed-off-by: Mark-PK Tsai <Mark-PK.Tsai@mediatek.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: YJ Chiang <yj.chiang@mediatek.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/memory_hotplug.c: s/is/ifSouptick Joarder1-1/+1
Correct typo in comment. Link: http://lkml.kernel.org/r/1568233954-3913-1-git-send-email-jrdr.linux@gmail.com Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/memory_hotplug: online_pages cannot be 0 in online_pages()David Hildenbrand1-13/+9
walk_system_ram_range() will fail with -EINVAL in case online_pages_range() was never called (== no resource applicable in the range). Otherwise, we will always call online_pages_range() with nr_pages > 0 and, therefore, have online_pages > 0. Remove that special handling. Link: http://lkml.kernel.org/r/20190814154109.3448-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/memory_hotplug: make sure the pfn is aligned to the order when onliningDavid Hildenbrand1-0/+3
Commit a9cd410a3d29 ("mm/page_alloc.c: memory hotplug: free pages as higher order") assumed that any PFN we get via memory resources is aligned to to MAX_ORDER - 1, I am not convinced that is always true. Let's play safe, check the alignment and fallback to single pages. akpm: warn in this situation so we get to find out if and why this ever occurs. [akpm@linux-foundation.org: add WARN_ON_ONCE()] Link: http://lkml.kernel.org/r/20190814154109.3448-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/memory_hotplug: simplify online_pages_range()David Hildenbrand1-20/+16
online_pages always corresponds to nr_pages. Simplify the code, getting rid of online_pages_blocks(). Add some comments. Link: http://lkml.kernel.org/r/20190814154109.3448-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/memory_hotplug: drop PageReserved() check in online_pages_range()David Hildenbrand1-3/+1
move_pfn_range_to_zone() will set all pages to PG_reserved via memmap_init_zone(). The only way a page could no longer be reserved would be if a MEM_GOING_ONLINE notifier would clear PG_reserved - which is not done (the online_page callback is used for that purpose by e.g., Hyper-V instead). walk_system_ram_range() will never call online_pages_range() with duplicate PFNs, so drop the PageReserved() check. This seems to be a leftover from ancient times where the memmap was initialized when adding memory and we wanted to check for already onlined memory. Link: http://lkml.kernel.org/r/20190814154109.3448-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/memory_hotplug.c: prevent memory leak when reusing pgdatWei Yang1-1/+9
When offlining a node in try_offline_node(), pgdat is not released. So that pgdat could be reused in hotadd_new_pgdat(). While we reallocate pgdat->per_cpu_nodestats if this pgdat is reused. This patch prevents the memory leak by just allocating per_cpu_nodestats when it is a new pgdat. Link: http://lkml.kernel.org/r/20190813020608.10194-1-richardw.yang@linux.intel.com Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <OSalvador@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25drivers/base/memory.c: don't store end_section_nr in memory blocksDavid Hildenbrand1-1/+1
Each memory block spans the same amount of sections/pages/bytes. The size is determined before the first memory block is created. No need to store what we can easily calculate - and the calculations even look simpler now. Michal brought up the idea of variable-sized memory blocks. However, if we ever implement something like this, we will need an API compatibility switch and reworks at various places (most code assumes a fixed memory block size). So let's cleanup what we have right now. While at it, fix the variable naming in register_mem_sect_under_node() - we no longer talk about a single section. Link: http://lkml.kernel.org/r/20190809110200.2746-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/memory_hotplug: remove move_pfn_range()David Hildenbrand1-16/+8
Let's remove this indirection. We need the zone in the caller either way, so let's just detect it there. Add some documentation for move_pfn_range_to_zone() instead. [akpm@linux-foundation.org: restore newline, per David] Link: http://lkml.kernel.org/r/20190724142324.3686-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: do not hash address in print_bad_pte()Kefeng Wang1-1/+1
Using %px to show the actual address in print_bad_pte() to help us to debug issue. Link: http://lkml.kernel.org/r/20190831011816.141002-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: remove quicklist page table cachesNicholas Piggin4-111/+0
Patch series "mm: remove quicklist page table caches". A while ago Nicholas proposed to remove quicklist page table caches [1]. I've rebased his patch on the curren upstream and switched ia64 and sh to use generic versions of PTE allocation. [1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com This patch (of 3): Remove page table allocator "quicklists". These have been around for a long time, but have not got much traction in the last decade and are only used on ia64 and sh architectures. The numbers in the initial commit look interesting but probably don't apply anymore. If anybody wants to resurrect this it's in the git history, but it's unhelpful to have this code and divergent allocator behaviour for minor archs. Also it might be better to instead make more general improvements to page allocator if this is still so slow. Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: release the spinlock on zap_pte_rangeMinchan Kim1-2/+8
In our testing (camera recording), Miguel and Wei found unmap_page_range() takes above 6ms with preemption disabled easily. When I see that, the reason is it holds page table spinlock during entire 512 page operation in a PMD. 6.2ms is never trivial for user experince if RT task couldn't run in the time because it could make frame drop or glitch audio problem. I had a time to benchmark it via adding some trace_printk hooks between pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing device is 2018 premium mobile device. I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the task runs on little core even though it doesn't have any IPI and LRU lock contention. It's already too heavy. If I remove activate_page, 35-40% overhead of zap_pte_range is gone so most of overhead(about 0.7ms) comes from activate_page via mark_page_accessed. Thus, if there are LRU contention, that 0.7ms could accumulate up to several ms. So this patch adds a check for need_resched() in the loop, and a preemption point if necessary. Link: http://lkml.kernel.org/r/20190731061440.GC155569@google.com Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Miguel de Dios <migueldedios@google.com> Reported-by: Wei Wang <wvw@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: remove redundant assignment of entryWei Yang1-1/+0
Since ptent will not be changed after previous assignment of entry, it is not necessary to do the assignment again. Link: http://lkml.kernel.org/r/20190708082740.21111-1-richardw.yang@linux.intel.com Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/gup: add make_dirty arg to put_user_pages_dirty_lock()akpm@linux-foundation.org1-65/+50
[11~From: John Hubbard <jhubbard@nvidia.com> Subject: mm/gup: add make_dirty arg to put_user_pages_dirty_lock() Patch series "mm/gup: add make_dirty arg to put_user_pages_dirty_lock()", v3. There are about 50+ patches in my tree [2], and I'll be sending out the remaining ones in a few more groups: * The block/bio related changes (Jerome mostly wrote those, but I've had to move stuff around extensively, and add a little code) * mm/ changes * other subsystem patches * an RFC that shows the current state of the tracking patch set. That can only be applied after all call sites are converted, but it's good to get an early look at it. This is part a tree-wide conversion, as described in fc1d8e7cca2d ("mm: introduce put_user_page*(), placeholder versions"). This patch (of 3): Provide more capable variation of put_user_pages_dirty_lock(), and delete put_user_pages_dirty(). This is based on the following: 1. Lots of call sites become simpler if a bool is passed into put_user_page*(), instead of making the call site choose which put_user_page*() variant to call. 2. Christoph Hellwig's observation that set_page_dirty_lock() is usually correct, and set_page_dirty() is usually a bug, or at least questionable, within a put_user_page*() calling chain. This leads to the following API choices: * put_user_pages_dirty_lock(page, npages, make_dirty) * There is no put_user_pages_dirty(). You have to hand code that, in the rare case that it's required. [jhubbard@nvidia.com: remove unused variable in siw_free_plist()] Link: http://lkml.kernel.org/r/20190729074306.10368-1-jhubbard@nvidia.com Link: http://lkml.kernel.org/r/20190724044537.10458-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: vmscan: do not share cgroup iteration between reclaimersJohannes Weiner1-20/+2
One of our services observed a high rate of cgroup OOM kills in the presence of large amounts of clean cache. Debugging showed that the culprit is the shared cgroup iteration in page reclaim. Under high allocation concurrency, multiple threads enter reclaim at the same time. Fearing overreclaim when we first switched from the single global LRU to cgrouped LRU lists, we introduced a shared iteration state for reclaim invocations - whether 1 or 20 reclaimers are active concurrently, we only walk the cgroup tree once: the 1st reclaimer reclaims the first cgroup, the second the second one etc. With more reclaimers than cgroups, we start another walk from the top. This sounded reasonable at the time, but the problem is that reclaim concurrency doesn't scale with allocation concurrency. As reclaim concurrency increases, the amount of memory individual reclaimers get to scan gets smaller and smaller. Individual reclaimers may only see one cgroup per cycle, and that may not have much reclaimable memory. We see individual reclaimers declare OOM when there is plenty of reclaimable memory available in cgroups they didn't visit. This patch does away with the shared iterator, and every reclaimer is allowed to scan the full cgroup tree and see all of reclaimable memory, just like it would on a non-cgrouped system. This way, when OOM is declared, we know that the reclaimer actually had a chance. To still maintain fairness in reclaim pressure, disallow cgroup reclaim from bailing out of the tree walk early. Kswapd and regular direct reclaim already don't bail, so it's not clear why limit reclaim would have to, especially since it only walks subtrees to begin with. This change completely eliminates the OOM kills on our service, while showing no signs of overreclaim - no increased scan rates, %sys time, or abrupt free memory spikes. I tested across 100 machines that have 64G of RAM and host about 300 cgroups each. [ It's possible overreclaim never was a *practical* issue to begin with - it was simply a concern we had on the mailing lists at the time, with no real data to back it up. But we have also added more bail-out conditions deeper inside reclaim (e.g. the proportional exit in shrink_node_memcg) since. Regardless, now we have data that suggests full walks are more reliable and scale just fine. ] Link: http://lkml.kernel.org/r/20190812192316.13615-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: memcontrol: switch to rcu protection in drain_all_stock()Roman Gushchin1-8/+9
Commit 72f0184c8a00 ("mm, memcg: remove hotplug locking from try_charge") introduced css_tryget()/css_put() calls in drain_all_stock(), which are supposed to protect the target memory cgroup from being released during the mem_cgroup_is_descendant() call. However, it's not completely safe. In theory, memcg can go away between reading stock->cached pointer and calling css_tryget(). This can happen if drain_all_stock() races with drain_local_stock() performed on the remote cpu as a result of a work, scheduled by the previous invocation of drain_all_stock(). The race is a bit theoretical and there are few chances to trigger it, but the current code looks a bit confusing, so it makes sense to fix it anyway. The code looks like as if css_tryget() and css_put() are used to protect stocks drainage. It's not necessary because stocked pages are holding references to the cached cgroup. And it obviously won't work for works, scheduled on other cpus. So, let's read the stock->cached pointer and evaluate the memory cgroup inside a rcu read section, and get rid of css_tryget()/css_put() calls. Link: http://lkml.kernel.org/r/20190802192241.3253165-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm, memcg: throttle allocators when failing reclaim over memory.highChris Down1-1/+125
We're trying to use memory.high to limit workloads, but have found that containment can frequently fail completely and cause OOM situations outside of the cgroup. This happens especially with swap space -- either when none is configured, or swap is full. These failures often also don't have enough warning to allow one to react, whether for a human or for a daemon monitoring PSI. Here is output from a simple program showing how long it takes in usec (column 2) to allocate a megabyte of anonymous memory (column 1) when a cgroup is already beyond its memory high setting, and no swap is available: [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \ > --wait -t timeout 300 /root/mdf [...] 95 1035 96 1038 97 1000 98 1036 99 1048 100 1590 101 1968 102 1776 103 1863 104 1757 105 1921 106 1893 107 1760 108 1748 109 1843 110 1716 111 1924 112 1776 113 1831 114 1766 115 1836 116 1588 117 1912 118 1802 119 1857 120 1731 [...] [System OOM in 2-3 seconds] The delay does go up extremely marginally past the 100MB memory.high threshold, as now we spend time scanning before returning to usermode, but it's nowhere near enough to contain growth. It also doesn't get worse the more pages you have, since it only considers nr_pages. The current situation goes against both the expectations of users of memory.high, and our intentions as cgroup v2 developers. In cgroup-v2.txt, we claim that we will throttle and only under "extreme conditions" will memory.high protection be breached. Likewise, cgroup v2 users generally also expect that memory.high should throttle workloads as they exceed their high threshold. However, as seen above, this isn't always how it works in practice -- even on banal setups like those with no swap, or where swap has become exhausted, we can end up with memory.high being breached and us having no weapons left in our arsenal to combat runaway growth with, since reclaim is futile. It's also hard for system monitoring software or users to tell how bad the situation is, as "high" events for the memcg may in some cases be benign, and in others be catastrophic. The current status quo is that we fail containment in a way that doesn't provide any advance warning that things are about to go horribly wrong (for example, we are about to invoke the kernel OOM killer). This patch introduces explicit throttling when reclaim is failing to keep memcg size contained at the memory.high setting. It does so by applying an exponential delay curve derived from the memcg's overage compared to memory.high. In the normal case where the memcg is either below or only marginally over its memory.high setting, no throttling will be performed. This composes well with system health monitoring and remediation, as these allocator delays are factored into PSI's memory pressure calculations. This both creates a mechanism system administrators or applications consuming the PSI interface to trivially see that the memcg in question is struggling and use that to make more reasonable decisions, and permits them enough time to act. Either of these can act with significantly more nuance than that we can provide using the system OOM killer. This is a similar idea to memory.oom_control in cgroup v1 which would put the cgroup to sleep if the threshold was violated, but it's also significantly improved as it results in visible memory pressure, and also doesn't schedule indefinitely, which previously made tracing and other introspection difficult (ie. it's clamped at 2*HZ per allocation through MEMCG_MAX_HIGH_DELAY_JIFFIES). Contrast the previous results with a kernel with this patch: [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \ > --wait -t timeout 300 /root/mdf [...] 95 1002 96 1000 97 1002 98 1003 99 1000 100 1043 101 84724 102 330628 103 610511 104 1016265 105 1503969 106 2391692 107 2872061 108 3248003 109 4791904 110 5759832 111 6912509 112 8127818 113 9472203 114 12287622 115 12480079 116 14144008 117 15808029 118 16384500 119 16383242 120 16384979 [...] As you can see, in the normal case, memory allocation takes around 1000 usec. However, as we exceed our memory.high, things start to increase exponentially, but fairly leniently at first. Our first megabyte over memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the next is almost an entire second. This gets worse until we reach our eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte. However, this is still making forward progress, so permits tracing or further analysis with programs like GDB. We use an exponential curve for our delay penalty for a few reasons: 1. We run mem_cgroup_handle_over_high to potentially do reclaim after we've already performed allocations, which means that temporarily going over memory.high by a small amount may be perfectly legitimate, even for compliant workloads. We don't want to unduly penalise such cases. 2. An exponential curve (as opposed to a static or linear delay) allows ramping up memory pressure stats more gradually, which can be useful to work out that you have set memory.high too low, without destroying application performance entirely. This patch expands on earlier work by Johannes Weiner. Thanks! [akpm@linux-foundation.org: fix max() warning] [akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit] [akpm@linux-foundation.org: fix it even more] [chris@chrisdown.name: fix 64-bit divide even more] Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: page cache: store only head pages in i_pagesMatthew Wilcox (Oracle)7-96/+85
Transparent Huge Pages are currently stored in i_pages as pointers to consecutive subpages. This patch changes that to storing consecutive pointers to the head page in preparation for storing huge pages more efficiently in i_pages. Large parts of this are "inspired" by Kirill's patch https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/ Kirill and Huang Ying contributed several fixes. [willy@infradead.org: use compound_nr, squish uninit-var warning] Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Kirill Shutemov <kirill@shutemov.name> Reviewed-by: Song Liu <songliubraving@fb.com> Tested-by: Song Liu <songliubraving@fb.com> Tested-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Tested-by: Qian Cai <cai@lca.pw> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/filemap.c: rewrite mapping_needs_writeback in less fancy mannerKonstantin Khlebnikov1-2/+5
This actually checks that writeback is needed or in progress. Link: http://lkml.kernel.org/r/156378817069.1087.1302816672037672488.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/filemap.c: don't initiate writeback if mapping has no dirty pagesKonstantin Khlebnikov1-1/+2
Functions like filemap_write_and_wait_range() should do nothing if inode has no dirty pages or pages currently under writeback. But they anyway construct struct writeback_control and this does some atomic operations if CONFIG_CGROUP_WRITEBACK=y - on fast path it locks inode->i_lock and updates state of writeback ownership, on slow path might be more work. Current this path is safely avoided only when inode mapping has no pages. For example generic_file_read_iter() calls filemap_write_and_wait_range() at each O_DIRECT read - pretty hot path. This patch skips starting new writeback if mapping has no dirty tags set. If writeback is already in progress filemap_write_and_wait_range() will wait for it. Link: http://lkml.kernel.org/r/156378816804.1087.8607636317907921438.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm, page_owner, debug_pagealloc: save and dump freeing stack traceVlastimil Babka2-14/+43
The debug_pagealloc functionality is useful to catch buggy page allocator users that cause e.g. use after free or double free. When page inconsistency is detected, debugging is often simpler by knowing the call stack of process that last allocated and freed the page. When page_owner is also enabled, we record the allocation stack trace, but not freeing. This patch therefore adds recording of freeing process stack trace to page owner info, if both page_owner and debug_pagealloc are configured and enabled. With only page_owner enabled, this info is not useful for the memory leak debugging use case. dump_page() is adjusted to print the info. An example result of calling __free_pages() twice may look like this (note the page last free stack trace): BUG: Bad page state in process bash pfn:13d8f8 page:ffffc31984f63e00 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x1affff800000000() raw: 01affff800000000 dead000000000100 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000 page dumped because: nonzero _refcount page_owner tracks the page as freed page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL) prep_new_page+0x143/0x150 get_page_from_freelist+0x289/0x380 __alloc_pages_nodemask+0x13c/0x2d0 khugepaged+0x6e/0xc10 kthread+0xf9/0x130 ret_from_fork+0x3a/0x50 page last free stack trace: free_pcp_prepare+0x134/0x1e0 free_unref_page+0x18/0x90 khugepaged+0x7b/0xc10 kthread+0xf9/0x130 ret_from_fork+0x3a/0x50 Modules linked in: CPU: 3 PID: 271 Comm: bash Not tainted 5.3.0-rc4-2.g07a1a73-default+ #57 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x85/0xc0 bad_page.cold+0xba/0xbf rmqueue_pcplist.isra.0+0x6c5/0x6d0 rmqueue+0x2d/0x810 get_page_from_freelist+0x191/0x380 __alloc_pages_nodemask+0x13c/0x2d0 __get_free_pages+0xd/0x30 __pud_alloc+0x2c/0x110 copy_page_range+0x4f9/0x630 dup_mmap+0x362/0x480 dup_mm+0x68/0x110 copy_process+0x19e1/0x1b40 _do_fork+0x73/0x310 __x64_sys_clone+0x75/0x80 do_syscall_64+0x6e/0x1e0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f10af854a10 ... Link: http://lkml.kernel.org/r/20190820131828.22684-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm, page_owner: keep owner info when freeing the pageVlastimil Babka1-10/+24
For debugging purposes it might be useful to keep the owner info even after page has been freed, and include it in e.g. dump_page() when detecting a bad page state. For that, change the PAGE_EXT_OWNER flag meaning to "page owner info has been set at least once" and add new PAGE_EXT_OWNER_ACTIVE for tracking whether page is supposed to be currently tracked allocated or free. Adjust dump_page() accordingly, distinguishing free and allocated pages. In the page_owner debugfs file, keep printing only allocated pages so that existing scripts are not confused, and also because free pages are irrelevant for the memory statistics or leak detection that's the typical use case of the file, anyway. Link: http://lkml.kernel.org/r/20190820131828.22684-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm, page_owner: record page owner for each subpageVlastimil Babka1-12/+28
Patch series "debug_pagealloc improvements through page_owner", v2. The debug_pagealloc functionality serves a similar purpose on the page allocator level that slub_debug does on the kmalloc level, which is to detect bad users. One notable feature that slub_debug has is storing stack traces of who last allocated and freed the object. On page level we track allocations via page_owner, but that info is discarded when freeing, and we don't track freeing at all. This series improves those aspects. With both debug_pagealloc and page_owner enabled, we can then get bug reports such as the example in Patch 4. SLUB debug tracking additionally stores cpu, pid and timestamp. This could be added later, if deemed useful enough to justify the additional page_ext structure size. This patch (of 3): Currently, page owner info is only recorded for the first page of a high-order allocation, and copied to tail pages in the event of a split page. With the plan to keep previous owner info after freeing the page, it would be benefical to record page owner for each subpage upon allocation. This increases the overhead for high orders, but that should be acceptable for a debugging option. The order stored for each subpage is the order of the whole allocation. This makes it possible to calculate the "head" pfn and to recognize "tail" pages (quoted because not all high-order allocations are compound pages with true head and tail pages). When reading the page_owner debugfs file, keep skipping the "tail" pages so that stats gathered by existing scripts don't get inflated. Link: http://lkml.kernel.org/r/20190820131828.22684-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: replace list_move_tail() with add_page_to_lru_list_tail()Yu Zhao1-8/+6
This is a cleanup patch that replaces two historical uses of list_move_tail() with relatively recent add_page_to_lru_list_tail(). Link: http://lkml.kernel.org/r/20190716212436.7137-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: introduce compound_nr()Matthew Wilcox (Oracle)14-21/+20
Replace 1 << compound_order(page) with compound_nr(page). Minor improvements in readability. Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: introduce page_size()Matthew Wilcox (Oracle)6-22/+17
Patch series "Make working with compound pages easier", v2. These three patches add three helpers and convert the appropriate places to use them. This patch (of 3): It's unnecessarily hard to find out the size of a potentially huge page. Replace 'PAGE_SIZE << compound_order(page)' with page_size(page). Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/rmap.c: remove set but not used variable 'cstart'YueHaibing1-3/+1
Fixes gcc '-Wunused-but-set-variable' warning: mm/rmap.c: In function page_mkclean_one: mm/rmap.c:906:17: warning: variable cstart set but not used [-Wunused-but-set-variable] It is not used any more since commit cdb07bdea28e ("mm/rmap.c: remove redundant variable cend") Link: http://lkml.kernel.org/r/20190724141453.38536-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/page_poison.c: fix a typo in a commentChristophe JAILLET1-1/+1
s/posioned/poisoned/ Link: http://lkml.kernel.org/r/20190721180908.6534-1-christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25kasan: add memory corruption identification for software tag-based modeWalter Wu4-13/+91
Add memory corruption identification at bug report for software tag-based mode. The report shows whether it is "use-after-free" or "out-of-bound" error instead of "invalid-access" error. This will make it easier for programmers to see the memory corruption problem. We extend the slab to store five old free pointer tag and free backtrace, we can check if the tagged address is in the slab record and make a good guess if the object is more like "use-after-free" or "out-of-bound". therefore every slab memory corruption can be identified whether it's "use-after-free" or "out-of-bound". [aryabinin@virtuozzo.com: simplify & clenup code] Link: https://lkml.kernel.org/r/3318f9d7-a760-3cc8-b700-f06108ae745f@virtuozzo.com] Link: http://lkml.kernel.org/r/20190821180332.11450-1-aryabinin@virtuozzo.com Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Andrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/kmemleak.c: record the current memory pool sizeQian Cai1-1/+2
The only way to obtain the current memory pool size for a running kernel is to check the kernel config file which is inconvenient. Record it in the kernel messages. [akpm@linux-foundation.org: s/memory pool size/memory pool/available/, per Catalin] Link: http://lkml.kernel.org/r/1565809631-28933-1-git-send-email-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: kmemleak: use the memory pool for early allocationsCatalin Marinas1-239/+26
Currently kmemleak uses a static early_log buffer to trace all memory allocation/freeing before the slab allocator is initialised. Such early log is replayed during kmemleak_init() to properly initialise the kmemleak metadata for objects allocated up that point. With a memory pool that does not rely on the slab allocator, it is possible to skip this early log entirely. In order to remove the early logging, consider kmemleak_enabled == 1 by default while the kmem_cache availability is checked directly on the object_cache and scan_area_cache variables. The RCU callback is only invoked after object_cache has been initialised as we wouldn't have any concurrent list traversal before this. In order to reduce the number of callbacks before kmemleak is fully initialised, move the kmemleak_init() call to mm_init(). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: remove WARN_ON(), per Catalin] Link: http://lkml.kernel.org/r/20190812160642.52134-4-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: kmemleak: simple memory allocation pool for kmemleak objectsCatalin Marinas1-2/+52
Add a memory pool for struct kmemleak_object in case the normal kmem_cache_alloc() fails under the gfp constraints passed by the caller. The mem_pool[] array size is currently fixed at 16000. We are not using the existing mempool kernel API since this requires the slab allocator to be available (for pool->elements allocation). A subsequent kmemleak patch will replace the static early log buffer with the pool allocation introduced here and this functionality is required to be available before the slab was initialised. Link: http://lkml.kernel.org/r/20190812160642.52134-3-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: kmemleak: make the tool tolerant to struct scan_area allocation failuresCatalin Marinas1-6/+10
Patch series "mm: kmemleak: Use a memory pool for kmemleak object allocations", v3. Following the discussions on v2 of this patch(set) [1], this series takes slightly different approach: - it implements its own simple memory pool that does not rely on the slab allocator - drops the early log buffer logic entirely since it can now allocate metadata from the memory pool directly before kmemleak is fully initialised - CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE option is renamed to CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE - moves the kmemleak_init() call earlier (mm_init()) - to avoid a separate memory pool for struct scan_area, it makes the tool robust when such allocations fail as scan areas are rather an optimisation [1] http://lkml.kernel.org/r/20190727132334.9184-1-catalin.marinas@arm.com This patch (of 3): Object scan areas are an optimisation aimed to decrease the false positives and slightly improve the scanning time of large objects known to only have a few specific pointers. If a struct scan_area fails to allocate, kmemleak can still function normally by scanning the full object. Introduce an OBJECT_FULL_SCAN flag and mark objects as such when scan_area allocation fails. Link: http://lkml.kernel.org/r/20190812160642.52134-2-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm/slub.c: fix -Wunused-function compiler warningsQian Cai1-0/+2
tid_to_cpu() and tid_to_event() are only used in note_cmpxchg_failure() when SLUB_DEBUG_CMPXCHG=y, so when SLUB_DEBUG_CMPXCHG=n by default, Clang will complain that those unused functions. Link: http://lkml.kernel.org/r/1568752232-5094-1-git-send-email-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm, slab: move memcg_cache_params structure to mm/slab.hWaiman Long1-0/+63
The memcg_cache_params structure is only embedded into the kmem_cache of slab and slub allocators as defined in slab_def.h and slub_def.h and used internally by mm code. There is no needed to expose it in a public header. So move it from include/linux/slab.h to mm/slab.h. It is just a refactoring patch with no code change. In fact both the slub_def.h and slab_def.h should be moved into the mm directory as well, but that will probably cause many merge conflicts. Link: http://lkml.kernel.org/r/20190718180827.18758-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm, slab: extend slab/shrink to shrink all memcg cachesWaiman Long3-1/+39
Currently, a value of '1" is written to /sys/kernel/slab/<slab>/shrink file to shrink the slab by flushing out all the per-cpu slabs and free slabs in partial lists. This can be useful to squeeze out a bit more memory under extreme condition as well as making the active object counts in /proc/slabinfo more accurate. This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON option is usually not enabled and "slub_memcg_sysfs=1" not set. Even if memcg sysfs is turned on, it is too cumbersome and impractical to manage all those per-memcg sysfs files in a real production system. So there is no practical way to shrink memcg caches. Fix this by enabling a proper write to the shrink sysfs file of the root cache to scan all the available memcg caches and shrink them as well. For a non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that cache will be shrunk when written. On a 2-socket 64-core 256-thread arm64 system with 64k page after a parallel kernel build, the the amount of memory occupied by slabs before shrinking slabs were: # grep task_struct /proc/slabinfo task_struct 53137 53192 4288 61 4 : tunables 0 0 0 : slabdata 872 872 0 # grep "^S[lRU]" /proc/meminfo Slab: 3936832 kB SReclaimable: 399104 kB SUnreclaim: 3537728 kB After shrinking slabs (by echoing "1" to all shrink files): # grep "^S[lRU]" /proc/meminfo Slab: 1356288 kB SReclaimable: 263296 kB SUnreclaim: 1092992 kB # grep task_struct /proc/slabinfo task_struct 2764 6832 4288 61 4 : tunables 0 0 0 : slabdata 112 112 0 Link: http://lkml.kernel.org/r/20190723151445.7385-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25z3fold: fix retry mechanism in page reclaimVitaly Wool1-15/+34
z3fold_page_reclaim()'s retry mechanism is broken: on a second iteration it will have zhdr from the first one so that zhdr is no longer in line with struct page. That leads to crashes when the system is stressed. Fix that by moving zhdr assignment up. While at it, protect against using already freed handles by using own local slots structure in z3fold_page_reclaim(). Link: http://lkml.kernel.org/r/20190908162919.830388dc7404d1e2c80f4095@gmail.com Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Reported-by: Markus Linnala <markus.linnala@gmail.com> Reported-by: Chris Murphy <bugzilla@colorremedies.com> Reported-by: Agustin Dall'Alba <agustin@dallalba.com.ar> Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> Cc: Shakeel Butt <shakeelb@google.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25Revert "mm/z3fold.c: fix race between migration and destruction"Vitaly Wool1-90/+0
With the original commit applied, z3fold_zpool_destroy() may get blocked on wait_event() for indefinite time. Revert this commit for the time being to get rid of this problem since the issue the original commit addresses is less severe. Link: http://lkml.kernel.org/r/20190910123142.7a9c8d2de4d0acbc0977c602@gmail.com Fixes: d776aaa9895eb6eb77 ("mm/z3fold.c: fix race between migration and destruction") Reported-by: Agustín Dall'Alba <agustin@dallalba.com.ar> Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Adams <jwadams@google.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-21Merge tag 'for-linus-hmm' of ↵Linus Torvalds12-734/+675
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull hmm updates from Jason Gunthorpe: "This is more cleanup and consolidation of the hmm APIs and the very strongly related mmu_notifier interfaces. Many places across the tree using these interfaces are touched in the process. Beyond that a cleanup to the page walker API and a few memremap related changes round out the series: - General improvement of hmm_range_fault() and related APIs, more documentation, bug fixes from testing, API simplification & consolidation, and unused API removal - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and make them internal kconfig selects - Hoist a lot of code related to mmu notifier attachment out of drivers by using a refcount get/put attachment idiom and remove the convoluted mmu_notifier_unregister_no_release() and related APIs. - General API improvement for the migrate_vma API and revision of its only user in nouveau - Annotate mmu_notifiers with lockdep and sleeping region debugging Two series unrelated to HMM or mmu_notifiers came along due to dependencies: - Allow pagemap's memremap_pages family of APIs to work without providing a struct device - Make walk_page_range() and related use a constant structure for function pointers" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits) libnvdimm: Enable unit test infrastructure compile checks mm, notifier: Catch sleeping/blocking for !blockable kernel.h: Add non_block_start/end() drm/radeon: guard against calling an unpaired radeon_mn_unregister() csky: add missing brackets in a macro for tlb.h pagewalk: use lockdep_assert_held for locking validation pagewalk: separate function pointers from iterator data mm: split out a new pagewalk.h header from mm.h mm/mmu_notifiers: annotate with might_sleep() mm/mmu_notifiers: prime lockdep mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports mm/hmm: hmm_range_fault() infinite loop mm/hmm: hmm_range_fault() NULL pointer bug mm/hmm: fix hmm_range_fault()'s handling of swapped out pages mm/mmu_notifiers: remove unregister_no_release RDMA/odp: remove ib_ucontext from ib_umem RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm' RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr RDMA/mlx5: Use ib_umem_start instead of umem.address ...
2019-09-19Merge tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds1-1/+4
Pull dma-mapping updates from Christoph Hellwig: - add dma-mapping and block layer helpers to take care of IOMMU merging for mmc plus subsequent fixups (Yoshihiro Shimoda) - rework handling of the pgprot bits for remapping (me) - take care of the dma direct infrastructure for swiotlb-xen (me) - improve the dma noncoherent remapping infrastructure (me) - better defaults for ->mmap, ->get_sgtable and ->get_required_mask (me) - cleanup mmaping of coherent DMA allocations (me) - various misc cleanups (Andy Shevchenko, me) * tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping: (41 commits) mmc: renesas_sdhi_internal_dmac: Add MMC_CAP2_MERGE_CAPABLE mmc: queue: Fix bigger segments usage arm64: use asm-generic/dma-mapping.h swiotlb-xen: merge xen_unmap_single into xen_swiotlb_unmap_page swiotlb-xen: simplify cache maintainance swiotlb-xen: use the same foreign page check everywhere swiotlb-xen: remove xen_swiotlb_dma_mmap and xen_swiotlb_dma_get_sgtable xen: remove the exports for xen_{create,destroy}_contiguous_region xen/arm: remove xen_dma_ops xen/arm: simplify dma_cache_maint xen/arm: use dev_is_dma_coherent xen/arm: consolidate page-coherent.h xen/arm: use dma-noncoherent.h calls for xen-swiotlb cache maintainance arm: remove wrappers for the generic dma remap helpers dma-mapping: introduce a dma_common_find_pages helper dma-mapping: always use VM_DMA_COHERENT for generic DMA remap vmalloc: lift the arm flag for coherent mappings to common code dma-mapping: provide a better default ->get_required_mask dma-mapping: remove the dma_declare_coherent_memory export remoteproc: don't allow modular build ...