summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2017-03-13mm: vmpressure: fix sending wrong events on underflowVinayak Menon1-1/+9
commit e1587a4945408faa58d0485002c110eb2454740c upstream. At the end of a window period, if the reclaimed pages is greater than scanned, an unsigned underflow can result in a huge pressure value and thus a critical event. Reclaimed pages is found to go higher than scanned because of the addition of reclaimed slab pages to reclaimed in shrink_node without a corresponding increment to scanned pages. Minchan Kim mentioned that this can also happen in the case of a THP page where the scanned is 1 and reclaimed could be 512. Link: http://lkml.kernel.org/r/1486641577-11685-1-git-send-email-vinmenon@codeaurora.org Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Shiraz Hashim <shashim@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2017-02-16mm, fs: check for fatal signals in do_generic_file_read()Michal Hocko1-0/+5
commit 5abf186a30a89d5b9c18a6bf93a2c192c9fd52f6 upstream. do_generic_file_read() can be told to perform a large request from userspace. If the system is under OOM and the reading task is the OOM victim then it has an access to memory reserves and finishing the full request can lead to the full memory depletion which is dangerous. Make sure we rather go with a short read and allow the killed task to terminate. Link: http://lkml.kernel.org/r/20170201092706.9966-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2017-02-16mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()Toshi Kani1-4/+8
commit deb88a2a19e85842d79ba96b05031739ec327ff4 upstream. Patch series "fix a kernel oops when reading sysfs valid_zones", v2. A sysfs memory file is created for each 2GiB memory block on x86-64 when the system has 64GiB or more memory. [1] When the start address of a memory block is not backed by struct page, i.e. a memory range is not aligned by 2GiB, reading its 'valid_zones' attribute file leads to a kernel oops. This issue was observed on multiple x86-64 systems with more than 64GiB of memory. This patch-set fixes this issue. Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not test the start section. Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone() to return valid [start, end). Note for stable kernels: The memory block size change was made by commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems"), which was accepted to 3.9. However, this patch-set depends on (and fixes) the change to test_pages_in_a_zone() made by commit 5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone()"), which was accepted to 4.4. So, I recommend that we backport it up to 4.4. [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems")' This patch (of 2): test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by section since 'sec_end_pfn' is set equal to 'pfn'. Since this function is called for testing the range of a sysfs memory file, 'start_pfn' is always aligned by section. Fix it by properly setting 'sec_end_pfn' to the next section pfn. Also make sure that this function returns 1 only when the range belongs to a zone. Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2017-01-26mm/hugetlb.c: fix reservation race when freeing surplus pagesMike Kravetz1-9/+28
commit e5bbc8a6c992901058bc09e2ce01d16c111ff047 upstream. return_unused_surplus_pages() decrements the global reservation count, and frees any unused surplus pages that were backing the reservation. Commit 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()") added a call to cond_resched_lock in the loop freeing the pages. As a result, the hugetlb_lock could be dropped, and someone else could use the pages that will be freed in subsequent iterations of the loop. This could result in inconsistent global hugetlb page state, application api failures (such as mmap) failures or application crashes. When dropping the lock in return_unused_surplus_pages, make sure that the global reservation count (resv_huge_pages) remains sufficiently large to prevent someone else from claiming pages about to be freed. Analyzed by Paul Cassella. Fixes: 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()") Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Paul Cassella <cassella@cray.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2017-01-26mm/init: fix zone boundary creationOliver O'Halloran1-7/+10
commit 90cae1fe1c3540f791d5b8e025985fa5e699b2bb upstream. As a part of memory initialisation the architecture passes an array to free_area_init_nodes() which specifies the max PFN of each memory zone. This array is not necessarily monotonic (due to unused zones) so this array is parsed to build monotonic lists of the min and max PFN for each zone. ZONE_MOVABLE is special cased here as its limits are managed by the mm subsystem rather than the architecture. Unfortunately, this special casing is broken when ZONE_MOVABLE is the not the last zone in the zone list. The core of the issue is: if (i == ZONE_MOVABLE) continue; arch_zone_lowest_possible_pfn[i] = arch_zone_highest_possible_pfn[i-1]; As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone will be set to zero. This patch fixes this bug by adding explicitly tracking where the next zone should start rather than relying on the contents arch_zone_highest_possible_pfn[]. Thie is low priority. To get bitten by this you need to enable a zone that appears after ZONE_MOVABLE in the zone_type enum. As far as I can tell this means running a kernel with ZONE_DEVICE or ZONE_CMA enabled, so I can't see this affecting too many people. I only noticed this because I've been fiddling with ZONE_DEVICE on powerpc and 4.6 broke my test kernel. This bug, in conjunction with the changes in Taku Izumi's kernelcore=mirror patch (d91749c1dda71) and powerpc being the odd architecture which initialises max_zone_pfn[] to ~0ul instead of 0 caused all of system memory to be placed into ZONE_DEVICE at boot, followed a panic since device memory cannot be used for kernel allocations. I've already submitted a patch to fix the powerpc specific bits, but I figured this should be fixed too. Link: http://lkml.kernel.org/r/1462435033-15601-1-git-send-email-oohall@gmail.com Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Cc: Anton Blanchard <anton@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2017-01-26mm/vmscan.c: set correct defer count for shrinkerShaohua Li1-3/+11
commit 5f33a0803bbd781de916f5c7448cbbbbc763d911 upstream. Our system uses significantly more slab memory with memcg enabled with the latest kernel. With 3.10 kernel, slab uses 2G memory, while with 4.6 kernel, 6G memory is used. The shrinker has problem. Let's see we have two memcg for one shrinker. In do_shrink_slab: 1. Check cg1. nr_deferred = 0, assume total_scan = 700. batch size is 1024, then no memory is freed. nr_deferred = 700 2. Check cg2. nr_deferred = 700. Assume freeable = 20, then total_scan = 10 or 40. Let's assume it's 10. No memory is freed. nr_deferred = 10. The deferred share of cg1 is lost in this case. kswapd will free no memory even run above steps again and again. The fix makes sure one memcg's deferred share isn't lost. Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-11-29mm: export symbol dependencies of is_zero_pfn()Ard Biesheuvel1-0/+2
commit 0b70068e47e8f0c813a902dc3d6def601fd15acb upstream. In order to make the static inline function is_zero_pfn() callable by modules, export its symbol dependencies 'zero_pfn' and (for s390 and mips) 'zero_page_mask'. We need this for KVM, as CONFIG_KVM is a tristate for all supported architectures except ARM and arm64, and testing a pfn whether it refers to the zero page is required to correctly distinguish the zero page from other special RAM ranges that may also have the PG_reserved bit set, but need to be treated as MMIO memory. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-11-29mm: filemap: update find_get_pages_tag() to deal with shadow entriesJohannes Weiner3-37/+40
commit 139b6a6fb1539e04b01663d61baff3088c63dbb5 upstream. Dave Jones reports the following crash when find_get_pages_tag() runs into an exceptional entry: kernel BUG at mm/filemap.c:1347! RIP: find_get_pages_tag+0x1cb/0x220 Call Trace: find_get_pages_tag+0x36/0x220 pagevec_lookup_tag+0x21/0x30 filemap_fdatawait_range+0xbe/0x1e0 filemap_fdatawait+0x27/0x30 sync_inodes_sb+0x204/0x2a0 sync_inodes_one_sb+0x19/0x20 iterate_supers+0xb2/0x110 sys_sync+0x44/0xb0 ia32_do_call+0x13/0x13 1343 /* 1344 * This function is never used on a shmem/tmpfs 1345 * mapping, so a swap entry won't be found here. 1346 */ 1347 BUG(); After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") this comment and BUG() are out of date because exceptional entries can now appear in all mappings - as shadows of recently evicted pages. However, as Hugh Dickins notes, "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably any other PAGECACHE_TAG_*) to appear on an exceptional entry. I expect it comes down to an occasional race in RCU lookup of the radix_tree: lacking absolute synchronization, we might sometimes catch an exceptional entry, with the tag which really belongs with the unexceptional entry which was there an instant before." And indeed, not only is the tree walk lockless, the tags are also read in chunks, one radix tree node at a time. There is plenty of time for page reclaim to swoop in and replace a page that was already looked up as tagged with a shadow entry. Remove the BUG() and update the comment. While reviewing all other lookup sites for whether they properly deal with shadow entries of evicted pages, update all the comments and fix memcg file charge moving to not miss shmem/tmpfs swapcache pages. Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-11-24swapfile: fix memory corruption via malformed swapfileJann Horn1-0/+2
commit dd111be69114cc867f8e826284559bfbc1c40e37 upstream. When root activates a swap partition whose header has the wrong endianness, nr_badpages elements of badpages are swabbed before nr_badpages has been checked, leading to a buffer overrun of up to 8GB. This normally is not a security issue because it can only be exploited by root (more specifically, a process with CAP_SYS_ADMIN or the ability to modify a swap file/partition), and such a process can already e.g. modify swapped-out memory of any other userspace process on the system. Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-11-08mm/hugetlb: fix memory offline with hugepage size > memory block sizeGerald Schaefer1-6/+7
commit 2247bb335ab9c40058484cac36ea74ee652f3b7b upstream. Patch series "mm/hugetlb: memory offline issues with hugepages", v4. This addresses several issues with hugepages and memory offline. While the first patch fixes a panic, and is therefore rather important, the last patch is just a performance optimization. The second patch fixes a theoretical issue with reserved hugepages, while still leaving some ugly usability issue, see description. This patch (of 3): dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a list corruption and addressing exception when trying to set a memory block offline that is part (but not the first part) of a "gigantic" hugetlb page with a size > memory block size. When no other smaller hugetlb page sizes are present, the VM_BUG_ON() will trigger directly. In the other case we will run into an addressing exception later, because dissolve_free_huge_page() will not work on the head page of the compound hugetlb page which will result in a NULL hstate from page_hstate(). To fix this, first remove the VM_BUG_ON() because it is wrong, and then use the compound head page in dissolve_free_huge_page(). This means that an unused pre-allocated gigantic page that has any part of itself inside the memory block that is going offline will be dissolved completely. Losing an unused gigantic hugepage is preferable to failing the memory offline, for example in the situation where a (possibly faulty) memory DIMM needs to go offline. Changes for v4.4 stable: - make it apply w/o commit c1470b33 "mm/hugetlb: fix incorrect hugepages count during mem hotplug" Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rui Teng <rui.teng@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-10-20mm: remove gup_flags FOLL_WRITE games from __get_user_pages()Linus Torvalds1-2/+12
commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream. This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit 4ceb5db9757a ("Fix get_user_pages() race for write access") but that was then undone due to problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug"). In the meantime, the s390 situation has long been fixed, and we can now fix it by checking the pte_dirty() bit properly (and do it better). The s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement software dirty bits") which made it into v3.9. Earlier kernels will have to look at the page state itself. Also, the VM has become more scalable, and what used a purely theoretical race back then has become easier to trigger. To fix it, we introduce a new internal FOLL_COW flag to mark the "yes, we already did a COW" rather than play racy games with FOLL_WRITE that is very fundamental, and then use the pte dirty flag to validate that the FOLL_COW flag is still valid. Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Nick Piggin <npiggin@gmail.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-10-07mm,ksm: fix endless looping in allocating memory when ksm enablezhong jiang1-1/+2
commit 5b398e416e880159fe55eefd93c6588fa072cd66 upstream. I hit the following hung task when runing a OOM LTP test case with 4.1 kernel. Call trace: [<ffffffc000086a88>] __switch_to+0x74/0x8c [<ffffffc000a1bae0>] __schedule+0x23c/0x7bc [<ffffffc000a1c09c>] schedule+0x3c/0x94 [<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350 [<ffffffc000a1e32c>] down_write+0x64/0x80 [<ffffffc00021f794>] __ksm_exit+0x90/0x19c [<ffffffc0000be650>] mmput+0x118/0x11c [<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74 [<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4 [<ffffffc0000d0f34>] get_signal+0x444/0x5e0 [<ffffffc000089fcc>] do_signal+0x1d8/0x450 [<ffffffc00008a35c>] do_notify_resume+0x70/0x78 The oom victim cannot terminate because it needs to take mmap_sem for write while the lock is held by ksmd for read which loops in the page allocator ksm_do_scan scan_get_next_rmap_item down_read get_next_rmap_item alloc_rmap_item #ksmd will loop permanently. There is no way forward because the oom victim cannot release any memory in 4.1 based kernel. Since 4.6 we have the oom reaper which would solve this problem because it would release the memory asynchronously. Nevertheless we can relax alloc_rmap_item requirements and use __GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan would just retry later after the lock got dropped. Such a patch would be also easy to backport to older stable kernels which do not have oom_reaper. While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed by the allocation failure. Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Suggested-by: Hugh Dickins <hughd@google.com> Suggested-by: Michal Hocko <mhocko@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-08-19mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEEDAndrea Arcangeli1-2/+12
commit ad33bb04b2a6cee6c1f99fabb15cddbf93ff0433 upstream. pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were introduced to locklessy (but atomically) detect when a pmd is a regular (stable) pmd or when the pmd is unstable and can infinitely transition from pmd_none() and pmd_trans_huge() from under us, while only holding the mmap_sem for reading (for writing not). While holding the mmap_sem only for reading, MADV_DONTNEED can run from under us and so before we can assume the pmd to be a regular stable pmd we need to compare it against pmd_none() and pmd_trans_huge() in an atomic way, with pmd_trans_unstable(). The old pmd_trans_huge() left a tiny window for a race. Useful applications are unlikely to notice the difference as doing MADV_DONTNEED concurrently with a page fault would lead to undefined behavior. [js] 3.12 backport: no pmd_devmap in 3.12 yet. [akpm@linux-foundation.org: tidy up comment grammar/layout] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-08-19tmpfs: fix regression hang in fallocate undoHugh Dickins1-3/+5
commit 7f556567036cb7f89aabe2f0954b08566b4efb53 upstream. The well-spotted fallocate undo fix is good in most cases, but not when fallocate failed on the very first page. index 0 then passes lend -1 to shmem_undo_range(), and that has two bad effects: (a) that it will undo every fallocation throughout the file, unrestricted by the current range; but more importantly (b) it can cause the undo to hang, because lend -1 is treated as truncation, which makes it keep on retrying until every page has gone, but those already fully instantiated will never go away. Big thank you to xfstests generic/269 which demonstrates this. Fixes: b9b4bb26af01 ("tmpfs: don't undo fallocate past its last page") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-08-19tmpfs: don't undo fallocate past its last pageAnthony Romano1-1/+1
commit b9b4bb26af017dbe930cd4df7f9b2fc3a0497bfe upstream. When fallocate is interrupted it will undo a range that extends one byte past its range of allocated pages. This can corrupt an in-use page by zeroing out its first byte. Instead, undo using the inclusive byte range. Fixes: 1635f6a74152f1d ("tmpfs: undo fallocation on failure") Link: http://lkml.kernel.org/r/1462713387-16724-1-git-send-email-anthony.romano@coreos.com Signed-off-by: Anthony Romano <anthony.romano@coreos.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Brandon Philips <brandon@ifup.co> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-07-21mm: migrate dirty page without clear_page_dirty_for_io etcHugh Dickins1-20/+31
commit 42cb14b110a5698ccf26ce59c4441722605a3743 upstream. clear_page_dirty_for_io() has accumulated writeback and memcg subtleties since v2.6.16 first introduced page migration; and the set_page_dirty() which completed its migration of PageDirty, later had to be moderated to __set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too. No actual problems seen with this procedure recently, but if you look into what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually achieving, it turns out to be nothing more than moving the PageDirty flag, and its NR_FILE_DIRTY stat from one zone to another. It would be good to avoid a pile of irrelevant decrementations and incrementations, and improper event counting, and unnecessary descent of the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which radix_tree_replace_slot() left in place anyway). Do the NR_FILE_DIRTY movement, like the other stats movements, while interrupts still disabled in migrate_page_move_mapping(); and don't even bother if the zone is the same. Do the PageDirty movement there under tree_lock too, where old page is frozen and newpage not yet visible: bearing in mind that as soon as newpage becomes visible in radix_tree, an un-page-locked set_page_dirty() might interfere (or perhaps that's just not possible: anything doing so should already hold an additional reference to the old page, preventing its migration; but play safe). But we do still need to transfer PageDirty in migrate_page_copy(), for those who don't go the mapping route through migrate_page_move_mapping(). Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Charles (Chas) Williams <ciwillia@brocade.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-07-21mm/swap.c: flush lru pvecs on compound page arrivalLukasz Odzioba1-5/+4
commit 8f182270dfec432e93fae14f9208a6b9af01009f upstream. Currently we can have compound pages held on per cpu pagevecs, which leads to a lot of memory unavailable for reclaim when needed. In the systems with hundreads of processors it can be GBs of memory. On of the way of reproducing the problem is to not call munmap explicitly on all mapped regions (i.e. after receiving SIGTERM). After that some pages (with THP enabled also huge pages) may end up on lru_add_pvec, example below. void main() { #pragma omp parallel { size_t size = 55 * 1000 * 1000; // smaller than MEM/CPUS void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS , -1, 0); if (p != MAP_FAILED) memset(p, 0, size); //munmap(p, size); // uncomment to make the problem go away } } When we run it with THP enabled it will leave significant amount of memory on lru_add_pvec. This memory will be not reclaimed if we hit OOM, so when we run above program in a loop: for i in `seq 100`; do ./a.out; done many processes (95% in my case) will be killed by OOM. The primary point of the LRU add cache is to save the zone lru_lock contention with a hope that more pages will belong to the same zone and so their addition can be batched. The huge page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like a safer option when compared to a potential excess in the caching which can be quite large and much harder to fix because lru_add_drain_all is way to expensive and it is not really clear what would be a good moment to call it. Similarly we can reproduce the problem on lru_deactivate_pvec by adding: madvise(p, size, MADV_FREE); after memset. This patch flushes lru pvecs on compound page arrival making the problem less severe - after applying it kill rate of above example drops to 0%, due to reducing maximum amount of memory held on pvec from 28MB (with THP) to 56kB per CPU. Suggested-by: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.com Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Ming Li <mingli199x@qq.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-07-21mm: Export migrate_page_move_mapping and migrate_page_copyRichard Weinberger1-0/+2
commit 1118dce773d84f39ebd51a9fe7261f9169cb056e upstream. Export these symbols such that UBIFS can implement ->migratepage. Signed-off-by: Richard Weinberger <richard@nod.at> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-05-13mm/balloon_compaction: fix deflation when compaction is disabledKonstantin Khlebnikov1-0/+2
commit 4d88e6f7d5ffc84e6094a47925870f4a130555c2 upstream. If CONFIG_BALLOON_COMPACTION=n balloon_page_insert() does not link pages with balloon and doesn't set PagePrivate flag, as a result balloon_page_dequeue() cannot get any pages because it thinks that all of them are isolated. Without balloon compaction nobody can isolate ballooned pages. It's safe to remove this check. Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management"). Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com> Reported-by: Matt Mullins <mmullins@mmlx.us> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Gavin Guo <gavin.guo@canonical.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-05-13mm/balloon_compaction: redesign ballooned pages managementKonstantin Khlebnikov3-30/+18
commit d6d86c0a7f8ddc5b38cf089222cb1d9540762dc2 upstream. Sasha Levin reported KASAN splash inside isolate_migratepages_range(). Problem is in the function __is_movable_balloon_page() which tests AS_BALLOON_MAP in page->mapping->flags. This function has no protection against anonymous pages. As result it tried to check address space flags inside struct anon_vma. Further investigation shows more problems in current implementation: * Special branch in __unmap_and_move() never works: balloon_page_movable() checks page flags and page_count. In __unmap_and_move() page is locked, reference counter is elevated, thus balloon_page_movable() always fails. As a result execution goes to the normal migration path. virtballoon_migratepage() returns MIGRATEPAGE_BALLOON_SUCCESS instead of MIGRATEPAGE_SUCCESS, move_to_new_page() thinks this is an error code and assigns newpage->mapping to NULL. Newly migrated page lose connectivity with balloon an all ability for further migration. * lru_lock erroneously required in isolate_migratepages_range() for isolation ballooned page. This function releases lru_lock periodically, this makes migration mostly impossible for some pages. * balloon_page_dequeue have a tight race with balloon_page_isolate: balloon_page_isolate could be executed in parallel with dequeue between picking page from list and locking page_lock. Race is rare because they use trylock_page() for locking. This patch fixes all of them. Instead of fake mapping with special flag this patch uses special state of page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256. Buddy allocator uses PAGE_BUDDY_MAPCOUNT_VALUE = -128 for similar purpose. Storing mark directly in struct page makes everything safer and easier. PagePrivate is used to mark pages present in page list (i.e. not isolated, like PageLRU for normal pages). It replaces special rules for reference counter and makes balloon migration similar to migration of normal pages. This flag is protected by page_lock together with link to the balloon device. [js] backport to 3.12. MIGRATEPAGE_BALLOON_SUCCESS had to be removed from one more place. VM_BUG_ON_PAGE does not exist in 3.12 yet, use plain VM_BUG_ON. Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com Cc: Rafael Aquini <aquini@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Gavin Guo <gavin.guo@canonical.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-04-20mm: fix invalid node in alloc_migrate_target()Xishi Qiu1-4/+4
commit 6f25a14a7053b69917e2ebea0d31dd444cd31fd5 upstream. It is incorrect to use next_node to find a target node, it will return MAX_NUMNODES or invalid node. This will lead to crash in buddy system allocation. Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: "Laura Abbott" <lauraa@codeaurora.org> Cc: Hui Zhu <zhuhui@xiaomi.com> Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-04-11memcg: do not hang on OOM when killed by userspace OOM access to memory reservesMichal Hocko1-1/+2
commit d8dc595ce3909fbc131bdf5ab8c9808fe624b18d upstream. Eric has reported that he can see task(s) stuck in memcg OOM handler regularly. The only way out is to echo 0 > $GROUP/memory.oom_control His usecase is: - Setup a hierarchy with memory and the freezer (disable kernel oom and have a process watch for oom). - In that memory cgroup add a process with one thread per cpu. - In one thread slowly allocate once per second I think it is 16M of ram and mlock and dirty it (just to force the pages into ram and stay there). - When oom is achieved loop: * attempt to freeze all of the tasks. * if frozen send every task SIGKILL, unfreeze, remove the directory in cgroupfs. Eric has then pinpointed the issue to be memcg specific. All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled. Those that have received fatal signal will bypass the charge and should continue on their way out. The tricky part is that the exit path might trigger a page fault (e.g. exit_robust_list), thus the memcg charge, while its memcg is still under OOM because nobody has released any charges yet. Unlike with the in-kernel OOM handler the exiting task doesn't get TIF_MEMDIE set so it doesn't shortcut further charges of the killed task and falls to the memcg OOM again without any way out of it as there are no fatal signals pending anymore. This patch fixes the issue by checking PF_EXITING early in mem_cgroup_try_charge and bypass the charge same as if it had fatal signal pending or TIF_MEMDIE set. Normally exiting tasks (aka not killed) will bypass the charge now but this should be OK as the task is leaving and will release memory and increasing the memory pressure just to release it in a moment seems dubious wasting of cycles. Besides that charges after exit_signals should be rare. I am bringing this patch again (rebased on the current mmotm tree). I hope we can move forward finally. If there is still an opposition then I would really appreciate a concurrent approach so that we can discuss alternatives. http://comments.gmane.org/gmane.linux.kernel.stable/77650 is a reference to the followup discussion when the patch has been dropped from the mmotm last time. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-25memcg: only free spare array when readers are doneMartijn Coenen1-5/+6
commit 6611d8d76132f86faa501de9451a89bf23fb2371 upstream. A spare array holding mem cgroup threshold events is kept around to make sure we can always safely deregister an event and have an array to store the new set of events in. In the scenario where we're going from 1 to 0 registered events, the pointer to the primary array containing 1 event is copied to the spare slot, and then the spare slot is freed because no events are left. However, it is freed before calling synchronize_rcu(), which means readers may still be accessing threshold->primary after it is freed. Fixed by only freeing after synchronize_rcu(). Signed-off-by: Martijn Coenen <maco@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-25mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone()Andrew Banman1-12/+19
commit 5f0f2887f4de9508dcf438deab28f1de8070c271 upstream. test_pages_in_a_zone() does not account for the possibility of missing sections in the given pfn range. pfn_valid_within always returns 1 when CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing sections to pass the test, leading to a kernel oops. Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check for missing sections before proceeding into the zone-check code. This also prevents a crash from offlining memory devices with missing sections. Despite this, it may be a good idea to keep the related patch '[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with missing sections' because missing sections in a memory block may lead to other problems not covered by the scope of this fix. Signed-off-by: Andrew Banman <abanman@sgi.com> Acked-by: Alex Thorlton <athorlton@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24mm: soft-offline: check return value in second __get_any_page() callNaoya Horiguchi1-1/+1
commit d96b339f453997f2f08c52da3f41423be48c978f upstream. I saw the following BUG_ON triggered in a testcase where a process calls madvise(MADV_SOFT_OFFLINE) on thps, along with a background process that calls migratepages command repeatedly (doing ping-pong among different NUMA nodes) for the first process: Soft offlining page 0x60000 at 0x700000600000 __get_any_page: 0x60000 free buddy page page:ffffea0001800000 count:0 mapcount:-127 mapping: (null) index:0x1 flags: 0x1fffc0000000000() page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0) ------------[ cut here ]------------ kernel BUG at /src/linux-dev/include/linux/mm.h:342! invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G O 4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff88007c63d5c0 ti: ffff88007c210000 task.ti: ffff88007c210000 RIP: 0010:[<ffffffff8118998c>] [<ffffffff8118998c>] put_page+0x5c/0x60 RSP: 0018:ffff88007c213e00 EFLAGS: 00010246 Call Trace: put_hwpoison_page+0x4e/0x80 soft_offline_page+0x501/0x520 SyS_madvise+0x6bc/0x6f0 entry_SYSCALL_64_fastpath+0x12/0x6a Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 <0f> 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47 RIP [<ffffffff8118998c>] put_page+0x5c/0x60 RSP <ffff88007c213e00> The root cause resides in get_any_page() which retries to get a refcount of the page to be soft-offlined. This function calls put_hwpoison_page(), expecting that the target page is putback to LRU list. But it can be also freed to buddy. So the second check need to care about such case. Fixes: af8fae7c0886 ("mm/memory-failure.c: clean up soft_offline_page()") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24ptrace: use fsuid, fsgid, effective creds for fs access checksJann Horn1-1/+1
commit caaee6234d05a58c5b4d05e7bf766131b810a657 upstream. By checking the effective credentials instead of the real UID / permitted capabilities, ensure that the calling process actually intended to use its credentials. To ensure that all ptrace checks use the correct caller credentials (e.g. in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS flag), use two new flags and require one of them to be set. The problem was that when a privileged task had temporarily dropped its privileges, e.g. by calling setreuid(0, user_uid), with the intent to perform following syscalls with the credentials of a user, it still passed ptrace access checks that the user would not be able to pass. While an attacker should not be able to convince the privileged task to perform a ptrace() syscall, this is a problem because the ptrace access check is reused for things in procfs. In particular, the following somewhat interesting procfs entries only rely on ptrace access checks: /proc/$pid/stat - uses the check for determining whether pointers should be visible, useful for bypassing ASLR /proc/$pid/maps - also useful for bypassing ASLR /proc/$pid/cwd - useful for gaining access to restricted directories that contain files with lax permissions, e.g. in this scenario: lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar drwx------ root root /root drwxr-xr-x root root /root/foobar -rw-r--r-- root root /root/foobar/secret Therefore, on a system where a root-owned mode 6755 binary changes its effective credentials as described and then dumps a user-specified file, this could be used by an attacker to reveal the memory layout of root's processes or reveal the contents of files he is not allowed to access (through /proc/$pid/cwd). [akpm@linux-foundation.org: fix warning] Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-12vmstat: explicitly schedule per-cpu work on the CPU we need it to run onLinus Torvalds1-1/+2
commit 176bed1de5bf977938cad26551969eca8f0883b1 upstream. The vmstat code uses "schedule_delayed_work_on()" to do the initial startup of the delayed work on the right CPU, but then once it was started it would use the non-cpu-specific "schedule_delayed_work()" to re-schedule it on that CPU. That just happened to schedule it on the same CPU historically (well, in almost all situations), but the code _requires_ this work to be per-cpu, and should say so explicitly rather than depend on the non-cpu-specific scheduling to schedule on the current CPU. The timer code is being changed to not be as single-minded in always running things on the calling CPU. See also commit 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu") that for now maintains the local CPU guarantees just in case there are other broken users that depended on the accidental behavior. js: 3.12 backport Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mike Galbraith <mgalbraith@suse.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-11-09mm: make sendfile(2) killableJan Kara1-4/+5
commit 296291cdd1629c308114504b850dc343eabc2782 upstream. Currently a simple program below issues a sendfile(2) system call which takes about 62 days to complete in my test KVM instance. int fd; off_t off = 0; fd = open("file", O_RDWR | O_TRUNC | O_SYNC | O_CREAT, 0644); ftruncate(fd, 2); lseek(fd, 0, SEEK_END); sendfile(fd, fd, &off, 0xfffffff); Now you should not ask kernel to do a stupid stuff like copying 256MB in 2-byte chunks and call fsync(2) after each chunk but if you do, sysadmin should have a way to stop you. We actually do have a check for fatal_signal_pending() in generic_perform_write() which triggers in this path however because we always succeed in writing something before the check is done, we return value > 0 from generic_perform_write() and thus the information about signal gets lost. Fix the problem by doing the signal check before writing anything. That way generic_perform_write() returns -EINTR, the error gets propagated up and the sendfile loop terminates early. Signed-off-by: Jan Kara <jack@suse.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-10-28mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1)Joonsoo Kim1-3/+10
commit 03a2d2a3eafe4015412cf4e9675ca0e2d9204074 upstream. Commit description is copied from the original post of this bug: http://comments.gmane.org/gmane.linux.kernel.mm/135349 Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next larger cache size than the size index INDEX_NODE mapping. In kernels 3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size. However, sometimes we can't get the right output we expected via kmalloc_size(INDEX_NODE + 1), causing a BUG(). The mapping table in the latest kernel is like: index = {0, 1, 2 , 3, 4, 5, 6, n} size = {0, 96, 192, 8, 16, 32, 64, 2^n} The mapping table before 3.10 is like this: index = {0 , 1 , 2, 3, 4 , 5 , 6, n} size = {32, 64, 96, 128, 192, 256, 512, 2^(n+3)} The problem on my mips64 machine is as follows: (1) When configured DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC && DEBUG_SPINLOCK, the sizeof(struct kmem_cache_node) will be "150", and the macro INDEX_NODE turns out to be "2": #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node)) (2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8. (3) Then "if(size >= kmalloc_size(INDEX_NODE + 1)" will lead to "size = PAGE_SIZE". (4) Then "if ((size >= (PAGE_SIZE >> 3))" test will be satisfied and "flags |= CFLGS_OFF_SLAB" will be covered. (5) if (flags & CFLGS_OFF_SLAB)" test will be satisfied and will go to "cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", and the result here may be NULL while kernel bootup. (6) Finally,"BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" causes the BUG info as the following shows (may be only mips64 has this problem): This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes the BUG by adding 'size >= 256' check to guarantee that all necessary small sized slabs are initialized regardless sequence of slab size in mapping table. Fixes: e33660165c90 ("slab: Use common kmalloc_index/kmalloc_size...") Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Liuhailong <liu.hailong6@zte.com.cn> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-10-28mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a faultMel Gorman1-0/+8
commit 2f84a8990ebbe235c59716896e017c6b2ca1200f upstream. SunDong reported the following on https://bugzilla.kernel.org/show_bug.cgi?id=103841 I think I find a linux bug, I have the test cases is constructed. I can stable recurring problems in fedora22(4.0.4) kernel version, arch for x86_64. I construct transparent huge page, when the parent and child process with MAP_SHARE, MAP_PRIVATE way to access the same huge page area, it has the opportunity to lead to huge page copy on write failure, and then it will munmap the child corresponding mmap area, but then the child mmap area with VM_MAYSHARE attributes, child process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags functions (vma - > vm_flags & VM_MAYSHARE). There were a number of problems with the report (e.g. it's hugetlbfs that triggers this, not transparent huge pages) but it was fundamentally correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that looks like this vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000 next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800 prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0 pgoff 0 file ffff88106bdb9800 private_data (null) flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb) ------------ kernel BUG at mm/hugetlb.c:462! SMP Modules linked in: xt_pkttype xt_LOG xt_limit [..] CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1 Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012 set_vma_resv_flags+0x2d/0x30 The VM_BUG_ON is correct because private and shared mappings have different reservation accounting but the warning clearly shows that the VMA is shared. When a private COW fails to allocate a new page then only the process that created the VMA gets the page -- all the children unmap the page. If the children access that data in the future then they get killed. The problem is that the same file is mapped shared and private. During the COW, the allocation fails, the VMAs are traversed to unmap the other private pages but a shared VMA is found and the bug is triggered. This patch identifies such VMAs and skips them. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: SunDong <sund_sky@126.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-09-30vmscan: fix increasing nr_isolated incurred by putback unevictable pagesJaewon Kim1-1/+1
commit c54839a722a02818677bcabe57e957f0ce4f841d upstream. reclaim_clean_pages_from_list() assumes that shrink_page_list() returns number of pages removed from the candidate list. But shrink_page_list() puts back mlocked pages without passing it to caller and without counting as nr_reclaimed. This increases nr_isolated. To fix this, this patch changes shrink_page_list() to pass unevictable pages back to caller. Caller will take care those pages. Minchan said: It fixes two issues. 1. With unevictable page, cma_alloc will be successful. Exactly speaking, cma_alloc of current kernel will fail due to unevictable pages. 2. fix leaking of NR_ISOLATED counter of vmstat With it, too_many_isolated works. Otherwise, it could make hang until the process get SIGKILL. Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-08-25mm/hwpoison: fix page refcount of unknown non LRU pageWanpeng Li1-0/+2
commit 4f32be677b124a49459e2603321c7a5605ceb9f8 upstream. After trying to drain pages from pagevec/pageset, we try to get reference count of the page again, however, the reference count of the page is not reduced if the page is still not on LRU list. Fix it by adding the put_page() to drop the page reference which is from __get_any_page(). Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-08-25mm, vmscan: Do not wait for page writeback for GFP_NOFS allocationsMichal Hocko1-9/+5
commit ecf5fc6e9654cd7a268c782a523f072b2f1959f9 upstream. Nikolay has reported a hang when a memcg reclaim got stuck with the following backtrace: PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync" #0 __schedule at ffffffff815ab152 #1 schedule at ffffffff815ab76e #2 schedule_timeout at ffffffff815ae5e5 #3 io_schedule_timeout at ffffffff815aad6a #4 bit_wait_io at ffffffff815abfc6 #5 __wait_on_bit at ffffffff815abda5 #6 wait_on_page_bit at ffffffff8111fd4f #7 shrink_page_list at ffffffff81135445 #8 shrink_inactive_list at ffffffff81135845 #9 shrink_lruvec at ffffffff81135ead #10 shrink_zone at ffffffff811360c3 #11 shrink_zones at ffffffff81136eff #12 do_try_to_free_pages at ffffffff8113712f #13 try_to_free_mem_cgroup_pages at ffffffff811372be #14 try_charge at ffffffff81189423 #15 mem_cgroup_try_charge at ffffffff8118c6f5 #16 __add_to_page_cache_locked at ffffffff8112137d #17 add_to_page_cache_lru at ffffffff81121618 #18 pagecache_get_page at ffffffff8112170b #19 grow_dev_page at ffffffff811c8297 #20 __getblk_slow at ffffffff811c91d6 #21 __getblk_gfp at ffffffff811c92c1 #22 ext4_ext_grow_indepth at ffffffff8124565c #23 ext4_ext_create_new_leaf at ffffffff81246ca8 #24 ext4_ext_insert_extent at ffffffff81246f09 #25 ext4_ext_map_blocks at ffffffff8124a848 #26 ext4_map_blocks at ffffffff8121a5b7 #27 mpage_map_one_extent at ffffffff8121b1fa #28 mpage_map_and_submit_extent at ffffffff8121f07b #29 ext4_writepages at ffffffff8121f6d5 #30 do_writepages at ffffffff8112c490 #31 __filemap_fdatawrite_range at ffffffff81120199 #32 filemap_flush at ffffffff8112041c #33 ext4_alloc_da_blocks at ffffffff81219da1 #34 ext4_rename at ffffffff81229b91 #35 ext4_rename2 at ffffffff81229e32 #36 vfs_rename at ffffffff811a08a5 #37 SYSC_renameat2 at ffffffff811a3ffc #38 sys_renameat2 at ffffffff811a408e #39 sys_rename at ffffffff8119e51e #40 system_call_fastpath at ffffffff815afa89 Dave Chinner has properly pointed out that this is a deadlock in the reclaim code because ext4 doesn't submit pages which are marked by PG_writeback right away. The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM with too many dirty pages") and it was applied only when may_enter_fs was specified. The code has been changed by c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages") which has removed the __GFP_FS restriction with a reasoning that we do not get into the fs code. But this is not sufficient apparently because the fs doesn't necessarily submit pages marked PG_writeback for IO right away. ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily submit the bio. Instead it tries to map more pages into the bio and mpage_map_one_extent might trigger memcg charge which might end up waiting on a page which is marked PG_writeback but hasn't been submitted yet so we would end up waiting for something that never finishes. Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2) before we go to wait on the writeback. The page fault path, which is the only path that triggers memcg oom killer since 3.12, shouldn't require GFP_NOFS and so we shouldn't reintroduce the premature OOM killer issue which was originally addressed by the heuristic. As per David Chinner the xfs is doing similar thing since 2.6.15 already so ext4 is not the only affected filesystem. Moreover he notes: : For example: IO completion might require unwritten extent conversion : which executes filesystem transactions and GFP_NOFS allocations. The : writeback flag on the pages can not be cleared until unwritten : extent conversion completes. Hence memory reclaim cannot wait on : page writeback to complete in GFP_NOFS context because it is not : safe to do so, memcg reclaim or otherwise. [tytso@mit.edu: corrected the control flow] Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages") Reported-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-08-19mm: avoid setting up anonymous pages into file mappingKirill A. Shutemov1-4/+9
commit 6b7339f4c31ad69c8e9c0b2859276e22cf72176d upstream. Reading page fault handler code I've noticed that under right circumstances kernel would map anonymous pages into file mappings: if the VMA doesn't have vm_ops->fault() and the VMA wasn't fully populated on ->mmap(), kernel would handle page fault to not populated pte with do_anonymous_page(). Let's change page fault handler to use do_anonymous_page() only on anonymous VMA (->vm_ops == NULL) and make sure that the VMA is not shared. For file mappings without vm_ops->fault() or shred VMA without vm_ops, page fault on pte_none() entry would lead to SIGBUS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-06-23mm/memory_hotplug.c: set zone->wait_table to null after freeing itGu Zheng1-1/+3
commit 85bd839983778fcd0c1c043327b14a046e979b39 upstream. Izumi found the following oops when hot re-adding a node: BUG: unable to handle kernel paging request at ffffc90008963690 IP: __wake_up_bit+0x20/0x70 Oops: 0000 [#1] SMP CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80 Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015 task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000 RIP: 0010:[<ffffffff810dff80>] [<ffffffff810dff80>] __wake_up_bit+0x20/0x70 RSP: 0018:ffff880017b97be8 EFLAGS: 00010246 RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9 RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648 RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800 R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000 FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0 Call Trace: unlock_page+0x6d/0x70 generic_write_end+0x53/0xb0 xfs_vm_write_end+0x29/0x80 [xfs] generic_perform_write+0x10a/0x1e0 xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs] xfs_file_write_iter+0x79/0x120 [xfs] __vfs_write+0xd4/0x110 vfs_write+0xac/0x1c0 SyS_write+0x58/0xd0 system_call_fastpath+0x12/0x76 Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48 RIP [<ffffffff810dff80>] __wake_up_bit+0x20/0x70 RSP <ffff880017b97be8> CR2: ffffc90008963690 Reproduce method (re-add a node):: Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic) This seems an use-after-free problem, and the root cause is zone->wait_table was not set to *NULL* after free it in try_offline_node. When hot re-add a node, we will reuse the pgdat of it, so does the zone struct, and when add pages to the target zone, it will init the zone first (including the wait_table) if the zone is not initialized. The judgement of zone initialized is based on zone->wait_table: static inline bool zone_is_initialized(struct zone *zone) { return !!zone->wait_table; } so if we do not set the zone->wait_table to *NULL* after free it, the memory hotplug routine will skip the init of new zone when hot re-add the node, and the wait_table still points to the freed memory, then we will access the invalid address when trying to wake up the waiting people after the i/o operation with the page is done, such as mentioned above. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reported-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Reviewed by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-06-03mm, numa: really disable NUMA balancing by default on single node machinesMel Gorman1-1/+1
commit b0dc2b9bb4ab782115b964310518ee0b17784277 upstream. NUMA balancing is meant to be disabled by default on UMA machines but the check is using nr_node_ids (highest node) instead of num_online_nodes (online nodes). The consequences are that a UMA machine with a node ID of 1 or higher will enable NUMA balancing. This will incur useless overhead due to minor faults with the impact depending on the workload. These are the impact on the stats when running a kernel build on a single node machine whose node ID happened to be 1: vanilla patched NUMA base PTE updates 5113158 0 NUMA huge PMD updates 643 0 NUMA page range updates 5442374 0 NUMA hint faults 2109622 0 NUMA hint local faults 2109622 0 NUMA hint local percent 100 100 NUMA pages migrated 0 0 Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-06-03mm: numa: initialise numa balancing after jump label initialisationMel Gorman1-6/+10
commit c297663c0b3930491a3cb2aba4b6e5a7159c3503 upstream. The command line parsing takes place before jump labels are initialised which generates a warning if numa_balancing= is specified and CONFIG_JUMP_LABEL is set. On older kernels before commit c4b2c0c5f647 ("static_key: WARN on usage before jump_label_init was called") the kernel would have crashed. This patch enables automatic numa balancing later in the initialisation process if numa_balancing= is specified. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-05-16mm: soft-offline: fix num_poisoned_pages counting on concurrent eventsNaoya Horiguchi1-4/+4
commit 602498f9aa43d4951eece3fd6ad95a6d0a78d537 upstream. If multiple soft offline events hit one free page/hugepage concurrently, soft_offline_page() can handle the free page/hugepage multiple times, which makes num_poisoned_pages counter increased more than once. This patch fixes this wrong counting by checking TestSetPageHWPoison for normal papes and by checking the return value of dequeue_hwpoisoned_huge_page() for hugepages. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Dean Nelson <dnelson@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-05-16writeback: use |1 instead of +1 to protect against div by zeroTejun Heo1-3/+3
commit 464d1387acb94dc43ba772b35242345e3d2ead1b upstream. mm/page-writeback.c has several places where 1 is added to the divisor to prevent division by zero exceptions; however, if the original divisor is equivalent to -1, adding 1 leads to division by zero. There are three places where +1 is used for this purpose - one in pos_ratio_polynom() and two in bdi_position_ratio(). The second one in bdi_position_ratio() actually triggered div-by-zero oops on a machine running a 3.10 kernel. The divisor is x_intercept - bdi_setpoint + 1 == span + 1 span is confirmed to be (u32)-1. It isn't clear how it ended up that but it could be from write bandwidth calculation underflow fixed by c72efb658f7c ("writeback: fix possible underflow in write bandwidth calculation"). At any rate, +1 isn't a proper protection against div-by-zero. This patch converts all +1 protections to |1. Note that bdi_update_dirty_ratelimit() was already using |1 before this patch. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-05-16mm/memory-failure: call shake_page() when error hits thp tail pageNaoya Horiguchi1-4/+4
commit 09789e5de18e4e442870b2d700831f5cb802eb05 upstream. Currently memory_failure() calls shake_page() to sweep pages out from pcplists only when the victim page is 4kB LRU page or thp head page. But we should do this for a thp tail page too. Consider that a memory error hits a thp tail page whose head page is on a pcplist when memory_failure() runs. Then, the current kernel skips shake_pages() part, so hwpoison_user_mappings() returns without calling split_huge_page() nor try_to_unmap() because PageLRU of the thp head is still cleared due to the skip of shake_page(). As a result, me_huge_page() runs for the thp, which is broken behavior. One effect is a leak of the thp. And another is to fail to isolate the memory error, so later access to the error address causes another MCE, which kills the processes which used the thp. This patch fixes this problem by calling shake_page() for thp tail case. Fixes: 385de35722c9 ("thp: allow a hwpoisoned head page to be put back to LRU") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Dean Nelson <dnelson@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-05-15mm: fix anon_vma->degree underflow in anon_vma endless growing preventionLeon Yu2-3/+8
commit 3fe89b3e2a7bbf3e97657104b9b33a9d81b950b3 upstream. I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after upgrading to 3.19 and had no luck with 4.0-rc1 neither. So, after looking into new logic introduced by commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy"), I found chances are that unlink_anon_vmas() is called without incrementing dst->anon_vma->degree in anon_vma_clone() due to allocation failure. If dst->anon_vma is not NULL in error path, its degree will be incorrectly decremented in unlink_anon_vmas() and eventually underflow when exiting as a result of another call to unlink_anon_vmas(). That's how "kernel BUG at mm/rmap.c:399!" is triggered for me. This patch fixes the underflow by dropping dst->anon_vma when allocation fails. It's safe to do so regardless of original value of dst->anon_vma because dst->anon_vma doesn't have valid meaning if anon_vma_clone() fails. Besides, callers don't care dst->anon_vma in such case neither. Also suggested by Michal Hocko, we can clean up vma_adjust() a bit as anon_vma_clone() now does the work. [akpm@linux-foundation.org: tweak comment] Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Signed-off-by: Leon Yu <chianglungyu@gmail.com> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-05-15mm: fix corner case in anon_vma endless growing preventionKonstantin Khlebnikov1-2/+4
commit b800c91a0517071156e772d4fb329ad33590da62 upstream. Fix for BUG_ON(anon_vma->degree) splashes in unlink_anon_vmas() ("kernel BUG at mm/rmap.c:399!") caused by commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Anon_vma_clone() is usually called for a copy of source vma in destination argument. If source vma has anon_vma it should be already in dst->anon_vma. NULL in dst->anon_vma is used as a sign that it's called from anon_vma_fork(). In this case anon_vma_clone() finds anon_vma for reusing. Vma_adjust() calls it differently and this breaks anon_vma reusing logic: anon_vma_clone() links vma to old anon_vma and updates degree counters but vma_adjust() overrides vma->anon_vma right after that. As a result final unlink_anon_vmas() decrements degree for wrong anon_vma. This patch assigns ->anon_vma before calling anon_vma_clone(). Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-and-tested-by: Chris Clayton <chris2553@googlemail.com> Reported-and-tested-by: Oded Gabbay <oded.gabbay@amd.com> Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Daniel Forrest <dan.forrest@ssec.wisc.edu> Cc: Michal Hocko <mhocko@suse.cz> Cc: stable@vger.kernel.org # to match back-porting of 7a3ef208e662 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-05-15mm: prevent endless growth of anon_vma hierarchyKonstantin Khlebnikov1-1/+41
commit 7a3ef208e662f4b63d43a23f61a64a129c525bbc upstream. Constantly forking task causes unlimited grow of anon_vma chain. Each next child allocates new level of anon_vmas and links vma to all previous levels because pages might be inherited from any level. This patch adds heuristic which decides to reuse existing anon_vma instead of forking new one. It adds counter anon_vma->degree which counts linked vmas and directly descending anon_vmas and reuses anon_vma if counter is lower than two. As a result each anon_vma has either vma or at least two descending anon_vmas. In such trees half of nodes are leafs with alive vmas, thus count of anon_vmas is no more than two times bigger than count of vmas. This heuristic reuses anon_vmas as few as possible because each reuse adds false aliasing among vmas and rmap walker ought to scan more ptes when it searches where page is might be mapped. Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu Fixes: 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue") [akpm@linux-foundation.org: fix typo, per Rik] Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu> Tested-by: Michal Hocko <mhocko@suse.cz> Tested-by: Jerome Marchand <jmarchan@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.34+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-30mm/hugetlb: use pmd_page() in follow_huge_pmd()Gerald Schaefer1-2/+1
commit 97534127012f0e396eddea4691f4c9b170aed74b upstream. Commit 61f77eda9bbf ("mm/hugetlb: reduce arch dependent code around follow_huge_*") broke follow_huge_pmd() on s390, where pmd and pte layout differ and using pte_page() on a huge pmd will return wrong results. Using pmd_page() instead fixes this. All architectures that were touched by that commit have pmd_page() defined, so this should not break anything on other architectures. Fixes: 61f77eda "mm/hugetlb: reduce arch dependent code around follow_huge_*" Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.cz>, Andrea Arcangeli <aarcange@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-27nick kvfree() from apparmorAl Viro1-0/+10
commit 39f1f78d53b9bcbca91967380c5f0f2305a5c55f upstream. too many places open-code it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-22writeback: fix possible underflow in write bandwidth calculationTejun Heo1-1/+4
commit c72efb658f7c8b27ca3d0efb5cfd5ded9fcac89e upstream. From 1ebf33901ecc75d9496862dceb1ef0377980587c Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Mon, 23 Mar 2015 00:08:19 -0400 2f800fbd777b ("writeback: fix dirtied pages accounting on redirty") introduced account_page_redirty() which reverts stat updates for a redirtied page, making BDI_DIRTIED no longer monotonically increasing. bdi_update_write_bandwidth() uses the delta in BDI_DIRTIED as the basis for bandwidth calculation. While unlikely, since the above patch, the newer value may be lower than the recorded past value and underflow the bandwidth calculation leading to a wild result. Fix it by subtracing min of the old and new values when calculating delta. AFAIK, there hasn't been any report of it happening but the resulting erratic behavior would be non-critical and temporary, so it's possible that the issue is happening without being reported. The risk of the fix is very low, so tagged for -stable. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Fixes: 2f800fbd777b ("writeback: fix dirtied pages accounting on redirty") Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-22writeback: add missing INITIAL_JIFFIES init in global_update_bandwidth()Tejun Heo1-1/+1
commit 7d70e15480c0450d2bfafaad338a32e884fc215e upstream. global_update_bandwidth() uses static variable update_time as the timestamp for the last update but forgets to initialize it to INITIALIZE_JIFFIES. This means that global_dirty_limit will be 5 mins into the future on 32bit and some large amount jiffies into the past on 64bit. This isn't critical as the only effect is that global_dirty_limit won't be updated for the first 5 mins after booting on 32bit machines, especially given the auxiliary nature of global_dirty_limit's role - protecting against global dirty threshold's sudden dips; however, it does lead to unintended suboptimal behavior. Fix it. Fixes: c42843f2f0bb ("writeback: introduce smoothed global dirty limit") Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-22mm/memory hotplug: postpone the reset of obsolete pgdatGu Zheng1-9/+4
commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08 upstream. Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under stress condition: BUG: unable to handle kernel paging request at 0000000000025f60 IP: next_online_pgdat+0x1/0x50 PGD 0 Oops: 0000 [#1] SMP ACPI: Device does not support D3cold Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf] CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1 Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015 Workqueue: events vmstat_update task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000 RIP: 0010: next_online_pgdat+0x1/0x50 RSP: 0018:ffffa800d32afce8 EFLAGS: 00010286 RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082 RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000 RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96 R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440 R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800 FS: 0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: refresh_cpu_vm_stats+0xd0/0x140 vmstat_update+0x11/0x50 process_one_work+0x194/0x3d0 worker_thread+0x12b/0x410 kthread+0xc6/0xd0 ret_from_fork+0x7c/0xb0 The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of try_offline_node, which will reset all the content of pgdat to 0, as the pgdat is accessed lock-free, so that the users still using the pgdat will panic, such as the vmstat_update routine. process A: offline node XX: vmstat_updat() refresh_cpu_vm_stats() for_each_populated_zone() find online node XX cond_resched() offline cpu and memory, then try_offline_node() node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat)) zone = next_zone(zone) pg_data_t *pgdat = zone->zone_pgdat; // here pgdat is NULL now next_online_pgdat(pgdat) next_online_node(pgdat->node_id); // NULL pointer access So the solution here is postponing the reset of obsolete pgdat from try_offline_node() to hotadd_new_pgdat(), and just resetting pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset 0 to avoid breaking pointer information in pgdat. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reported-by: Xishi Qiu <qiuxishi@huawei.com> Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-09mm: thp: give transparent hugepage code a separate copy_pageDave Hansen2-34/+48
commit 30b0a105d9f7141e4cbf72ae5511832457d89788 upstream. Right now, the migration code in migrate_page_copy() uses copy_huge_page() for hugetlbfs and thp pages: if (PageHuge(page) || PageTransHuge(page)) copy_huge_page(newpage, page); So, yay for code reuse. But: void copy_huge_page(struct page *dst, struct page *src) { struct hstate *h = page_hstate(src); and a non-hugetlbfs page has no page_hstate(). This works 99% of the time because page_hstate() determines the hstate from the page order alone. Since the page order of a THP page matches the default hugetlbfs page order, it works. But, if you change the default huge page size on the boot command-line (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate so page_hstate() returns null and copy_huge_page() oopses pretty fast since copy_huge_page() dereferences the hstate: void copy_huge_page(struct page *dst, struct page *src) { struct hstate *h = page_hstate(src); if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) { ... Mel noticed that the migration code is really the only user of these functions. This moves all the copy code over to migrate.c and makes copy_huge_page() work for THP by checking for it explicitly. I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add THP migration for the NUMA working set scanning fault case") [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi] Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-09mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()Naoya Horiguchi1-15/+36
commit 0f792cf949a0be506c2aa8bfac0605746b146dda upstream. When running the test which causes the race as shown in the previous patch, we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault(). This race happens when pte turns into migration entry just after the first check of is_hugetlb_entry_migration() in hugetlb_fault() passed with false. To fix this, we need to check pte_present() again after huge_ptep_get(). This patch also reorders taking ptl and doing pte_page(), because pte_page() should be done in ptl. Due to this reordering, we need use trylock_page() in page != pagecache_page case to respect locking order. Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: <stable@vger.kernel.org> [3.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz> [backport to 3.12]