path: root/mm/migrate.c
Age  Commit message  Author  Files  Lines
2023-09-30  mm: hugetlb: add huge page size param to set_huge_pte_at()  Ryan Roberts  (1 file, -2/+5)
Patch series "Fix set_huge_pte_at() panic on arm64", v2. This series fixes a bug in arm64's implementation of set_huge_pte_at(), which can result in an unprivileged user causing a kernel panic. The problem was triggered when running the new uffd poison mm selftest for HUGETLB memory. This test (and the uffd poison feature) was merged for v6.5-rc7. Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable (correctly this time) to get it backported to v6.5, where the issue first showed up. Description of Bug ================== arm64's huge pte implementation supports multiple huge page sizes, some of which are implemented in the page table with multiple contiguous entries. So set_huge_pte_at() needs to work out how big the logical pte is, so that it can also work out how many physical ptes (or pmds) need to be written. It previously did this by grabbing the folio out of the pte and querying its size. However, there are cases when the pte being set is actually a swap entry. But this also used to work fine, because for huge ptes, we only ever saw migration entries and hwpoison entries. And both of these types of swap entries have a PFN embedded, so the code would grab that and everything still worked out. But over time, more calls to set_huge_pte_at() have been added that set swap entry types that do not embed a PFN. And this causes the code to go bang. The triggering case is for the uffd poison test, commit 99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") - added in v6.5-rc7. Although review shows that there are other call sites that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger on arm64 because arm64 doesn't support UFFD WP. If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise, it will dereference a bad pointer in page_folio(): static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry) { VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry)); return page_folio(pfn_to_page(swp_offset_pfn(entry))); } Fix === The simplest fix would have been to revert the dodgy cleanup commit 18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since things have moved on, this would have required an audit of all the new set_huge_pte_at() call sites to see if they should be converted to set_huge_swap_pte_at(). As per the original intent of the change, it would also leave us open to future bugs when people invariably get it wrong and call the wrong helper. So instead, I've added a huge page size parameter to set_huge_pte_at(). This means that the arm64 code has the size in all cases. It's a bigger change, due to needing to touch the arches that implement the function, but it is entirely mechanical, so in my view, low risk. I've compile-tested all touched arches; arm64, parisc, powerpc, riscv, s390, sparc (and additionally x86_64). I've additionally booted and run mm selftests against arm64, where I observe the uffd poison test is fixed, and there are no other regressions. This patch (of 2): In order to fix a bug, arm64 needs to be told the size of the huge page for which the pte is being set in set_huge_pte_at(). Provide for this by adding an `unsigned long sz` parameter to the function. This follows the same pattern as huge_pte_clear(). 
This commit makes the required interface modifications to the core mm as well as all arches that implement this function (arm64, parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed in a separate commit. No behavioral changes intended. Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> [vmalloc change] Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
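To illustrate the interface change described above, here is a minimal sketch of the before/after prototypes and of how an arm64-style implementation can derive the number of entries to write from the new size argument. The num_contig_ptes_for() helper is a hypothetical stand-in for the real arm64 helpers, and the body is reduced to the contiguous-PTE case.

----
/* Before: the size had to be inferred from the pte (folio or swap PFN):
 *
 *   void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 *                        pte_t *ptep, pte_t pte);
 *
 * After: callers pass the huge page size explicitly, following the same
 * pattern as huge_pte_clear():
 */
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
		     pte_t *ptep, pte_t pte, unsigned long sz);

/* Sketch of an arm64-style implementation using sz (simplified): */
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
		     pte_t *ptep, pte_t pte, unsigned long sz)
{
	int i, ncontig = num_contig_ptes_for(sz);	/* hypothetical helper */
	unsigned long pgsize = sz / ncontig;

	/* Works even when pte is a PFN-less swap entry such as
	 * PTE_MARKER_POISONED, because the size no longer comes from it. */
	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
		set_pte_at(mm, addr, ptep, pte);
}
----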
2023-08-31  Merge tag 'x86_shstk_for_6.6-rc1' of ↵  Linus Torvalds  (1 file, -1/+1)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 shadow stack support from Dave Hansen: "This is the long awaited x86 shadow stack support, part of Intel's Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace. The main use case for shadow stack is providing protection against return oriented programming attacks. It works by maintaining a secondary (shadow) stack using a special memory type that has protections against modification. When executing a CALL instruction, the processor pushes the return address to both the normal stack and to the special permission shadow stack. Upon RET, the processor pops the shadow stack copy and compares it to the normal stack copy. For more information, refer to the links below for the earlier versions of this patch set" Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/ Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/ * tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) x86/shstk: Change order of __user in type x86/ibt: Convert IBT selftest to asm x86/shstk: Don't retry vm_munmap() on -EINTR x86/kbuild: Fix Documentation/ reference x86/shstk: Move arch detail comment out of core mm x86/shstk: Add ARCH_SHSTK_STATUS x86/shstk: Add ARCH_SHSTK_UNLOCK x86: Add PTRACE interface for shadow stack selftests/x86: Add shadow stack test x86/cpufeatures: Enable CET CR4 bit for shadow stack x86/shstk: Wire in shadow stack interface x86: Expose thread features in /proc/$PID/status x86/shstk: Support WRSS for userspace x86/shstk: Introduce map_shadow_stack syscall x86/shstk: Check that signal frame is shadow stack mem x86/shstk: Check that SSP is aligned on sigreturn x86/shstk: Handle signals for shadow stack x86/shstk: Introduce routines modifying shstk x86/shstk: Handle thread shadow stack x86/shstk: Add user-mode shadow stack support ...
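For context on the userspace-facing piece listed above ("x86/shstk: Introduce map_shadow_stack syscall"), a rough usage sketch follows. The syscall number and the SHADOW_STACK_SET_TOKEN flag value are quoted from memory and should be treated as assumptions, not a definitive ABI reference.

----
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 453	/* assumption: x86_64 syscall number */
#endif
#define SHADOW_STACK_SET_TOKEN 0x1	/* assumption: place a restore token at the top */

int main(void)
{
	/* Ask the kernel for a 64 KiB shadow stack mapping. */
	long ssp = syscall(__NR_map_shadow_stack, 0UL, 64UL * 1024,
			   SHADOW_STACK_SET_TOKEN);
	if (ssp < 0)
		perror("map_shadow_stack");
	else
		printf("shadow stack mapped at %#lx\n", ssp);
	return 0;
}
----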
2023-08-30  Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux  Linus Torvalds  (1 file, -2/+2)
Pull block updates from Jens Axboe: "Pretty quiet round for this release. This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...
2023-08-18  migrate: use folio_set_bh() instead of set_bh_page()  Matthew Wilcox (Oracle)  (1 file, -1/+1)
This function was converted before folio_set_bh() existed. Catch up to the new API. Link: https://lkml.kernel.org/r/20230713035512.4139457-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Tom Rix <trix@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
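As a minimal sketch of the API catch-up described above (the surrounding loop in __buffer_migrate_folio() is abridged):

----
	struct buffer_head *bh = head;

	do {
		/* was: set_bh_page(bh, &dst->page, bh_offset(bh)); */
		folio_set_bh(bh, dst, bh_offset(bh));
		bh = bh->b_this_page;
	} while (bh != head);
----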
2023-08-18  mm: merge folio_has_private()/filemap_release_folio() call pairs  David Howells  (1 file, -2/+1)
Patch series "mm, netfs, fscache: Stop read optimisation when folio removed from pagecache", v7. This fixes an optimisation in fscache whereby we don't read from the cache for a particular file until we know that there's data there that we don't have in the pagecache. The problem is that I'm no longer using PG_fscache (aka PG_private_2) to indicate that the page is cached and so I don't get a notification when a cached page is dropped from the pagecache. The first patch merges some folio_has_private() and filemap_release_folio() pairs and introduces a helper, folio_needs_release(), to indicate if a release is required. The second patch is the actual fix. Following Willy's suggestions[1], it adds an AS_RELEASE_ALWAYS flag to an address_space that will make filemap_release_folio() always call ->release_folio(), even if PG_private/PG_private_2 aren't set. folio_needs_release() is altered to add a check for this. This patch (of 2): Make filemap_release_folio() check folio_has_private(). Then, in most cases, where a call to folio_has_private() is immediately followed by a call to filemap_release_folio(), we can get rid of the test in the pair. There are a couple of sites in mm/vscan.c that this can't so easily be done. In shrink_folio_list(), there are actually three cases (something different is done for incompletely invalidated buffers), but filemap_release_folio() elides two of them. In shrink_active_list(), we don't have have the folio lock yet, so the check allows us to avoid locking the page unnecessarily. A wrapper function to check if a folio needs release is provided for those places that still need to do it in the mm/ directory. This will acquire additional parts to the condition in a future patch. After this, the only remaining caller of folio_has_private() outside of mm/ is a check in fuse. Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Xiubo Li <xiubli@redhat.com> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-02  fs: add CONFIG_BUFFER_HEAD  Christoph Hellwig  (1 file, -2/+2)
Add a new config option that controls building the buffer_head code, and select it from all file systems and stacking drivers that need it. For the block device nodes, an alternative iomap based buffered I/O path is provided when buffer_head support is not enabled, and iomap needs a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 so that it does not call into the buffer_head code when it doesn't exist. Otherwise this is just Kconfig and ifdef changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
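The iomap tweak mentioned above amounts to something like the following sketch; the bit value under CONFIG_BUFFER_HEAD is illustrative.

----
/* include/linux/iomap.h, roughly: when buffer_head support is not built,
 * the flag is 0 and every "flags & IOMAP_F_BUFFER_HEAD" branch becomes
 * dead code, so no buffer_head symbols are referenced. */
#ifdef CONFIG_BUFFER_HEAD
#define IOMAP_F_BUFFER_HEAD	(1U << 31)	/* illustrative bit value */
#else
#define IOMAP_F_BUFFER_HEAD	0
#endif
----

File systems and stacking drivers that still need buffer heads then pull the code in through their Kconfig entries, as the commit message says, rather than it being built unconditionally.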
2023-07-12  mm: Make pte_mkwrite() take a VMA  Rick Edgecombe  (1 file, -1/+1)
The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable or shadow stack mappings. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Previous work renamed pte_mkwrite() to pte_mkwrite_novma() and converted the callers that don't have a VMA to use pte_mkwrite_novma(). So now change pte_mkwrite() to take a VMA and change the remaining callers to pass a VMA. Apply the same changes for pmd_mkwrite(). No functional change. Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/all/20230613001108.3040476-4-rick.p.edgecombe%40intel.com
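A condensed sketch of the end state described above; pte_mkwrite_shstk() is the x86 helper the shadow stack series adds, quoted here from memory.

----
/* Generic fallback (include/linux/pgtable.h, roughly): */
#ifndef pte_mkwrite
static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	return pte_mkwrite_novma(pte);
}
#endif

/* The x86 override can now pick the shadow-stack PTE encoding per VMA
 * (sketch of arch/x86 code, not the literal implementation): */
pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_SHADOW_STACK)
		return pte_mkwrite_shstk(pte);	/* limited-write encoding */

	return pte_mkwrite_novma(pte);
}
----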
2023-06-24  mm: remove unnecessary pagevec includes  Matthew Wilcox (Oracle)  (1 file, -1/+0)
These files no longer need pagevec.h, mostly due to function declarations being moved out of it. Link: https://lkml.kernel.org/r/20230621164557.3510324-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-24  mm: fix shmem THP counters on migration  Jan Glauber  (1 file, -0/+5)
The per node numa_stat values for shmem don't change on page migration for THP:

    grep shmem /sys/fs/cgroup/machine.slice/.../memory.numa_stat:

        shmem N0=1092616192 N1=10485760
        shmem_thp N0=1092616192 N1=10485760

    migratepages 9181 0 1:

        shmem N0=0 N1=1103101952
        shmem_thp N0=1092616192 N1=10485760

Fix that by updating shmem_thp counters likewise to shmem counters on page migration. [jglauber@digitalocean.com: use folio_test_pmd_mappable instead of folio_test_transhuge] Link: https://lkml.kernel.org/r/20230622094720.510540-1-jglauber@digitalocean.com Link: https://lkml.kernel.org/r/20230619103351.234837-1-jglauber@digitalocean.com Signed-off-by: Jan Glauber <jglauber@digitalocean.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
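The fix amounts to mirroring the existing NR_SHMEM accounting for the THP counter in folio_migrate_mapping(); a sketch of the hunk (variable names abridged from the real function):

----
	if (folio_test_swapbacked(folio) && !folio_test_swapcache(folio)) {
		__mod_lruvec_state(old_lruvec, NR_SHMEM, -nr);
		__mod_lruvec_state(new_lruvec, NR_SHMEM, nr);

		/* new: keep the per-node shmem_thp counter in sync too */
		if (folio_test_pmd_mappable(folio)) {
			__mod_lruvec_state(old_lruvec, NR_SHMEM_THPS, -nr);
			__mod_lruvec_state(new_lruvec, NR_SHMEM_THPS, nr);
		}
	}
----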
2023-06-20  mm: ptep_get() conversion  Ryan Roberts  (1 file, -6/+8)
Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t *v; @@ - *v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. 
Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
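In C terms, the transformation above (plus the hand-edit to avoid repeated READ_ONCE()s) looks roughly like this:

----
	/* Before: multiple direct dereferences of the pte pointer:
	 *
	 *	if (pte_present(*ptep) && pte_dirty(*ptep))
	 *		folio_mark_dirty(folio);
	 *
	 * After: one ptep_get() (a READ_ONCE() by default), result cached
	 * in a local so the pte is read exactly once: */
	pte_t pte = ptep_get(ptep);

	if (pte_present(pte) && pte_dirty(pte))
		folio_mark_dirty(folio);
----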
2023-06-20  mm/various: give up if pte_offset_map[_lock]() fails  Hugh Dickins  (1 file, -0/+3)
Following the examples of nearby code, various functions can just give up if pte_offset_map() or pte_offset_map_lock() fails. And there's no need for a preliminary pmd_trans_unstable() or other such check, since such cases are now safely handled inside. Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
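The resulting caller pattern, sketched for a generic page-table walker (error handling simplified):

----
	pte_t *pte;
	spinlock_t *ptl;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return false;	/* e.g. racing THP collapse: just give up */

	/* ... examine or modify the pte under ptl ... */

	pte_unmap_unlock(pte, ptl);
	return true;
----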
2023-06-20  mm/migrate: remove cruft from migration_entry_wait()s  Hugh Dickins  (1 file, -24/+13)
migration_entry_wait_on_locked() does not need to take a mapped pte pointer, its callers can do the unmap first. Annotate it with __releases(ptl) to reduce sparse warnings. Fold __migration_entry_wait_huge() into migration_entry_wait_huge(). Fold __migration_entry_wait() into migration_entry_wait(), preferring the tighter pte_offset_map_lock() to pte_offset_map() and pte_lockptr(). Link: https://lkml.kernel.org/r/b0e2a532-cdf2-561b-e999-f3b13b8d6d3@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10  mm: convert migrate_pages() to work on folios  Matthew Wilcox (Oracle)  (1 file, -86/+75)
Almost all of the callers & implementors of migrate_pages() were already converted to use folios. compaction_alloc() & compaction_free() are trivial to convert as part of this patch and not worth splitting out. Link: https://lkml.kernel.org/r/20230513001101.276972-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10  migrate_pages_batch: simplify retrying and failure counting of large folios  Huang Ying  (1 file, -76/+36)
After recent changes to the retrying and failure counting in migrate_pages_batch(), it was found that it's unnecessary to count retrying and failure for normal, large, and THP folios separately, because we don't use the retrying and failure numbers of large folios directly. So, in this patch, we simplify the retrying and failure counting of large folios by counting retrying and failure of normal and large folios together. This reduces the line count. Previously, in migrate_pages_batch we needed to track whether the source folio is large/THP before splitting, so is_large was used to cache the folio_test_large() result. Now, we don't need that variable any more because we don't count retrying and failure of large folios (only counting that of THP folios). So, in this patch, is_large is removed to simplify the code. This is just code cleanup, no functionality changes are expected. Link: https://lkml.kernel.org/r/20230510031829.11513-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10  migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHT  Douglas Anderson  (1 file, -23/+26)
The MIGRATE_SYNC_LIGHT mode is intended to block for things that will finish quickly but not for things that will take a long time. Exactly how long is too long is not well defined, but waits of tens of milliseconds are likely non-ideal. When putting a Chromebook under memory pressure (opening over 90 tabs on a 4GB machine) it was fairly easy to see delays waiting for some locks in the kcompactd code path of > 100 ms. While the laptop wasn't amazingly usable in this state, it was still limping along and this state isn't something artificial. Sometimes we simply end up with a lot of memory pressure.

Putting the same Chromebook under memory pressure while it was running Android apps (though not stressing them) showed a much worse result (NOTE: this was on an older kernel but the codepaths here are similar). Android apps on ChromeOS currently run from a 128K-block, zlib-compressed, loopback-mounted squashfs disk. If we get a page fault from something backed by the squashfs filesystem we could end up holding a folio lock while reading enough from disk to decompress 128K (and then decompressing it using the somewhat slow zlib algorithms). That reading goes through the ext4 subsystem (because it's a loopback mount) before eventually ending up in the block subsystem. This extra jaunt adds extra overhead. Without much work I could see cases where we ended up blocked on a folio lock for over a second. With more extreme memory pressure I could see up to 25 seconds.

We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the two locks that were seen to be slow [1] and that generated much discussion. After discussion, it was decided that we should avoid waiting for the two locks during MIGRATE_SYNC_LIGHT if they were being held for IO. We'll continue with the unbounded wait for the more full SYNC modes. With this change, I couldn't see any slow waits on these locks with my previous testcases.

NOTE: The reason I started digging into this originally isn't because some benchmark had gone awry, but because we've received in-the-field crash reports where we have a hung task waiting on the page lock (which is the equivalent code path on old kernels). While the root cause of those crashes is likely unrelated and won't be fixed by this patch, analyzing those crash reports did point out these very long waits seemed like something good to fix. With this patch we should no longer hang waiting on these locks, but presumably the system will still be in a bad shape and hang somewhere else.

[1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid

Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hillf Danton <hdanton@sina.com> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
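The shape of the folio-lock half of the change is roughly the following sketch (error codes, the PF_MEMALLOC path, and the matching writeback/buffer-lock handling are abridged):

----
	if (!folio_trylock(src)) {
		if (mode == MIGRATE_ASYNC)
			goto out;
		/*
		 * In MIGRATE_SYNC_LIGHT, waiting for transient lock holders
		 * is fine, but a !uptodate folio suggests the lock is held
		 * for (possibly slow) IO, so don't block on it.
		 */
		if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
			goto out;

		folio_lock(src);
	}
----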
2023-04-28  Merge tag 'x86_mm_for_6.4' of ↵  Linus Torvalds  (1 file, -5/+6)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 LAM (Linear Address Masking) support from Dave Hansen: "Add support for the new Linear Address Masking CPU feature. This is similar to ARM's Top Byte Ignore and allows userspace to store metadata in some bits of pointers without masking it out before use" * tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA selftests/x86/lam: Add test cases for LAM vs thread creation selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking selftests/x86/lam: Add inherit test cases for linear-address masking selftests/x86/lam: Add io_uring test cases for linear-address masking selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking x86/mm/iommu/sva: Make LAM and SVA mutually exclusive iommu/sva: Replace pasid_valid() helper with mm_valid_pasid() mm: Expose untagging mask in /proc/$PID/status x86/mm: Provide arch_prctl() interface for LAM x86/mm: Reduce untagged_addr() overhead for systems without LAM x86/uaccess: Provide untagged_addr() and remove tags before address check mm: Introduce untagged_addr_remote() x86/mm: Handle LAM on context switch x86: CPUID and CR3/CR4 flags for Linear Address Masking x86: Allow atomic MM_CONTEXT flags setting x86/mm: Rework address range check in get_user() and put_user()
2023-04-22  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions  David Hildenbrand  (1 file, -4/+1)
Staring at the comment "Recheck VMA as permissions can change since migration started" in remove_migration_pte() can result in confusion, because if the source PTE/PMD indicates write permissions, then there should be no need to check VMA write permissions when restoring migration entries or PTE-mapping a PMD. Commit d3cb8bf6081b ("mm: migrate: Close race between migration completion and mprotect") introduced the maybe_mkwrite() handling in remove_migration_pte() in 2014, stating that a race between mprotect() and migration finishing would be possible, and that we could end up with a writable PTE that should be readable. However, mprotect() code first updates vma->vm_flags / vma->vm_page_prot and then walks the page tables to (a) set all present writable PTEs to read-only and (b) convert all writable migration entries to readable migration entries. While walking the page tables and modifying the entries, migration code has to grab the PT locks to synchronize against concurrent page table modifications. Assuming migration would find a writable migration entry (while holding the PT lock) and replace it with a writable present PTE, surely mprotect() code didn't stumble over the writable migration entry yet (converting it into a readable migration entry) and would instead wait for the PT lock to convert the now present writable PTE into a read-only PTE. As mprotect() didn't finish yet, the behavior is just like migration didn't happen: a writable PTE will be converted to a read-only PTE. So it's fine to rely on the writability information in the source PTE/PMD and not recheck against the VMA as long as we're holding the PT lock to synchronize with anyone who concurrently wants to downgrade write permissions (like mprotect()) by first adjusting vma->vm_flags / vma->vm_page_prot to then walk over the page tables to adjust the page table entries. Running test cases that should reveal such races -- mprotect(PROT_READ) racing with page migration or THP splitting -- for multiple hours did not reveal an issue with this cleanup. Link: https://lkml.kernel.org/r/20230418142113.439494-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-22  migrate_pages_batch: fix statistics for longterm pin retry  Huang Ying  (1 file, -0/+3)
In commit fd4a7ac32918 ("mm: migrate: try again if THP split is failed due to page refcnt"), if the THP splitting fails due to page reference count, we will retry to improve migration successful rate. But the failed splitting is counted as migration failure and migration retry, which will cause duplicated failure counting. So, in this patch, this is fixed via undoing the failure counting if we decide to retry. The patch is tested via failure injection. Link: https://lkml.kernel.org/r/20230416235929.1040194-1-ying.huang@intel.com Fixes: fd4a7ac32918 ("mm: migrate: try again if THP split is failed due to page refcnt") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-19  mm/migrate: revert "mm/migrate: fix wrongly apply write bit after mkdirty on ↵  David Hildenbrand  (1 file, -2/+0)
sparc64" This reverts commit 96a9c287e25d ("mm/migrate: fix wrongly apply write bit after mkdirty on sparc64"). Now that sparc64 mkdirty handling is fixed and no longer sets a PTE/PMD writable that shouldn't be writable, let's revert the temporary fix. The mkdirty mm selftest still passes with this change on sparc64. Note that loongarch handling was fixed in commit bf2f34a506e6 ("LoongArch: Set _PAGE_DIRTY only if _PAGE_WRITE is set in {pmd,pte}_mkdirty()"). Link: https://lkml.kernel.org/r/20230411142512.438404-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-29  mm/migrate: drop pte_mkhuge() in remove_migration_pte()  Anshuman Khandual  (1 file, -1/+0)
Since the following commit, arch_make_huge_pte() should be used directly in the generic memory subsystem as a platform provided page table helper, instead of pte_mkhuge(). This just drops pte_mkhuge() from remove_migration_pte(), which has now become redundant. 'commit 16785bd77431 ("mm: merge pte_mkhuge() call into arch_make_huge_pte()")' Link: https://lkml.kernel.org/r/20230302025349.358341-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/ Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-16  mm: Introduce untagged_addr_remote()  Kirill A. Shutemov  (1 file, -5/+6)
untagged_addr() removes tags/metadata from the address and brings it to the canonical form. The helper is implemented on arm64 and sparc. Both of them do untagging based on global rules. However, Linear Address Masking (LAM) on x86 introduces per-process settings for untagging. As a result, untagged_addr() is now only suitable for untagging addresses for the current process. The new helper untagged_addr_remote() has to be used when the address targets a remote process. It requires the mmap lock for the target mm to be taken. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-6-kirill.shutemov%40linux.intel.com
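A usage contrast, as a sketch (the GUP-style caller shown is illustrative, not a specific call site):

----
	/* Current process: the per-mm LAM settings of "current" apply. */
	addr = untagged_addr(addr);

	/* Remote process: the target mm's mmap lock must be held so its
	 * untagging rules can be read safely. */
	mmap_read_lock(mm);
	start = untagged_addr_remote(mm, start);
	/* ... e.g. walk or pin pages of the remote mm here ... */
	mmap_read_unlock(mm);
----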
2023-03-08  migrate_pages: try migrate in batch asynchronously firstly  Huang Ying  (1 file, -18/+62)
When we have locked more than one folio, we cannot wait for a lock or bit (e.g., page lock, buffer head lock, writeback bit) synchronously; otherwise a deadlock may be triggered. This makes it hard to batch synchronous migration directly. This patch re-enables batching for synchronous migration by first trying to migrate in batch asynchronously. Any folios that fail to be migrated asynchronously are then migrated synchronously, one by one. Tests show that this can restore the TLB flushing batching performance for synchronous migration effectively. Link: https://lkml.kernel.org/r/20230303030155.160983-4-ying.huang@intel.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Hugh Dickins <hughd@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Xu, Pengfei" <pengfei.xu@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Stefan Roesch <shr@devkernel.io> Cc: Tejun Heo <tj@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
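Schematically, the two-pass flow looks like the sketch below. The "_sketch" suffixed names are stand-ins: the real migrate_pages_batch() takes many more arguments than shown here.

----
/* Stand-in: the real migrate_pages_batch() has a much longer signature. */
static int migrate_pages_batch_sketch(struct list_head *from, enum migrate_mode mode);

static int migrate_pages_sync_sketch(struct list_head *from, enum migrate_mode mode)
{
	LIST_HEAD(single);
	struct folio *folio, *folio2;
	int rc;

	/* 1) Optimistic pass: batch the whole list asynchronously, so no
	 *    synchronous waits are taken while several folios are locked. */
	rc = migrate_pages_batch_sketch(from, MIGRATE_ASYNC);

	/* 2) Retry whatever failed, in the caller's synchronous mode, but
	 *    one folio at a time, where blocking waits are safe again. */
	list_for_each_entry_safe(folio, folio2, from, lru) {
		list_move(&folio->lru, &single);
		migrate_pages_batch_sketch(&single, mode);
	}

	return rc;
}
----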
2023-03-08  migrate_pages: move split folios processing out of migrate_pages_batch()  Huang Ying  (1 file, -50/+28)
To simplify the code logic and reduce the line count. Link: https://lkml.kernel.org/r/20230303030155.160983-3-ying.huang@intel.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Xu, Pengfei" <pengfei.xu@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Stefan Roesch <shr@devkernel.io> Cc: Tejun Heo <tj@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-08  migrate_pages: fix deadlock in batched migration  Huang Ying  (1 file, -43/+26)
Patch series "migrate_pages: fix deadlock in batched synchronous migration", v2. Two deadlock bugs were reported for the migrate_pages() batching series. Thanks Hugh and Pengfei. Analysis shows that if we have locked some other folios except the one we are migrating, it's not safe in general to wait synchronously, for example, to wait the writeback to complete or wait to lock the buffer head. So 1/3 fixes the deadlock in a simple way, where the batching support for the synchronous migration is disabled. The change is straightforward and easy to be understood. While 3/3 re-introduce the batching for synchronous migration via trying to migrate asynchronously in batch optimistically, then fall back to migrate synchronously one by one for fail-to-migrate folios. Test shows that this can restore the TLB flushing batching performance for synchronous migration effectively. This patch (of 3): Two deadlock bugs were reported for the migrate_pages() batching series. Thanks Hugh and Pengfei! For example, in the following deadlock trace snippet, INFO: task kworker/u4:0:9 blocked for more than 147 seconds. Not tainted 6.2.0-rc4-kvm+ #1314 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000 Workqueue: loop4 loop_rootcg_workfn Call Trace: <TASK> __schedule+0x43b/0xd00 schedule+0x6a/0xf0 io_schedule+0x4a/0x80 folio_wait_bit_common+0x1b5/0x4e0 ? __pfx_wake_page_function+0x10/0x10 __filemap_get_folio+0x73d/0x770 shmem_get_folio_gfp+0x1fd/0xc80 shmem_write_begin+0x91/0x220 generic_perform_write+0x10e/0x2e0 __generic_file_write_iter+0x17e/0x290 ? generic_write_checks+0x12b/0x1a0 generic_file_write_iter+0x97/0x180 ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 do_iter_readv_writev+0x13c/0x210 ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 do_iter_write+0xf6/0x330 vfs_iter_write+0x46/0x70 loop_process_work+0x723/0xfe0 loop_rootcg_workfn+0x28/0x40 process_one_work+0x3cc/0x8d0 worker_thread+0x66/0x630 ? __pfx_worker_thread+0x10/0x10 kthread+0x153/0x190 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x29/0x50 </TASK> INFO: task repro:1023 blocked for more than 147 seconds. Not tainted 6.2.0-rc4-kvm+ #1314 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004 Call Trace: <TASK> __schedule+0x43b/0xd00 schedule+0x6a/0xf0 io_schedule+0x4a/0x80 folio_wait_bit_common+0x1b5/0x4e0 ? compaction_alloc+0x77/0x1150 ? __pfx_wake_page_function+0x10/0x10 folio_wait_bit+0x30/0x40 folio_wait_writeback+0x2e/0x1e0 migrate_pages_batch+0x555/0x1ac0 ? __pfx_compaction_alloc+0x10/0x10 ? __pfx_compaction_free+0x10/0x10 ? __this_cpu_preempt_check+0x17/0x20 ? lock_is_held_type+0xe6/0x140 migrate_pages+0x100e/0x1180 ? __pfx_compaction_free+0x10/0x10 ? __pfx_compaction_alloc+0x10/0x10 compact_zone+0xe10/0x1b50 ? lock_is_held_type+0xe6/0x140 ? check_preemption_disabled+0x80/0xf0 compact_node+0xa3/0x100 ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 ? 
_find_first_bit+0x7b/0x90 sysctl_compaction_handler+0x5d/0xb0 proc_sys_call_handler+0x29d/0x420 proc_sys_write+0x2b/0x40 vfs_write+0x3a3/0x780 ksys_write+0xb7/0x180 __x64_sys_write+0x26/0x30 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f3a2471f59d RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010 R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0 R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000 </TASK> The page migration task has held the lock of the shmem folio A, and is waiting the writeback of the folio B of the file system on the loop block device to complete. While the loop worker task which writes back the folio B is waiting to lock the shmem folio A, because the folio A backs the folio B in the loop device. Thus deadlock is triggered. In general, if we have locked some other folios except the one we are migrating, it's not safe to wait synchronously, for example, to wait the writeback to complete or wait to lock the buffer head. To fix the deadlock, in this patch, we avoid to batch the page migration except for MIGRATE_ASYNC mode. In MIGRATE_ASYNC mode, synchronous waiting is avoided. The fix can be improved further. We will do that as soon as possible. Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/ Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/ Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/ Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Stefan Roesch <shr@devkernel.io> Cc: Tejun Heo <tj@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
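The immediate fix can be pictured as limiting the batch to a single folio for the synchronous modes; this is a conceptual sketch rather than the literal hunk (NR_MAX_BATCHED_MIGRATION is the batch limit introduced by the batching series):

----
	/* Only MIGRATE_ASYNC may keep several folios locked across a batch;
	 * the synchronous modes fall back to migrating one folio per batch,
	 * so at most one folio lock is held while waiting for writeback or
	 * buffer locks. */
	int nr_batched;

	if (mode == MIGRATE_ASYNC)
		nr_batched = NR_MAX_BATCHED_MIGRATION;
	else
		nr_batched = 1;
----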
2023-03-05  mm: avoid gcc complaint about pointer casting  Linus Torvalds  (1 file, -2/+8)
The migration code ends up temporarily stashing information of the wrong type in unused fields of the newly allocated destination folio. That all works fine, but gcc does complain about the pointer type mis-use:

    mm/migrate.c: In function ‘__migrate_folio_extract’:
    mm/migrate.c:1050:20: note: randstruct: casting between randomized structure pointer types (ssa): ‘struct anon_vma’ and ‘struct address_space’
     1050 |         *anon_vmap = (void *)dst->mapping;
          |         ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~

and gcc is actually right to complain since it really doesn't understand that this is a very temporary special case where this is ok. This could be fixed in different ways by just obfuscating the assignment sufficiently that gcc doesn't see what is going on, but the truly "proper C" way to do this is by explicitly using a union. Using unions for type conversions like this is normally hugely ugly and syntactically nasty, but this really is one of the few cases where we want to make it clear that we're not doing type conversion, we're really re-using the value bit-for-bit just using another type. IOW, this should not become a common pattern, but in this one case using that odd union is probably the best way to document to the compiler what is conceptually going on here. [ Side note: there are valid cases where we convert pointers to other pointer types, notably the whole "folio vs page" situation, where the types actually have fundamental commonalities. The fact that the gcc note is limited to just randomized structures means that we don't see equivalent warnings for those cases, but it might also mean that we miss other cases where we do play these kinds of dodgy games, and this kind of explicit conversion might be a good idea. ] I verified that at least for an allmodconfig build on x86-64, this generates the exact same code, apart from line numbers and assembler comment changes. Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()") Cc: Huang, Ying <ying.huang@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
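The resulting helpers are close to the following sketch (field and parameter names abridged from the real functions):

----
union migration_ptr {
	struct anon_vma *anon_vma;
	struct address_space *mapping;
};

static void __migrate_folio_record(struct folio *dst,
				   unsigned long page_was_mapped,
				   struct anon_vma *anon_vma)
{
	union migration_ptr ptr = { .anon_vma = anon_vma };

	dst->mapping = ptr.mapping;		/* re-used bit-for-bit, no cast */
	dst->private = (void *)page_was_mapped;
}

static void __migrate_folio_extract(struct folio *dst,
				    int *page_was_mappedp,
				    struct anon_vma **anon_vmap)
{
	union migration_ptr ptr = { .mapping = dst->mapping };

	*anon_vmap = ptr.anon_vma;
	*page_was_mappedp = (unsigned long)dst->private;
	dst->mapping = NULL;
	dst->private = NULL;
}
----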
2023-02-24  Merge tag 'mm-stable-2023-02-20-13-37' of ↵  Linus Torvalds  (1 file, -281/+653)
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". 
- Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-20  mm: change to return bool for isolate_movable_page()  Baolin Wang  (1 file, -3/+3)
Now the isolate_movable_page() can only return 0 or -EBUSY, and no users will care about the negative return value, thus we can convert the isolate_movable_page() to return a boolean value to make the code more clear when checking the movable page isolation state. No functional changes intended. [akpm@linux-foundation.org: remove unneeded comment, per Matthew] Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
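For callers, the conversion looks like the following sketch; the isolation flag and the jump label are illustrative, not a specific call site.

----
	/* Before, 0 meant success and -EBUSY failure:
	 *
	 *	if (isolate_movable_page(page, ISOLATE_UNEVICTABLE))
	 *		goto out_putpage;
	 *
	 * After, the function returns true on successful isolation: */
	if (!isolate_movable_page(page, ISOLATE_UNEVICTABLE))
		goto out_putpage;
----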
2023-02-20  mm: hugetlb: change to return bool for isolate_hugetlb()  Baolin Wang  (1 file, -4/+3)
Now the isolate_hugetlb() only returns 0 or -EBUSY, and most users did not care about the negative value, thus we can convert the isolate_hugetlb() to return a boolean value to make code more clear when checking the hugetlb isolation state. Moreover converts 2 users which will consider the negative value returned by isolate_hugetlb(). No functional changes intended. [akpm@linux-foundation.org: shorten locked section, per SeongJae Park] Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-20  mm: change to return bool for isolate_lru_page()  Baolin Wang  (1 file, -3/+6)
The isolate_lru_page() can only return 0 or -EBUSY, and most users did not care about the negative error of isolate_lru_page(), except one user in add_page_for_migration(). So we can convert the isolate_lru_page() to return a boolean value, which can help to make the code more clear when checking the return value of isolate_lru_page(). Also convert all users' logic of checking the isolation state. No functional changes intended. Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-18mm/migrate: fix wrongly apply write bit after mkdirty on sparc64Peter Xu1-0/+2
Nick Bowler reported another sparc64 breakage after the work to persist the young/dirty bits across page migration (per "Link:" below), following a similar earlier report [2]. It turns out page migration was overlooked, and it wasn't failing before because page migration was not enabled in the test environment of the initial report. David proposed another way [2] to fix this from the sparc64 side, but that patch never landed, and I have not checked whether any other architectures have similar issues. Let's fix it for now in the simplest way: move the write bit handling to after the dirty bit handling, like what we did before. Note: this is based on mm-unstable, because the breakage has existed since 6.1 and we're at a very late stage of 6.2 (-rc8), so I assume for this specific case we should target this at 6.3. [1] https://lore.kernel.org/all/20221021160603.GA23307@u164.east.ru/ [2] https://lore.kernel.org/all/20221212130213.136267-1-david@redhat.com/ Link: https://lkml.kernel.org/r/20230216153059.256739-1-peterx@redhat.com Fixes: 2e3468778dbe ("mm: remember young/dirty bit for page migrations") Link: https://lore.kernel.org/all/CADyTPExpEqaJiMGoV+Z6xVgL50ZoMJg49B10LcZ=8eg19u34BA@mail.gmail.com/ Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Nick Bowler <nbowler@draconx.ca> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Nick Bowler <nbowler@draconx.ca> Cc: <regressions@lists.linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
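A rough sketch of the intended ordering when restoring a migration entry (simplified, not the exact diff): on sparc64, pte_mkdirty() can also grant write access, so the write-bit handling has to come after the dirty-bit handling and non-writable entries must end up explicitly write-protected:

    if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
            pte = pte_mkdirty(pte);
    if (is_writable_migration_entry(entry))
            pte = pte_mkwrite(pte);
    else
            /* keep read-only entries read-only even after mkdirty */
            pte = pte_wrprotect(pte);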
2023-02-17migrate_pages: move THP/hugetlb migration support check to simplify codeHuang Ying1-47/+36
This is a code cleanup patch; no functionality change is expected. After the change, the line count is reduced, especially in the long migrate_pages_batch(). Link: https://lkml.kernel.org/r/20230213123444.155149-10-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-17migrate_pages: batch flushing TLBHuang Ying1-1/+4
TLB flushing can cost quite a few CPU cycles during folio migration in some situations, for example, when migrating a folio of a process with multiple active threads that run on multiple CPUs. After batching the _unmap and _move stages in migrate_pages(), the TLB flushing can be batched easily with the existing TLB flush batching mechanism. This patch implements that. We use the following test case to test the patch. On a 2-socket Intel server, - Run the pmbench memory accessing benchmark - Run `migratepages` to migrate pages of pmbench between node 0 and node 1 back and forth. With the patch, the TLB flushing IPI count is reduced by 99.1% during the test and the number of pages migrated successfully per second increases by 291.7%. Haoxin helped to test the patchset on an ARM64 server with 128 cores and 2 NUMA nodes. Test results show that the page migration performance increases by up to 78%. NOTE: TLB flushing is batched only for normal folios, not for THP folios, because the overhead of TLB flushing for THP folios is much lower than that for normal folios (about 1/512 on x86). Link: https://lkml.kernel.org/r/20230213123444.155149-9-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
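A sketch of how the batching can hook into the deferred TLB-flush machinery that page reclaim already uses (TTU_BATCH_FLUSH / try_to_unmap_flush(); the surrounding control flow is illustrative):

    /* unmap phase: ask rmap to defer the TLB flush for this folio */
    try_to_migrate(src, TTU_BATCH_FLUSH);

    /* ... repeat for every folio in the batch ... */

    /* one flush (and one round of IPIs) for the whole batch */
    try_to_unmap_flush();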
2023-02-17migrate_pages: share more code between _unmap and _moveHuang Ying1-122/+85
This is a code cleanup patch to reduce the duplicated code between the _unmap and _move stages of migrate_pages(). No functionality change is expected. Link: https://lkml.kernel.org/r/20230213123444.155149-8-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-17migrate_pages: move migrate_folio_unmap()Huang Ying1-50/+50
Just move the functions; there is no functional change. This makes it easier to review the next patch by putting the code near its position in that patch. Link: https://lkml.kernel.org/r/20230213123444.155149-7-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-17migrate_pages: batch _unmap and _moveHuang Ying1-25/+189
In this patch the _unmap and _move stages of the folio migration are batched. Previously, the loop was:

  for each folio
    _unmap()
    _move()

Now it is:

  for each folio
    _unmap()
  for each folio
    _move()

Based on this, we can batch the TLB flushing and use a hardware accelerator to copy folios between the batched _unmap and batched _move stages. Link: https://lkml.kernel.org/r/20230213123444.155149-6-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-17migrate_pages: split unmap_and_move() to _unmap() and _move()Huang Ying1-41/+128
This is a preparation patch to batch the folio unmapping and moving. In this patch, unmap_and_move() is split to migrate_folio_unmap() and migrate_folio_move(). So, we can batch _unmap() and _move() in different loops later. To pass some information between unmap and move, the previously unused dst->mapping and dst->private fields are used. Link: https://lkml.kernel.org/r/20230213123444.155149-5-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
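Conceptually the split looks roughly like this (helper names follow the patch description; parameter lists are abbreviated and the MIGRATEPAGE_UNMAP return convention is an assumption):

    /* before: one helper performed both stages back to back */
    rc = unmap_and_move(get_new_page, put_new_page, private, folio, ...);

    /* after: separate stages, so they can later run in separate loops */
    rc = migrate_folio_unmap(get_new_page, put_new_page, private, folio, &dst, ...);
    if (rc == MIGRATEPAGE_UNMAP)
            rc = migrate_folio_move(put_new_page, private, folio, dst, ...);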
2023-02-17migrate_pages: restrict number of pages to migrate in batchHuang Ying1-68/+106
This is a preparation patch to batch the folio unmapping and moving for non-hugetlb folios. Once the folio unmapping is batched, all folios to be migrated are unmapped before their contents and flags are copied. If too many folios (counted in pages) were passed to migrate_pages(), the processes would be stopped for too long, causing excessive latency; for example, the migrate_pages() syscall will call migrate_pages() with all folios of a process. To avoid this possible issue, in this patch we restrict the number of pages to be migrated in one batch to no more than HPAGE_PMD_NR. That is, the influence is at the same level as THP migration. Link: https://lkml.kernel.org/r/20230213123444.155149-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
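A sketch of how such a cap might look in the batching loop (the macro and variable names are assumptions, not the exact code):

    #define NR_MAX_BATCHED_MIGRATION  HPAGE_PMD_NR

    nr_pages = 0;
    list_for_each_entry_safe(folio, folio2, from, lru) {
            nr_pages += folio_nr_pages(folio);
            if (nr_pages >= NR_MAX_BATCHED_MIGRATION) {
                    /* migrate what has been collected, then start a new batch */
                    break;
            }
    }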
2023-02-17migrate_pages: separate hugetlb folios migrationHuang Ying1-22/+119
This is a preparation patch to batch the folio unmapping and moving for non-hugetlb folios. Based on that, we can batch the TLB shootdown during folio migration and make it possible to use a hardware accelerator for the folio copying. In this patch, the migration of hugetlb folios and non-hugetlb folios is separated in migrate_pages() to make it easy to change the non-hugetlb folio migration implementation. Link: https://lkml.kernel.org/r/20230213123444.155149-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
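Schematically, migrate_pages() then ends up with two paths (the helper names here are assumptions about where the series is heading, not the exact code of this patch):

    /* 1) migrate hugetlb folios one by one, as before */
    rc = migrate_hugetlbs(from, get_new_page, put_new_page, private,
                          mode, reason, &stats, &ret_folios);

    /*
     * 2) migrate the remaining (non-hugetlb) folios; this is the path
     *    that can later batch the unmap / TLB flush / move stages.
     */
    rc = migrate_pages_batch(from, get_new_page, put_new_page, private,
                             mode, reason, &ret_folios, &stats);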
2023-02-17migrate_pages: organize stats with struct migrate_pages_statsHuang Ying1-26/+34
Patch series "migrate_pages(): batch TLB flushing", v5. Now, migrate_pages() migrates folios one by one, like the fake code as follows, for each folio unmap flush TLB copy restore map If multiple folios are passed to migrate_pages(), there are opportunities to batch the TLB flushing and copying. That is, we can change the code to something as follows, for each folio unmap for each folio flush TLB for each folio copy for each folio restore map The total number of TLB flushing IPI can be reduced considerably. And we may use some hardware accelerator such as DSA to accelerate the folio copying. So in this patch, we refactor the migrate_pages() implementation and implement the TLB flushing batching. Base on this, hardware accelerated folio copying can be implemented. If too many folios are passed to migrate_pages(), in the naive batched implementation, we may unmap too many folios at the same time. The possibility for a task to wait for the migrated folios to be mapped again increases. So the latency may be hurt. To deal with this issue, the max number of folios be unmapped in batch is restricted to no more than HPAGE_PMD_NR in the unit of page. That is, the influence is at the same level of THP migration. We use the following test to measure the performance impact of the patchset, On a 2-socket Intel server, - Run pmbench memory accessing benchmark - Run `migratepages` to migrate pages of pmbench between node 0 and node 1 back and forth. With the patch, the TLB flushing IPI reduces 99.1% during the test and the number of pages migrated successfully per second increases 291.7%. Xin Hao helped to test the patchset on an ARM64 server with 128 cores, 2 NUMA nodes. Test results show that the page migration performance increases up to 78%. This patch (of 9): Define struct migrate_pages_stats to organize the various statistics in migrate_pages(). This makes it easier to collect and consume the statistics in multiple functions. This will be needed in the following patches in the series. Link: https://lkml.kernel.org/r/20230213123444.155149-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20230213123444.155149-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-14mm/migrate: convert putback_movable_pages() to use foliosVishal Moola (Oracle)1-23/+23
Removes 6 calls to compound_head(), and replaces putback_movable_page() with putback_movable_folio() as well. Link: https://lkml.kernel.org/r/20230130214352.40538-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-14mm/migrate: convert isolate_movable_page() to use foliosVishal Moola (Oracle)1-19/+20
Removes 6 calls to compound_head() and prepares the function to take in a folio instead of a page argument. Link: https://lkml.kernel.org/r/20230130214352.40538-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-14mm/migrate: add folio_movable_ops()Vishal Moola (Oracle)1-1/+1
folio_movable_ops() does the same as page_movable_ops() except uses folios instead of pages. This function will help make folio conversions in migrate.c more readable. Link: https://lkml.kernel.org/r/20230130214352.40538-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
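Presumably a thin wrapper along these lines (a sketch, assuming it simply forwards to page_movable_ops()):

    static inline const struct movable_operations *folio_movable_ops(struct folio *folio)
    {
            return page_movable_ops(&folio->page);
    }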
2023-02-14mm/hugetlb: convert putback_active_hugepage to take in a folioSidhartha Kumar1-4/+4
Convert putback_active_hugepage() to folio_putback_active_hugetlb(); this removes one user of the Huge Page macros that take in a page. The callers in migrate.c are also cleaned up by being able to directly use the src and dst folio variables. Link: https://lkml.kernel.org/r/20230125170537.96973-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-14mm/hugetlb: convert alloc_migrate_huge_page to foliosSidhartha Kumar1-1/+4
Change alloc_huge_page_nodemask() to alloc_hugetlb_folio_nodemask() and alloc_migrate_huge_page() to alloc_migrate_hugetlb_folio(). Both functions now return a folio rather than a page. Link: https://lkml.kernel.org/r/20230113223057.173292-7-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-14mm/hugetlb: convert isolate_hugetlb to foliosSidhartha Kumar1-1/+1
Patch series "continue hugetlb folio conversion", v3. This series continues the conversion of core hugetlb functions to use folios. This series converts many helper funtions in the hugetlb fault path. This is in preparation for another series to convert the hugetlb fault code paths to operate on folios. This patch (of 8): Convert isolate_hugetlb() to take in a folio and convert its callers to pass a folio. Use page_folio() to convert the callers to use a folio is safe as isolate_hugetlb() operates on a head page. Link: https://lkml.kernel.org/r/20230113223057.173292-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20230113223057.173292-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-03mm: mlock: update the interface to use foliosLorenzo Stoakes1-1/+1
Update the mlock interface to accept folios rather than pages, bringing the interface in line with the internal implementation. munlock_vma_page() still requires a page_folio() conversion; however, this is consistent with the existing mlock_vma_page() implementation and is a product of rmap still dealing in pages rather than folios. Link: https://lkml.kernel.org/r/cba12777c5544305014bc0cbec56bb4cc71477d8.1673526881.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/hugetlb: move swap entry handling into vma lock when faultedPeter Xu1-4/+21
In hugetlb_fault(), there used to be a special path at the entrance to handle swap entries using huge_pte_offset(). That is unsafe because huge_pte_offset() on a pmd-sharable range can access freed page tables if there is no lock to protect the pgtable from being freed after pmd unshare. Here the simplest solution to make it safe is to move the swap handling to after the vma lock is held. We may need to take the fault mutex on either migration or hwpoison entries now (also the vma lock, but that is really needed); however, neither of them is a hot path. Note that the vma lock cannot be released in hugetlb_fault() when a migration entry is detected, because in migration_entry_wait_huge() the pgtable page will be used again (by taking the pgtable lock), so it also needs to be protected by the vma lock. Modify migration_entry_wait_huge() so that it must be called with the vma read lock held, and properly release the lock in __migration_entry_wait_huge(). Link: https://lkml.kernel.org/r/20221216155100.2043537-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
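A simplified sketch of the new ordering in hugetlb_fault() (error handling, the fault mutex and pte handling are omitted; treat the flow as an approximation of the patch, not its exact code):

    hugetlb_vma_lock_read(vma);
    ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
    entry = huge_ptep_get(ptep);
    if (unlikely(is_hugetlb_entry_migration(entry))) {
            /*
             * Must be called with the vma read lock held; the lock is
             * released inside before sleeping on the migration entry.
             */
            migration_entry_wait_huge(vma, ptep);
            /* ... unlock and return ... */
    }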
2022-12-14Merge tag 'mm-stable-2022-12-13' of ↵Linus Torvalds1-132/+153
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - More userfaultfd work from Peter Xu - Several convert-to-folios series from Sidhartha Kumar and Huang Ying - Some filemap cleanups from Vishal Moola - David Hildenbrand added the ability to selftest anon memory COW handling - Some cpuset simplifications from Liu Shixin - Addition of vmalloc tracing support by Uladzislau Rezki - Some pagecache folioifications and simplifications from Matthew Wilcox - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series should have been in the non-MM tree, my bad - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages - Peter Xu utilized the PTE marker code for handling swapin errors - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand - zram support for multiple compression streams from Sergey Senozhatsky - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway - Mel Gorman altered the page allocator so that local IRQs can remain enabled during per-cpu page allocations - Vishal Moola removed the try_to_release_page() wrapper - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache - David Hildenbrand did some cleanup and repair work on KSM COW breaking - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range() - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect - Christoph Hellwig has removed the .writepage() method from several filesystems. 
They only need .writepages() - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting - David Hildenbrand has fixed some of our MM selftests for 32-bit machines - Many singleton patches, as usual * tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits) mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio mm: mmu_gather: allow more than one batch of delayed rmaps mm: fix typo in struct pglist_data code comment kmsan: fix memcpy tests mm: add cond_resched() in swapin_walk_pmd_entry() mm: do not show fs mm pc for VM_LOCKONFAULT pages selftests/vm: ksm_functional_tests: fixes for 32bit selftests/vm: cow: fix compile warning on 32bit selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem mm,thp,rmap: fix races between updates of subpages_mapcount mm: memcg: fix swapcached stat accounting mm: add nodes= arg to memory.reclaim mm: disable top-tier fallback to reclaim on proactive reclaim selftests: cgroup: make sure reclaim target memcg is unprotected selftests: cgroup: refactor proactive reclaim code to reclaim_until() mm: memcg: fix stale protection of reclaim target memcg mm/mmap: properly unaccount memory on mas_preallocate() failure omfs: remove ->writepage jfs: remove ->writepage ...
2022-12-13Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A large number of cleanups and bug fixes, with many of the bug fixes found by Syzbot and fuzzing. (Many of the bug fixes involve less-used ext4 features such as fast_commit, inline_data and bigalloc) In addition, remove the writepage function for ext4, since the medium-term plan is to remove ->writepage() entirely. (The VM doesn't need or want writepage() for writeback, since it is fine with ->writepages() so long as ->migrate_folio() is implemented)" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits) ext4: fix reserved cluster accounting in __es_remove_extent() ext4: fix inode leak in ext4_xattr_inode_create() on an error path ext4: allocate extended attribute value in vmalloc area ext4: avoid unaccounted block allocation when expanding inode ext4: initialize quota before expanding inode in setproject ioctl ext4: stop providing .writepage hook mm: export buffer_migrate_folio_norefs() ext4: switch to using write_cache_pages() for data=journal writeout jbd2: switch jbd2_submit_inode_data() to use fs-provided hook for data writeout ext4: switch to using ext4_do_writepages() for ordered data writeout ext4: move percpu_rwsem protection into ext4_writepages() ext4: provide ext4_do_writepages() ext4: add support for writepages calls that cannot map blocks ext4: drop pointless IO submission from ext4_bio_write_page() ext4: remove nr_submitted from ext4_bio_write_page() ext4: move keep_towrite handling to ext4_bio_write_page() ext4: handle redirtying in ext4_bio_write_page() ext4: fix kernel BUG in 'ext4_write_inline_data_end()' ext4: make ext4_mb_initialize_context return void ext4: fix deadlock due to mbcache entry corruption ...
2022-12-12Merge tag 'slab-for-6.2-rc1' of ↵Linus Torvalds1-3/+12
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - SLOB deprecation and SLUB_TINY The SLOB allocator adds maintenance burden and stands in the way of API improvements [1]. Deprecate it by renaming the config option (to make users notice) to CONFIG_SLOB_DEPRECATED with updated help text. SLUB should be used instead, as SLAB will be the next on the removal list. Based on reports from a riscv k210 board with 8MB RAM, add a CONFIG_SLUB_TINY option to minimize SLUB's memory usage at the expense of scalability. This has resolved the k210 regression [2], so in case there are no others (that wouldn't be resolvable by further tweaks to SLUB_TINY) the plan is to remove SLOB in a few cycles. Existing defconfigs with CONFIG_SLOB are converted to CONFIG_SLUB_TINY. - kmalloc() slub_debug redzone improvements A series from Feng Tang that builds on the tracking of requested size for kmalloc() allocations (for caches with debugging enabled) added in 6.1, to make redzone checks consider the requested size and not the rounded up one, in order to catch more subtle buffer overruns. Includes new slub_kunit test. - struct slab fields reordering to accommodate larger rcu_head RCU folks would like to grow rcu_head with debugging options, which breaks current struct slab layout's assumptions, so reorganize it to make this possible. - Miscellaneous improvements/fixes: - __alloc_size checking compiler workaround (Kees Cook) - Optimize and cleanup SLUB's sysfs init (Rasmus Villemoes) - Make SLAB compatible with PROVE_RAW_LOCK_NESTING (Jiri Kosina) - Correct SLUB's percpu allocation estimates (Baoquan He) - Re-enable SLUB's run-time failslab sysfs control (Alexander Atanasov) - Make tools/vm/slabinfo more user friendly when not run as root (Rong Tao) - Dead code removal in SLUB (Hyeonggon Yoo) * tag 'slab-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (31 commits) mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATED mm, slub: don't aggressively inline with CONFIG_SLUB_TINY mm, slub: remove percpu slabs with CONFIG_SLUB_TINY mm, slub: split out allocations from pre/post hooks mm/slub, kunit: Add a test case for kmalloc redzone check mm/slub, kunit: add SLAB_SKIP_KFENCE flag for cache creation mm, slub: refactor free debug processing mm, slab: ignore SLAB_RECLAIM_ACCOUNT with CONFIG_SLUB_TINY mm, slub: don't create kmalloc-rcl caches with CONFIG_SLUB_TINY mm, slub: lower the default slub_max_order with CONFIG_SLUB_TINY mm, slub: retain no free slabs on partial list with CONFIG_SLUB_TINY mm, slub: disable SYSFS support with CONFIG_SLUB_TINY mm, slub: add CONFIG_SLUB_TINY mm, slab: ignore hardened usercopy parameters when disabled slab: Remove special-casing of const 0 size allocations slab: Clean up SLOB vs kmalloc() definition mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head mm/migrate: make isolate_movable_page() skip slab pages mm/slab: move and adjust kernel-doc for kmem_cache_alloc mm/slub, percpu: correct the calculation of early percpu allocation size ...