summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-08-18mm/memory: convert do_shared_fault() to foliosSidhartha Kumar1-4/+7
Saves three implicit calls to compound_head(). Link: https://lkml.kernel.org/r/20230706163847.403202-3-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/memory: convert wp_page_shared() to use foliosSidhartha Kumar1-11/+11
Saves six implicit calls to compound_head(). Link: https://lkml.kernel.org/r/20230706163847.403202-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/memory: convert do_page_mkwrite() to use foliosSidhartha Kumar1-5/+5
Saves one implicit call to compound_head(). Link: https://lkml.kernel.org/r/20230706163847.403202-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: hugetlb_vmemmap: fix a race between vmemmap pmd splitMuchun Song1-20/+14
The local variable @page in __split_vmemmap_huge_pmd() to obtain a pmd page without holding page_table_lock may possiblely get the page table page instead of a huge pmd page. The effect may be in set_pte_at() since we may pass an invalid page struct, if set_pte_at() wants to access the page struct (e.g. CONFIG_PAGE_TABLE_CHECK is enabled), it may crash the kernel. So fix it. And inline __split_vmemmap_huge_pmd() since it only has one user. Link: https://lkml.kernel.org/r/20230707033859.16148-1-songmuchun@bytedance.com Fixes: d8d55f5616cf ("mm: sparsemem: use page table lock to protect kernel pmd operations") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/sparse: remove redundant judgments from macro for_each_present_section_nrliuq1-2/+1
next_present_section_nr() has already ensured that 'section_nr<=__highest_present_section_nr', so this check is removed. Link: https://lkml.kernel.org/r/20230707060501.29184-1-liuq131@chinatelecom.cn Signed-off-by: liuq <liuq131@chinatelecom.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: compaction: skip the memory hole rapidly when isolating free pagesBaolin Wang1-1/+33
Just like commit 9721fd82351d ("mm: compaction: skip memory hole rapidly when isolating migratable pages"), I can see it will also take more time to skip the larger memory hole (range: 0x1000000000 - 0x1800000000) when isolating free pages on my machine with below memory layout. So like commit 9721fd82351d, adding a new helper to skip the memory hole rapidly, which can reduce the time consumed from about 70us to less than 1us. [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] [ 0.000000] DMA32 empty [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] [shikemeng@huaweicloud.com: avoid missing last page block in section after skip offline sections] Link: https://lkml.kernel.org/r/20230804110454.2935878-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20230804110454.2935878-2-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/d2ba7e41ee566309b594311207ffca736375fc16.1688715750.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: compaction: use the correct type of list for free pagesBaolin Wang1-2/+2
Use the page->buddy_list instead of page->lru to clarify the correct type of list for free pages. Link: https://lkml.kernel.org/r/b21cd8e2e32b9a1d9bc9e43ebf8acaf35e87f8df.1688715750.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: fix some kernel-doc commentsYang Li1-1/+2
Add description of @mm_wr_locked and @mm. to silence the warnings: mm/memory.c:1716: warning: Function parameter or member 'mm_wr_locked' not described in 'unmap_vmas' mm/memory.c:5110: warning: Function parameter or member 'mm' not described in 'mm_account_fault' Link: https://lkml.kernel.org/r/20230707090034.125511-1-yang.lee@linux.alibaba.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: correct stale comment of function check_pteKemeng Shi1-5/+7
Commit 2aff7a4755bed ("mm: Convert page_vma_mapped_walk to work on PFNs") replaced page with pfns in page_vma_mapped_walk structure and updated "@pvmw->page" to "@pvmw->pfn" in comment of function page_vma_mapped_walk. This patch update stale "page" to "pfn" in comment of check_pte. Link: https://lkml.kernel.org/r/20230707153953.1380615-1-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm, netfs, fscache: stop read optimisation when folio removed from pagecacheDavid Howells8-1/+33
Fscache has an optimisation by which reads from the cache are skipped until we know that (a) there's data there to be read and (b) that data isn't entirely covered by pages resident in the netfs pagecache. This is done with two flags manipulated by fscache_note_page_release(): if (... test_bit(FSCACHE_COOKIE_HAVE_DATA, &cookie->flags) && test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags)) clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags); where the NO_DATA_TO_READ flag causes cachefiles_prepare_read() to indicate that netfslib should download from the server or clear the page instead. The fscache_note_page_release() function is intended to be called from ->releasepage() - but that only gets called if PG_private or PG_private_2 is set - and currently the former is at the discretion of the network filesystem and the latter is only set whilst a page is being written to the cache, so sometimes we miss clearing the optimisation. Fix this by following Willy's suggestion[1] and adding an address_space flag, AS_RELEASE_ALWAYS, that causes filemap_release_folio() to always call ->release_folio() if it's set, even if PG_private or PG_private_2 aren't set. Note that this would require folio_test_private() and page_has_private() to become more complicated. To avoid that, in the places[*] where these are used to conditionalise calls to filemap_release_folio() and try_to_release_page(), the tests are removed the those functions just jumped to unconditionally and the test is performed there. [*] There are some exceptions in vmscan.c where the check guards more than just a call to the releaser. I've added a function, folio_needs_release() to wrap all the checks for that. AS_RELEASE_ALWAYS should be set if a non-NULL cookie is obtained from fscache and cleared in ->evict_inode() before truncate_inode_pages_final() is called. Additionally, the FSCACHE_COOKIE_NO_DATA_TO_READ flag needs to be cleared and the optimisation cancelled if a cachefiles object already contains data when we open it. [dwysocha@redhat.com: call folio_mapping() inside folio_needs_release()] Link: https://github.com/DaveWysochanskiRH/kernel/commit/902c990e311120179fa5de99d68364b2947b79ec Link: https://lkml.kernel.org/r/20230628104852.3391651-3-dhowells@redhat.com Fixes: 1f67e6d0b188 ("fscache: Provide a function to note the release of a page") Fixes: 047487c947e8 ("cachefiles: Implement the I/O routines") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: SeongJae Park <sj@kernel.org> Cc: Daire Byrne <daire.byrne@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: merge folio_has_private()/filemap_release_folio() call pairsDavid Howells10-29/+27
Patch series "mm, netfs, fscache: Stop read optimisation when folio removed from pagecache", v7. This fixes an optimisation in fscache whereby we don't read from the cache for a particular file until we know that there's data there that we don't have in the pagecache. The problem is that I'm no longer using PG_fscache (aka PG_private_2) to indicate that the page is cached and so I don't get a notification when a cached page is dropped from the pagecache. The first patch merges some folio_has_private() and filemap_release_folio() pairs and introduces a helper, folio_needs_release(), to indicate if a release is required. The second patch is the actual fix. Following Willy's suggestions[1], it adds an AS_RELEASE_ALWAYS flag to an address_space that will make filemap_release_folio() always call ->release_folio(), even if PG_private/PG_private_2 aren't set. folio_needs_release() is altered to add a check for this. This patch (of 2): Make filemap_release_folio() check folio_has_private(). Then, in most cases, where a call to folio_has_private() is immediately followed by a call to filemap_release_folio(), we can get rid of the test in the pair. There are a couple of sites in mm/vscan.c that this can't so easily be done. In shrink_folio_list(), there are actually three cases (something different is done for incompletely invalidated buffers), but filemap_release_folio() elides two of them. In shrink_active_list(), we don't have have the folio lock yet, so the check allows us to avoid locking the page unnecessarily. A wrapper function to check if a folio needs release is provided for those places that still need to do it in the mm/ directory. This will acquire additional parts to the condition in a future patch. After this, the only remaining caller of folio_has_private() outside of mm/ is a check in fuse. Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Xiubo Li <xiubli@redhat.com> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18rmap: pass the folio to __page_check_anon_rmap()Matthew Wilcox (Oracle)1-4/+4
The lone caller already has the folio, so pass it in instead of deriving it from the page again. Link: https://lkml.kernel.org/r/20230706195251.2707542-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: cma: print cma name as well in cma_alloc debugPintu Kumar1-2/+2
CMA allocation can happen either from global cma or from dedicated cma region. Thus it is helpful to print cma name as well during initial debugging to confirm cma regions were getting initialized or not. Link: https://lkml.kernel.org/r/1688668414-12350-1-git-send-email-quic_pintu@quicinc.com Signed-off-by: Pintu Kumar <quic_pintu@quicinc.com> Signed-off-by: Pintu Agarwal <pintu.ping@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: remove obsolete comment above struct per_cpu_pagesMiaohe Lin1-1/+0
Since commit 01b44456a7aa ("mm/page_alloc: replace local_lock with normal spinlock"), per_cpu_pages is protected by normal spinlock. Remove the obsolete comment as it's not that helpful. Link: https://lkml.kernel.org/r/20230706092441.1574950-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18memory tier: rename destroy_memory_type() to put_memory_type()Miaohe Lin3-7/+7
It appears that destroy_memory_type() isn't a very good name because we usually will not free the memory_type here. So rename it to a more appropriate name i.e. put_memory_type(). Link: https://lkml.kernel.org/r/20230706063905.543800-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Huang, Ying <ying.huang@intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18selftests/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCEDJeff Xu1-0/+5
Add selftest for sysctl vm.memfd_noexec is 2 (MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) memfd_create(.., MFD_EXEC) should fail in this case. Link: https://lkml.kernel.org/r/20230705063315.3680666-3-jeffxu@google.com Reported-by: Dominique Martinet <asmadeus@codewreck.org> Closes: https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5a8MaQY3Jw@mail.gmail.com/T/ Signed-off-by: Jeff Xu <jeffxu@google.com> Cc: Daniel Verkamp <dverkamp@chromium.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCEDJeff Xu1-24/+33
Patch series "mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED", v2. When sysctl vm.memfd_noexec is 2 (MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED), memfd_create(.., MFD_EXEC) should fail. This complies with how MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED is defined - "memfd_create() without MFD_NOEXEC_SEAL will be rejected" Thanks to Dominique Martinet <asmadeus@codewreck.org> who reported the bug. see [1] for context. [1] https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5a8MaQY3Jw@mail.gmail.com/T/ This patch (of 2): When vm.memfd_noexec is 2 (MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED), memfd_create(.., MFD_EXEC) should fail. This complies with how MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED is defined - "memfd_create() without MFD_NOEXEC_SEAL will be rejected" Link: https://lkml.kernel.org/r/20230705063315.3680666-1-jeffxu@google.com Link: https://lkml.kernel.org/r/20230705063315.3680666-2-jeffxu@google.com Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") Reported-by: Dominique Martinet <asmadeus@codewreck.org> Closes: https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5a8MaQY3Jw@mail.gmail.com/T/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202306301351.kkbSegQW-lkp@intel.com/ Signed-off-by: Jeff Xu <jeffxu@google.com> Cc: Daniel Verkamp <dverkamp@chromium.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18fs: drop_caches: draining pages before dropping cachesAndrew Yang1-0/+2
We expect a file page access after dropping caches should be a major fault, but sometimes it's still a minor fault. That's because a file page can't be dropped if it's in a per-cpu pagevec. Draining all pages from per-cpu pagevec to lru list before trying to drop caches. Link: https://lkml.kernel.org/r/20230630092203.16080-1-andrew.yang@mediatek.com Signed-off-by: Andrew Yang <andrew.yang@mediatek.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18memcg: drop kmem.limit_in_bytesMichal Hocko2-12/+0
kmem.limit_in_bytes (v1 way to limit kernel memory usage) has been deprecated since 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes") merged in 5.16. We haven't heard about any serious users since then but it seems that the mere presence of the file is causing more harm thatn good. We (SUSE) have had several bug reports from customers where Docker based containers started to fail because a write to kmem.limit_in_bytes has failed. This was unexpected because runc code only expects ENOENT (kmem disabled) or EBUSY (tasks already running within cgroup). So a new error code was unexpected and the whole container startup failed. This has been later addressed by https://github.com/opencontainers/runc/commit/52390d68040637dfc77f9fda6bbe70952423d380 so current Docker runtimes do not suffer from the problem anymore. There are still older version of Docker in use and likely hard to get rid of completely. Address this by wiping out the file completely and effectively get back to pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration. I would recommend backporting to stable trees which have picked up 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes"). [mhocko@suse.com: restore _KMEM switch case] Link: https://lkml.kernel.org/r/ZKe5wxdbvPi5Cwd7@dhcp22.suse.cz Link: https://lkml.kernel.org/r/20230704115240.14672-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: page_alloc: avoid false page outside zone error infoMiaohe Lin1-3/+2
If pfn is outside zone boundaries in the first round, ret will be set to 1. But if pfn is changed to inside the zone boundaries in zone span seqretry path, ret is still set to 1 leading to false page outside zone error info. This is from code inspection. The race window should be really small thus hard to trigger in real world. [akpm@linux-foundation.org: code simplification, per Matthew] Link: https://lkml.kernel.org/r/20230704111823.940331-1-linmiaohe@huawei.com Fixes: bdc8cb984576 ("[PATCH] memory hotplug locking: zone span seqlock") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18selftest: add a testcase of ksm zero pagesxu xin1-1/+97
Add a function test_unmerge_zero_page() to test the functionality on unsharing and counting ksm-placed zero pages and counting of this patch series. test_unmerge_zero_page() actually contains four subjct test objects: (1) whether the count of ksm zero pages can update correctly after merging; (2) whether the count of ksm zero pages can update correctly after unmerging by madvise(...MADV_UNMERGEABLE); (3) whether the count of ksm zero pages can update correctly after unmerging by triggering write fault. (4) whether ksm zero pages are really unmerged. Link: https://lkml.kernel.org/r/20230613030947.186089-1-yang.yang29@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18ksm: consider KSM-placed zeropages when calculating KSM profitxu xin2-9/+13
When use_zero_pages is enabled, the calculation of ksm profit is not correct because ksm zero pages is not counted in. So update the calculation of KSM profit including the documentation. Link: https://lkml.kernel.org/r/20230613030942.186041-1-yang.yang29@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Acked-by: David Hildenbrand <david@redhat.com> Cc: Xiaokai Ran <ran.xiaokai@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Jiang Xuexin <jiang.xuexin@zte.com.cn> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18ksm: add ksm zero pages for each processxu xin6-8/+17
As the number of ksm zero pages is not included in ksm_merging_pages per process when enabling use_zero_pages, it's unclear of how many actual pages are merged by KSM. To let users accurately estimate their memory demands when unsharing KSM zero-pages, it's necessary to show KSM zero- pages per process. In addition, it help users to know the actual KSM profit because KSM-placed zero pages are also benefit from KSM. since unsharing zero pages placed by KSM accurately is achieved, then tracking empty pages merging and unmerging is not a difficult thing any longer. Since we already have /proc/<pid>/ksm_stat, just add the information of 'ksm_zero_pages' in it. Link: https://lkml.kernel.org/r/20230613030938.185993-1-yang.yang29@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18ksm: count all zero pages placed by KSMxu xin5-1/+37
As pages_sharing and pages_shared don't include the number of zero pages merged by KSM, we cannot know how many pages are zero pages placed by KSM when enabling use_zero_pages, which leads to KSM not being transparent with all actual merged pages by KSM. In the early days of use_zero_pages, zero-pages was unable to get unshared by the ways like MADV_UNMERGEABLE so it's hard to count how many times one of those zeropages was then unmerged. But now, unsharing KSM-placed zero page accurately has been achieved, so we can easily count both how many times a page full of zeroes was merged with zero-page and how many times one of those pages was then unmerged. and so, it helps to estimate memory demands when each and every shared page could get unshared. So we add ksm_zero_pages under /sys/kernel/mm/ksm/ to show the number of all zero pages placed by KSM. Meanwhile, we update the Documentation. Link: https://lkml.kernel.org/r/20230613030934.185944-1-yang.yang29@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Acked-by: David Hildenbrand <david@redhat.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18ksm: support unsharing KSM-placed zero pagesxu xin2-3/+14
Patch series "ksm: support tracking KSM-placed zero-pages", v10. The core idea of this patch set is to enable users to perceive the number of any pages merged by KSM, regardless of whether use_zero_page switch has been turned on, so that users can know how much free memory increase is really due to their madvise(MERGEABLE) actions. But the problem is, when enabling use_zero_pages, all empty pages will be merged with kernel zero pages instead of with each other as use_zero_pages is disabled, and then these zero-pages are no longer monitored by KSM. The motivations to do this is seen at: https://lore.kernel.org/lkml/202302100915227721315@zte.com.cn/ In one word, we hope to implement the support for KSM-placed zero pages tracking without affecting the feature of use_zero_pages, so that app developer can also benefit from knowing the actual KSM profit by getting KSM-placed zero pages to optimize applications eventually when /sys/kernel/mm/ksm/use_zero_pages is enabled. This patch (of 5): When use_zero_pages of ksm is enabled, madvise(addr, len, MADV_UNMERGEABLE) and other ways (like write 2 to /sys/kernel/mm/ksm/run) to trigger unsharing will *not* actually unshare the shared zeropage as placed by KSM (which is against the MADV_UNMERGEABLE documentation). As these KSM-placed zero pages are out of the control of KSM, the related counts of ksm pages don't expose how many zero pages are placed by KSM (these special zero pages are different from those initially mapped zero pages, because the zero pages mapped to MADV_UNMERGEABLE areas are expected to be a complete and unshared page). To not blindly unshare all shared zero_pages in applicable VMAs, the patch use pte_mkdirty (related with architecture) to mark KSM-placed zero pages. Thus, MADV_UNMERGEABLE will only unshare those KSM-placed zero pages. In addition, we'll reuse this mechanism to reliably identify KSM-placed ZeroPages to properly account for them (e.g., calculating the KSM profit that includes zeropages) in the latter patches. The patch will not degrade the performance of use_zero_pages as it doesn't change the way of merging empty pages in use_zero_pages's feature. Link: https://lkml.kernel.org/r/202306131104554703428@zte.com.cn Link: https://lkml.kernel.org/r/20230613030928.185882-1-yang.yang29@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Acked-by: David Hildenbrand <david@redhat.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/migrate_device: try to handle swapcache pagesMika Penttilä1-6/+15
Migrating file pages and swapcache pages into device memory is not supported. Try to get rid of the swap cache, and if successful, go ahead as with other anonymous pages. Link: https://lkml.kernel.org/r/20230607172944.11713-1-mpenttil@redhat.com Signed-off-by: Mika Penttilä <mpenttil@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18selftests: cgroup: add zswap-memcg unwanted writeback testDomenico Cerasuolo1-0/+61
Add a test to verify that when a memcg hits its limit in zswap, it doesn't trigger an unwanted writeback that would result in pages not owned by that memcg to be sent to disk, even if zswap isn't full. This was fixed by commit 0bdf0efa180a("zswap: do not shrink if cgroup may not zswap"). Link: https://lkml.kernel.org/r/20230621153548.428093-4-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18selftests: cgroup: add test_zswap with no kmem bypass testDomenico Cerasuolo1-0/+163
Add a cgroup selftest that verifies memcg charging in zswap. The original issue was that kmem bypass was applied to pages swapped out to zswap by kswapd, resulting in zswapped memory not being charged. It was fixed by commit cd08d80ecdac("mm: correctly charge compressed memory to its memcg"). Link: https://lkml.kernel.org/r/20230621153548.428093-3-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18selftests: cgroup: add test_zswap programDomenico Cerasuolo3-0/+65
Patch series "selftests: cgroup: add zswap test program". This series adds 2 zswap related selftests that verify known and fixed issues. A new dedicated test program (test_zswap) is proposed since the test cases are specific to zswap and hosts specific helpers. The first patch adds the (empty) test program, while the other 2 add an actual test function each. This patch (of 3): Add empty cgroup-zswap self test scaffold program, test functions to be added in the next commits. Link: https://lkml.kernel.org/r/20230621153548.428093-1-cerasuolodomenico@gmail.com Link: https://lkml.kernel.org/r/20230621153548.428093-2-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/page_alloc: use write_seqlock_irqsave() instead write_seqlock() + ↵Sebastian Andrzej Siewior1-9/+6
local_irq_save(). __build_all_zonelists() acquires zonelist_update_seq by first disabling interrupts via local_irq_save() and then acquiring the seqlock with write_seqlock(). This is troublesome and leads to problems on PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping lock on PREEMPT_RT and must not be acquired with disabled interrupts. The API provides write_seqlock_irqsave() which does the right thing in one step. printk_deferred_enter() has to be invoked in non-migrate-able context to ensure that deferred printing is enabled and disabled on the same CPU. This is the case after zonelist_update_seq has been acquired. There was discussion on the first submission that the order should be: local_irq_disable(); printk_deferred_enter(); write_seqlock(); to avoid pitfalls like having an unaccounted printk() coming from write_seqlock_irqsave() before printk_deferred_enter() is invoked. The only origin of such a printk() can be a lockdep splat because the lockdep annotation happens after the sequence count is incremented. This is exceptional and subject to change. It was also pointed that PREEMPT_RT can be affected by the printk problem since its write_seqlock_irqsave() does not really disable interrupts. This isn't the case because PREEMPT_RT's printk implementation differs from the mainline implementation in two important aspects: - Printing happens in a dedicated threads and not at during the invocation of printk(). - In emergency cases where synchronous printing is used, a different driver is used which does not use tty_port::lock. Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer printk output. Link: https://lkml.kernel.org/r/20230623201517.yw286Knb@linutronix.de Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18zsmalloc: remove zs_compact_controlMinchan Kim1-28/+9
__zs_compact always putback src_zspage into class list after migrate_zspage. Thus, we don't need to keep last position of src_zspage any more. Let's remove it. Link: https://lkml.kernel.org/r/20230624053120.643409-4-senozhatsky@chromium.org Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18zsmalloc: move migration destination zspage inuse checkSergey Senozhatsky1-4/+4
Destination zspage fullness check need to be done after zs_object_copy() because that's where source and destination zspages fullness change. Checking destination zspage fullness before zs_object_copy() may cause migration to loop through source zspage sub-pages scanning for allocate objects just to find out at the end that the destination zspage is full. Link: https://lkml.kernel.org/r/20230624053120.643409-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18zsmalloc: do not scan for allocated objects in empty zspageSergey Senozhatsky1-0/+9
Patch series "zsmalloc: small compaction improvements", v2. A tiny series that can reduce the number of find_alloced_obj() invocations (which perform a linear scan of sub-page) during compaction. Inspired by Alexey Romanov's findings. This patch (of 3): zspage migration can terminate as soon as it moves the last allocated object from the source zspage. Add a simple helper zspage_empty() that tests zspage ->inuse on each migration iteration. Link: https://lkml.kernel.org/r/20230624053120.643409-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Alexey Romanov <AVRomanov@sberdevices.ru> Reviewed-by: Alexey Romanov <avromanov@sberdevices.ru> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/mm_init.c: remove obsolete macro HASH_SMALLMiaohe Lin3-14/+3
HASH_SMALL only works when parameter numentries is 0. But the sole caller futex_init() never calls alloc_large_system_hash() with numentries set to 0. So HASH_SMALL is obsolete and remove it. Link: https://lkml.kernel.org/r/20230625021323.849147-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: André Almeida <andrealmeid@igalia.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/page_alloc: fix min_free_kbytes calculation regarding ZONE_MOVABLEliuq1-6/+6
The current calculation of min_free_kbytes only uses ZONE_DMA and ZONE_NORMAL pages,but the ZONE_MOVABLE zone->_watermark[WMARK_MIN] will also divide part of min_free_kbytes.This will cause the min watermark of ZONE_NORMAL to be too small in the presence of ZONE_MOVEABLE. __GFP_HIGH and PF_MEMALLOC allocations usually don't need movable zone pages, so just like ZONE_HIGHMEM, cap pages_min to a small value in __setup_per_zone_wmarks(). On my testing machine with 16GB of memory (transparent hugepage is turned off by default, and movablecore=12G is configured) The following is a comparative test data of watermark_min no patch add patch ZONE_DMA 1 8 ZONE_DMA32 151 709 ZONE_NORMAL 233 1113 ZONE_MOVABLE 1434 128 min_free_kbytes 7288 7326 Link: https://lkml.kernel.org/r/20230625031656.23941-1-liuq131@chinatelecom.cn Signed-off-by: liuq <liuq131@chinatelecom.cn> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18fs: convert block_commit_write to return voidBean Huo5-23/+13
block_commit_write() always returns 0, this patch changes it to return void. Link: https://lkml.kernel.org/r/20230626055518.842392-3-beanhuo@iokpp.de Signed-off-by: Bean Huo <beanhuo@micron.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Brauner <brauner@kernel.org> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Luís Henriques <ocfs2-devel@oss.oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18fs/buffer: clean up block_commit_writeBean Huo1-7/+4
Originally inode is used to get blksize, after commit 45bce8f3e343 ("fs/buffer.c: make block-size be per-page and protected by the page lock"), __block_commit_write no longer uses this parameter inode. [akpm@linux-foundation.org: remove now-unused local `inode'] Link: https://lkml.kernel.org/r/20230626055518.842392-2-beanhuo@iokpp.de Signed-off-by: Bean Huo <beanhuo@micron.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Brauner <brauner@kernel.org> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Luís Henriques <ocfs2-devel@oss.oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: memory-failure: remove unneeded 'inline' annotationMiaohe Lin1-2/+2
Remove unneeded 'inline' annotation from num_poisoned_pages_inc() and num_poisoned_pages_sub(). No functional change intended. Link: https://lkml.kernel.org/r/20230626114343.1846587-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18memory tier: use helper function destroy_memory_type()Miaohe Lin1-1/+1
Use helper function destroy_memory_type() to release memtype instead of open code it to help improve code readability a bit. No functional change intended. Link: https://lkml.kernel.org/r/20230626121053.1916447-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: memory-failure: remove unneeded page state check in shake_page()Miaohe Lin1-7/+4
Remove unneeded PageLRU(p) and is_free_buddy_page(p) check as slab caches are not shrunk now. This check can be added back when a lightweight range based shrinker is available. Link: https://lkml.kernel.org/r/20230628014929.3441386-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18maple_tree: add a fast path case in mas_wr_slot_store()Peng Zhang1-12/+24
When expanding a range in two directions, only partially overwriting the previous and next ranges, the number of entries will not be increased, so we can just update the pivots as a fast path. However, it may introduce potential risks in RCU mode, because it updates two pivots. We only enable it in non-RCU mode. Link: https://lkml.kernel.org/r/20230628073657.75314-5-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18maple_tree: optimize mas_wr_append(), also improve duplicating VMAsPeng Zhang1-11/+22
When the new range can be completely covered by the original last range without touching the boundaries on both sides, two new entries can be appended to the end as a fast path. We update the original last pivot at the end, and the newly appended two entries will not be accessed before this, so it is also safe in RCU mode. This is useful for sequential insertion, which is what we do in dup_mmap(). Enabling BENCH_FORK in test_maple_tree and just running bench_forking() gives the following time-consuming numbers: before: after: 17,874.83 msec 15,738.38 msec It shows about a 12% performance improvement for duplicating VMAs. Link: https://lkml.kernel.org/r/20230628073657.75314-4-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18maple_tree: add test for expanding range in RCU modePeng Zhang1-0/+75
Add test for expanding range in RCU mode. If we use the fast path of the slot store to expand range in RCU mode, this test will fail. Link: https://lkml.kernel.org/r/20230628073657.75314-3-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18maple_tree: add test for mas_wr_modify() fast pathPeng Zhang1-0/+65
Patch series "Optimize the fast path of mas_store()", v4. Add fast paths for mas_wr_append() and mas_wr_slot_store() respectively. The newly added fast path of mas_wr_append() is used in fork() and how much it benefits fork() depends on how many VMAs are duplicated. Thanks Liam for the review. This patch (of 4): Add tests for all cases of mas_wr_append() and mas_wr_slot_store(). Link: https://lkml.kernel.org/r/20230628073657.75314-1-zhangpeng.00@bytedance.com Link: https://lkml.kernel.org/r/20230628073657.75314-2-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/filemap.c: fix update prev_pos after one read request doneHaibo Li1-4/+5
ra->prev_pos tracks the last visited byte in the previous read request. It is used to check whether it is sequential read in ondemand_readahead and thus affects the readahead window. After commit 06c0444290ce ("mm/filemap.c: generic_file_buffered_read() now uses find_get_pages_contig"), update logic of prev_pos is changed. It updates prev_pos after each return from filemap_get_pages(). But the read request from user may be not fully completed at this point. The updated prev_pos impacts the subsequent readahead window. The real problem is performance drop of fsck_msdos between linux-5.4 and linux-5.15(also linux-6.4). Comparing to linux-5.4,It spends about 110% time and read 140% pages. The read pattern of fsck_msdos is not fully sequential. Simplified read pattern of fsck_msdos likes below: 1.read at page offset 0xa,size 0x1000 2.read at other page offset like 0x20,size 0x1000 3.read at page offset 0xa,size 0x4000 4.read at page offset 0xe,size 0x1000 Here is the read status on linux-6.4: 1.after read at page offset 0xa,size 0x1000 ->page ofs 0xa go into pagecache 2.after read at page offset 0x20,size 0x1000 ->page ofs 0x20 go into pagecache 3.read at page offset 0xa,size 0x4000 ->filemap_get_pages read ofs 0xa from pagecache and returns ->prev_pos is updated to 0xb and goto next loop ->filemap_get_pages tends to read ofs 0xb,size 0x3000 ->initial_readahead case in ondemand_readahead since prev_pos is the same as request ofs. ->read 8 pages while async size is 5 pages (PageReadahead flag at page 0xe) 4.read at page offset 0xe,size 0x1000 ->hit page 0xe with PageReadahead flag set,double the ra_size. read 16 pages while async size is 16 pages Now it reads 24 pages while actually uses 5 pages on linux-5.4: 1.the same as 6.4 2.the same as 6.4 3.read at page offset 0xa,size 0x4000 ->read ofs 0xa from pagecache ->read ofs 0xb,size 0x3000 using page_cache_sync_readahead read 3 pages ->prev_pos is updated to 0xd before generic_file_buffered_read returns 4.read at page offset 0xe,size 0x1000 ->initial_readahead case in ondemand_readahead since request ofs-prev_pos==1 ->read 4 pages while async size is 3 pages Now it reads 7 pages while actually uses 5 pages. In above demo, the initial_readahead case is triggered by offset of user request on linux-5.4. While it may be triggered by update logic of prev_pos on linux-6.4. To fix the performance drop, update prev_pos after finishing one read request. Link: https://lkml.kernel.org/r/20230628110220.120134-1-haibo.li@mediatek.com Signed-off-by: Haibo Li <haibo.li@mediatek.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18selftests/mm: add gup test matrix in run_vmtests.shPeter Xu1-5/+32
Add a matrix for testing gup based on the current gup_test. Only run the matrix when -a is specified because it's a bit slow. It covers: - Different types of huge pages: thp, hugetlb, or no huge page - Permissions: Write / Read-only - Fast-gup, with/without - Types of the GUP: pin / gup / longterm pins - Shared / Private memories - GUP size: 1 / 512 / random page sizes Link: https://lkml.kernel.org/r/20230628215310.73782-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18selftests/mm: add -a to run_vmtests.shPeter Xu1-3/+8
Allows to specify optional tests in run_vmtests.sh, where we can run time consuming test matrix only when user specified "-a". Link: https://lkml.kernel.org/r/20230628215310.73782-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/gup: retire follow_hugetlb_page()Peter Xu4-256/+1
Now __get_user_pages() should be well prepared to handle thp completely, as long as hugetlb gup requests even without the hugetlb's special path. Time to retire follow_hugetlb_page(). Tweak misc comments to reflect reality of follow_hugetlb_page()'s removal. Link: https://lkml.kernel.org/r/20230628215310.73782-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/gup: accelerate thp gup even for "pages != NULL"Peter Xu1-7/+44
The acceleration of THP was done with ctx.page_mask, however it'll be ignored if **pages is non-NULL. The old optimization was introduced in 2013 in 240aadeedc4a ("mm: accelerate mm_populate() treatment of THP pages"). It didn't explain why we can't optimize the **pages non-NULL case. It's possible that at that time the major goal was for mm_populate() which should be enough back then. Optimize thp for all cases, by properly looping over each subpage, doing cache flushes, and boost refcounts / pincounts where needed in one go. This can be verified using gup_test below: # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10 Before: 13992.50 ( +-8.75%) After: 378.50 (+-69.62%) Link: https://lkml.kernel.org/r/20230628215310.73782-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/gup: cleanup next_page handlingPeter Xu1-4/+3
The only path that doesn't use generic "**pages" handling is the gate vma. Make it use the same path, meanwhile tune the next_page label upper to cover "**pages" handling. This prepares for THP handling for "**pages". Link: https://lkml.kernel.org/r/20230628215310.73782-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>