summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-04-26mm: percpu: introduce pcpuobj_extKent Overstreet2-17/+32
Upcoming alloc tagging patches require a place to stash per-allocation metadata. We already do this when memcg is enabled, so this patch generalizes the obj_cgroup * vector in struct pcpu_chunk by creating a pcpu_obj_ext type, which we will be adding to in an upcoming patch - similarly to the previous slabobj_ext patch. Link: https://lkml.kernel.org/r/20240321163705.3067592-28-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: linux-mm@kvack.org Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mempool: hook up to memory allocation profilingKent Overstreet2-59/+50
This adds hooks to mempools for correctly annotating mempool-backed allocations at the correct source line, so they show up correctly in /sys/kernel/debug/allocations. Various inline functions are converted to wrappers so that we can invoke alloc_hooks() in fewer places. [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-4-surenb@google.com [surenb@google.com: add missing mempool_create_node documentation] Link: https://lkml.kernel.org/r/20240402180835.1661905-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-27-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/slab: enable slab allocation tagging for kmalloc and friendsSuren Baghdasaryan6-123/+127
Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations and deallocations done by these functions. [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-7-surenb@google.com [rdunlap@infradead.org: fix kcalloc() kernel-doc warnings] Link: https://lkml.kernel.org/r/20240327044649.9199-1-rdunlap@infradead.org Link: https://lkml.kernel.org/r/20240321163705.3067592-26-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26rust: add a rust helper for krealloc()Kent Overstreet1-0/+8
Memory allocation profiling is turning krealloc() into a nontrivial macro - so for now, we need a helper for it. Until we have proper support on the rust side for memory allocation profiling this does mean that all Rust allocations will be accounted to the helper. Link: https://lkml.kernel.org/r/20240321163705.3067592-25-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Miguel Ojeda <ojeda@kernel.org> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/slab: add allocation accounting into slab allocation and free pathsSuren Baghdasaryan1-1/+95
Account slab allocations using codetag reference embedded into slabobj_ext. Link: https://lkml.kernel.org/r/20240321163705.3067592-24-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26lib: add codetag reference into slabobj_extSuren Baghdasaryan2-0/+6
To store code tag for every slab object, a codetag reference is embedded into slabobj_ext when CONFIG_MEM_ALLOC_PROFILING=y. Link: https://lkml.kernel.org/r/20240321163705.3067592-23-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/page_ext: enable early_page_ext when CONFIG_MEM_ALLOC_PROFILING_DEBUG=ySuren Baghdasaryan1-0/+9
For all page allocations to be tagged, page_ext has to be initialized before the first page allocation. Early tasks allocate their stacks using page allocator before alloc_node_page_ext() initializes page_ext area, unless early_page_ext is enabled. Therefore these allocations will generate a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled. Enable early_page_ext whenever CONFIG_MEM_ALLOC_PROFILING_DEBUG=y to ensure page_ext initialization prior to any page allocation. This will have all the negative effects associated with early_page_ext, such as possible longer boot time, therefore we enable it only when debugging with CONFIG_MEM_ALLOC_PROFILING_DEBUG enabled and not universally for CONFIG_MEM_ALLOC_PROFILING. Link: https://lkml.kernel.org/r/20240321163705.3067592-22-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: fix non-compound multi-order memory accounting in __free_pagesSuren Baghdasaryan2-1/+28
When a non-compound multi-order page is freed, it is possible that a speculative reference keeps the page pinned. In this case we free all pages except for the first page, which will be freed later by the last put_page(). However the page passed to put_page() is indistinguishable from an order-0 page, so it cannot do the accounting, just as it cannot free the subsequent pages. Do the accounting here, where we free the pages. Link: https://lkml.kernel.org/r/20240321163705.3067592-21-surenb@google.com Reported-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: create new codetag references during page splittingSuren Baghdasaryan4-0/+43
When a high-order page is split into smaller ones, each newly split page should get its codetag. After the split each split page will be referencing the original codetag. The codetag's "bytes" counter remains the same because the amount of allocated memory has not changed, however the "calls" counter gets increased to keep the counter correct when these individual pages get freed. Link: https://lkml.kernel.org/r/20240321163705.3067592-20-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: enable page allocation taggingSuren Baghdasaryan7-103/+157
Redefine page allocators to record allocation tags upon their invocation. Instrument post_alloc_hook and free_pages_prepare to modify current allocation tag. [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-3-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-19-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26change alloc_pages name in dma_map_ops to avoid name conflictsSuren Baghdasaryan13-15/+15
After redefining alloc_pages, all uses of that name are being replaced. Change the conflicting names to prevent preprocessor from replacing them when it's not intended. Link: https://lkml.kernel.org/r/20240321163705.3067592-18-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: percpu: increase PERCPU_MODULE_RESERVE to accommodate allocation tagsSuren Baghdasaryan1-0/+4
As each allocation tag generates a per-cpu variable, more space is required to store them. Increase PERCPU_MODULE_RESERVE to provide enough area. A better long-term solution would be to allocate this memory dynamically. [surenb@google.com: increase PERCPU_MODULE_RESERVE to accommodate allocation tags] Link: https://lkml.kernel.org/r/20240406214044.1114406-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-17-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Tested-by: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26lib: introduce early boot parameter to avoid page_ext memory overheadSuren Baghdasaryan1-1/+40
The highest memory overhead from memory allocation profiling comes from page_ext objects. This overhead exists even if the feature is disabled but compiled-in. To avoid it, introduce an early boot parameter that prevents page_ext object creation. The new boot parameter is a tri-state with possible values of 0|1|never. When it is set to "never" the memory allocation profiling support is disabled, and overhead is minimized (currently no page_ext objects are allocated, in the future more overhead might be eliminated). As a result we also lose ability to enable memory allocation profiling at runtime (because there is no space to store alloctag references). Runtime sysctrl becomes read-only if the early boot parameter was set to "never". Note that the default value of this boot parameter depends on the CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT configuration. When CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n the boot parameter is set to "never", therefore eliminating any overhead. CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y results in boot parameter being set to 1 (enabled). This allows distributions to avoid any overhead by setting CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n config and with no changes to the kernel command line. We reuse sysctl.vm.mem_profiling boot parameter name in order to avoid introducing yet another control. This change turns it into a tri-state early boot parameter. Link: https://lkml.kernel.org/r/20240321163705.3067592-16-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26lib: introduce support for page allocation taggingSuren Baghdasaryan7-1/+105
Introduce helper functions to easily instrument page allocators by storing a pointer to the allocation tag associated with the code that allocated the page in a page_ext field. Link: https://lkml.kernel.org/r/20240321163705.3067592-15-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26lib: add allocation tagging support for memory allocation profilingSuren Baghdasaryan10-0/+428
Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily instrument memory allocators. It registers an "alloc_tags" codetag type with /proc/allocinfo interface to output allocation tag information when the feature is enabled. CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory allocation profiling instrumentation. Memory allocation profiling can be enabled or disabled at runtime using /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n. CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation profiling by default. [surenb@google.com: Documentation/filesystems/proc.rst: fix allocinfo title] Link: https://lkml.kernel.org/r/20240326073813.727090-1-surenb@google.com [surenb@google.com: do limited memory accounting for modules with ARCH_NEEDS_WEAK_PER_CPU] Link: https://lkml.kernel.org/r/20240402180933.1663992-2-surenb@google.com [klarasmodin@gmail.com: explicitly include irqflags.h in alloc_tag.h] Link: https://lkml.kernel.org/r/20240407133252.173636-1-klarasmodin@gmail.com [surenb@google.com: fix alloc_tag_init() to prevent passing NULL to PTR_ERR()] Link: https://lkml.kernel.org/r/20240417003349.2520094-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-14-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Klara Modin <klarasmodin@gmail.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26lib: prevent module unloading if memory is not freedSuren Baghdasaryan3-14/+30
Skip freeing module's data section if there are non-zero allocation tags because otherwise, once these allocations are freed, the access to their code tag would cause UAF. Link: https://lkml.kernel.org/r/20240321163705.3067592-13-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26lib: code tagging module supportSuren Baghdasaryan3-2/+72
Add support for code tagging from dynamically loaded modules. Link: https://lkml.kernel.org/r/20240321163705.3067592-12-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26lib: code tagging frameworkSuren Baghdasaryan4-0/+292
Add basic infrastructure to support code tagging which stores tag common information consisting of the module name, function, file name and line number. Provide functions to register a new code tag type and navigate between code tags. Link: https://lkml.kernel.org/r/20240321163705.3067592-11-surenb@google.com Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26slab: objext: introduce objext_flags as extension to page_memcg_data_flagsSuren Baghdasaryan2-11/+23
Introduce objext_flags to store additional objext flags unrelated to memcg. Link: https://lkml.kernel.org/r/20240321163705.3067592-10-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creationSuren Baghdasaryan2-2/+13
Slab extension objects can't be allocated before slab infrastructure is initialized. Some caches, like kmem_cache and kmem_cache_node, are created before slab infrastructure is initialized. Objects from these caches can't have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these caches and avoid creating extensions for objects allocated from these slabs. Link: https://lkml.kernel.org/r/20240321163705.3067592-9-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext creationSuren Baghdasaryan2-0/+13
Introduce __GFP_NO_OBJ_EXT flag in order to prevent recursive allocations when allocating slabobj_ext on a slab. Link: https://lkml.kernel.org/r/20240321163705.3067592-8-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: introduce slabobj_ext to support slab object extensionsSuren Baghdasaryan9-112/+145
Currently slab pages can store only vectors of obj_cgroup pointers in page->memcg_data. Introduce slabobj_ext structure to allow more data to be stored for each slab object. Wrap obj_cgroup into slabobj_ext to support current functionality while allowing to extend slabobj_ext in the future. Link: https://lkml.kernel.org/r/20240321163705.3067592-7-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26fs: convert alloc_inode_sb() to a macroKent Overstreet1-5/+1
We're introducing alloc tagging, which tracks memory allocations by callsite. Converting alloc_inode_sb() to a macro means allocations will be tracked by its caller, which is a bit more useful. Link: https://lkml.kernel.org/r/20240321163705.3067592-6-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26scripts/kallysms: always include __start and __stop symbolsKent Overstreet1-0/+13
These symbols are used to denote section boundaries: by always including them we can unify loading sections from modules with loading built-in sections, which leads to some significant cleanup. Link: https://lkml.kernel.org/r/20240321163705.3067592-5-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/slub: mark slab_free_freelist_hook() __always_inlineKent Overstreet1-3/+3
It seems we need to be more forceful with the compiler on this one. This is done for performance reasons only. Link: https://lkml.kernel.org/r/20240321163705.3067592-4-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26asm-generic/io.h: kill vmalloc.h dependencyKent Overstreet1-1/+0
Needed to avoid a new circular dependency with the memory allocation profiling series. Naturally, a whole bunch of files needed to include vmalloc.h that were previously getting it implicitly. Link: https://lkml.kernel.org/r/20240321163705.3067592-3-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26fix missing vmalloc.h includesKent Overstreet56-0/+59
Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev [surenb@google.com: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com [kent.overstreet@linux.dev: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev [arnd@arndb.de: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org [surenb@google.com: fix arc build] Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26scripts/kernel-doc: drop "_noprof" on function prototypesRandy Dunlap1-0/+1
Memory profiling introduces macros as hooks for function-level allocation profiling[1]. Memory allocation functions that are profiled are named like xyz_alloc() for API access to the function. xyz_alloc() then calls xyz_alloc_noprof() to do the allocation work. The kernel-doc comments for the memory allocation functions are introduced with the xyz_alloc() function names but the function implementations are the xyz_alloc_noprof() names. This causes kernel-doc warnings for mismatched documentation and function prototype names. By dropping the "_noprof" part of the function name, the kernel-doc function name matches the function prototype name, so the warnings are resolved. [1] https://lore.kernel.org/all/20240321163705.3067592-1-surenb@google.com/ Link: https://lkml.kernel.org/r/20240326054149.2121-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/all/20240325123603.1bdd6588@canb.auug.org.au/ Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26percpu: clean up all mappings when pcpu_map_pages() failsYosry Ahmed1-2/+2
In pcpu_map_pages(), if __pcpu_map_pages() fails on a CPU, we call __pcpu_unmap_pages() to clean up mappings on all CPUs where mappings were created, but not on the CPU where __pcpu_map_pages() fails. __pcpu_map_pages() and __pcpu_unmap_pages() are wrappers around vmap_pages_range_noflush() and vunmap_range_noflush(). All other callers of vmap_pages_range_noflush() call vunmap_range_noflush() when mapping fails, except pcpu_map_pages(). The reason could be that partial mappings may be left behind from a failed mapping attempt. Call __pcpu_unmap_pages() for the failed CPU as well in pcpu_map_pages(). This was found by code inspection, no failures or bugs were observed. Link: https://lkml.kernel.org/r/20240311194346.2291333-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter (Ampere) <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/numa_balancing: allow migrate on protnone reference with ↵Donet Tom1-5/+17
MPOL_PREFERRED_MANY policy commit bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes") added support for migrate on protnone reference with MPOL_BIND memory policy. This allowed numa fault migration when the executing node is part of the policy mask for MPOL_BIND. This patch extends migration support to MPOL_PREFERRED_MANY policy. Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag MPOL_F_NUMA_BALANCING. This causes issues when we want to use NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, the kernel should not allocate pages from the slower memory tier via allocation control zonelist fallback. Instead, we should move cold pages from the faster memory node via memory demotion. For a page allocation, kswapd is only woken up after we try to allocate pages from all nodes in the allocation zone list. This implies that, without using memory policies, we will end up allocating hot pages in the slower memory tier. MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better allocation control when we have memory tiers in the system. With MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only of faster memory nodes. When we fail to allocate pages from the faster memory node, kswapd would be woken up, allowing demotion of cold pages to slower memory nodes. With the current kernel, such usage of memory policies implies we can't do page promotion from a slower memory tier to a faster memory tier using numa fault. This patch fixes this issue. For MPOL_PREFERRED_MANY, if the executing node is in the policy node mask, we allow numa migration to the executing nodes. If the executing node is not in the policy node mask, we do not allow numa migration. Example: On a 2-sockets system, NUMA node N0, N1 and N2 are in socket 0, N3 in socket 1. N0, N1 and N3 have fast memory and CPU, while N2 has slow memory and no CPU. For a workload, we may use MPOL_PREFERRED_MANY with nodemask N0 and N1 set because the workload runs on CPUs of socket 0 at most times. Then, even if the workload runs on CPUs of N3 occasionally, we will not try to migrate the workload pages from N2 to N3 because users may want to avoid cross-socket access as much as possible in the long term. In below table, Process is the Process executing node and Curr Loc Pgs is the numa node where page present(folio node) =========================================================== Process Policy Curr Loc Pgs Observation ----------------------------------------------------------- N0 N0 N1 N1 Pages Migrated from N1 to N0 N0 N0 N1 N2 Pages Migrated from N2 to N0 N0 N0 N1 N3 Pages Migrated from N3 to N0 N3 N0 N1 N0 Pages NOT Migrated to N3 N3 N0 N1 N1 Pages NOT Migrated to N3 N3 N0 N1 N2 Pages NOT Migrated to N3 ------------------------------------------------------------ Link: https://lkml.kernel.org/r/158acc57319129aa46d50fd64c9330f3e7c7b4bf.1711373653.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/369d6a58758396335fd1176d97bbca4e7730d75a.1709909210.git.donettom@linux.ibm.com Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/mempolicy: use numa_node_id() instead of cpu_to_node()Donet Tom5-11/+20
Patch series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy:, v4. This patchset is to optimize the cross-socket memory access with MPOL_PREFERRED_MANY policy. To test this patch we ran the following test on a 3 node system. Node 0 - 2GB - Tier 1 Node 1 - 11GB - Tier 1 Node 6 - 10GB - Tier 2 Below changes are made to memcached to set the memory policy, It select Node0 and Node1 as preferred nodes. #include <numaif.h> #include <numa.h> unsigned long nodemask; int ret; nodemask = 0x03; ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING, &nodemask, 10); /* If MPOL_F_NUMA_BALANCING isn't supported, * fall back to MPOL_PREFERRED_MANY */ if (ret < 0 && errno == EINVAL){ printf("set mem policy normal\n"); ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10); } if (ret < 0) { perror("Failed to call set_mempolicy"); exit(-1); } Test Procedure: =============== 1. Make sure memory tiring and demotion are enabled. 2. Start memcached. # ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7 -d -s "/tmp/memcached.sock" 3. Run memtier_benchmark to store 3200000 keys. #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary --threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1 --key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024 4. Start a memory eater on node 0 and 1. This will demote all memcached pages to node 6. 5. Make sure all the memcached pages got demoted to lower tier by reading /proc/<memcaced PID>/numa_maps. # cat /proc/2771/numa_maps --- default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64 default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64 --- 6. Kill memory eater. 7. Read the pgpromote_success counter. 8. Start reading the keys by running memtier_benchmark. #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary --pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R --key-minimum=1 --key-maximum=3200000 -n allkeys --threads=64 -c 1 -R -x 6 9. Read the pgpromote_success counter. Test Results: ============= Without Patch ------------------ 1. pgpromote_success before test Node 0: pgpromote_success 11 Node 1: pgpromote_success 140974 pgpromote_success after test Node 0: pgpromote_success 11 Node 1: pgpromote_success 140974 2. Memtier-benchmark result. AGGREGATED AVERAGE RESULTS (6 runs) ================================================================== Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency ------------------------------------------------------------------ Sets 0.00 --- --- --- --- Gets 305792.03 305791.93 0.10 0.18949 0.16700 Waits 0.00 --- --- --- --- Totals 305792.03 305791.93 0.10 0.18949 0.16700 ====================================== p99 Latency p99.9 Latency KB/sec ------------------------------------- --- --- 0.00 0.44700 1.71100 11542.69 --- --- --- 0.44700 1.71100 11542.69 With Patch --------------- 1. pgpromote_success before test Node 0: pgpromote_success 5 Node 1: pgpromote_success 89386 pgpromote_success after test Node 0: pgpromote_success 57895 Node 1: pgpromote_success 141463 2. Memtier-benchmark result. AGGREGATED AVERAGE RESULTS (6 runs) ==================================================================== Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency -------------------------------------------------------------------- Sets 0.00 --- --- --- --- Gets 521942.24 521942.07 0.17 0.11459 0.10300 Waits 0.00 --- --- --- --- Totals 521942.24 521942.07 0.17 0.11459 0.10300 ======================================= p99 Latency p99.9 Latency KB/sec --------------------------------------- --- --- 0.00 0.23100 0.31900 19701.68 --- --- --- 0.23100 0.31900 19701.68 Test Result Analysis: ===================== 1. With patch we could observe pages are getting promoted. 2. Memtier-benchmark results shows that, with the patch, performance has increased more than 50%. Ops/sec without fix - 305792.03 Ops/sec with fix - 521942.24 This patch (of 2): Instead of using 'cpu_to_node()', we use 'numa_node_id()', which is quicker. smp_processor_id is guaranteed to be stable in the 'mpol_misplaced()' function because it is called with ptl held. lockdep_assert_held was added to ensure that. No functional change in this patch. [donettom@linux.ibm.com: add "* @vmf: structure describing the fault" comment] Link: https://lkml.kernel.org/r/d8b993ea9dccfac0bc3ed61d3a81f4ac5f376e46.1711002865.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: zswap: remove unnecessary check in zswap_find_zpool()Yosry Ahmed1-6/+1
zswap_find_zpool() checks if ZSWAP_NR_ZPOOLS > 1, which is always true. This is a remnant from a patch version that had ZSWAP_NR_ZPOOLS as a config option and never made it upstream. Remove the unnecessary check. Link: https://lkml.kernel.org/r/20240311235210.2937484-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26lib/test_hmm.c: handle src_pfns and dst_pfns allocation failureDuoming Zhou1-4/+4
The kcalloc() in dmirror_device_evict_chunk() will return null if the physical memory has run out. As a result, if src_pfns or dst_pfns is dereferenced, the null pointer dereference bug will happen. Moreover, the device is going away. If the kcalloc() fails, the pages mapping a chunk could not be evicted. So add a __GFP_NOFAIL flag in kcalloc(). Finally, as there is no need to have physically contiguous memory, Switch kcalloc() to kvcalloc() in order to avoid failing allocations. Link: https://lkml.kernel.org/r/20240312005905.9939-1-duoming@zju.edu.cn Fixes: b2ef9f5a5cb3 ("mm/hmm/test: add selftest driver for HMM") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: zpool: return pool size in pagesJohannes Weiner6-23/+23
All zswap backends track their pool sizes in pages. Currently they multiply by PAGE_SIZE for zswap, only for zswap to divide again in order to do limit math. Report pages directly. Link: https://lkml.kernel.org/r/20240312153901.3441-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: zswap: optimize zswap pool size trackingJohannes Weiner3-54/+52
Profiling the munmap() of a zswapped memory region shows 60% of the total cycles currently going into updating the zswap_pool_total_size. There are three consumers of this counter: - store, to enforce the globally configured pool limit - meminfo & debugfs, to report the size to the user - shrink, to determine the batch size for each cycle Instead of aggregating everytime an entry enters or exits the zswap pool, aggregate the value from the zpools on-demand: - Stores aggregate the counter anyway upon success. Aggregating to check the limit instead is the same amount of work. - Meminfo & debugfs might benefit somewhat from a pre-aggregated counter, but aren't exactly hotpaths. - Shrinking can aggregate once for every cycle instead of doing it for every freed entry. As the shrinker might work on tens or hundreds of objects per scan cycle, this is a large reduction in aggregations. The paths that benefit dramatically are swapin, swapoff, and unmaps. There could be millions of pages being processed until somebody asks for the pool size again. This eliminates the pool size updates from those paths entirely. Top profile entries for a 24G range munmap(), before: 38.54% zswap-unmap [kernel.kallsyms] [k] zs_zpool_total_size 12.51% zswap-unmap [kernel.kallsyms] [k] zpool_get_total_size 9.10% zswap-unmap [kernel.kallsyms] [k] zswap_update_total_size 2.95% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap 2.88% zswap-unmap [kernel.kallsyms] [k] __slab_free 2.86% zswap-unmap [kernel.kallsyms] [k] xas_store and after: 7.70% zswap-unmap [kernel.kallsyms] [k] __slab_free 7.16% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap 6.74% zswap-unmap [kernel.kallsyms] [k] xas_store It was also briefly considered to move to a single atomic in zswap that is updated by the backends, since zswap only cares about the sum of all pools anyway. However, zram directly needs per-pool information out of zsmalloc. To keep the backend from having to update two atomics every time, I opted for the lazy aggregation instead for now. Link: https://lkml.kernel.org/r/20240312153901.3441-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: document pXd_leaf() APIPeter Xu1-5/+19
There's one small section already, but since we're going to remove pXd_huge(), that comment may start to obsolete. Rewrite that section with more information, hopefully with that the API is crystal clear on what it implies. Link: https://lkml.kernel.org/r/20240318200404.448346-15-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/arm: remove pmd_thp_or_huge()Peter Xu4-6/+2
ARM/ARM64 used to define pmd_thp_or_huge(). Now this macro is completely redundant. Remove it and use pmd_leaf(). Link: https://lkml.kernel.org/r/20240318200404.448346-14-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Mark Salter <msalter@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/treewide: remove pXd_huge()Peter Xu17-198/+2
This API is not used anymore, drop it for the whole tree. Link: https://lkml.kernel.org/r/20240318200404.448346-13-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/treewide: replace pXd_huge() with pXd_leaf()Peter Xu10-15/+15
Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(), reuse it. Luckily, pXd_huge() isn't widely used. Link: https://lkml.kernel.org/r/20240318200404.448346-12-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/gup: merge pXd huge mapping checksPeter Xu1-4/+3
Huge mapping checks in GUP are slightly redundant and can be simplified. pXd_huge() now is the same as pXd_leaf(). pmd_trans_huge() and pXd_devmap() should both imply pXd_leaf(). Time to merge them into one. Link: https://lkml.kernel.org/r/20240318200404.448346-11-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/powerpc: redefine pXd_huge() with pXd_leaf()Peter Xu2-27/+14
PowerPC book3s 4K mostly has the same definition on both, except pXd_huge() constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1], it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so it will keep returning false. As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb pgtable walker __find_linux_pte() already used pXd_leaf() to check leaf hugetlb mappings. The goal should be that we will have one API pXd_leaf() to detect all kinds of huge mappings (hugepd is still special in this case, though). AFAICT we need to use the pXd_leaf() impl (rather than pXd_huge()'s) to make sure ie. THPs on hash MMU will also return true. This helps to simplify a follow up patch to drop pXd_huge() treewide. NOTE: *_leaf() definition need to be moved before the inclusion of asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with it. [1] https://lore.kernel.org/r/87v85zo6w7.fsf@mail.lhotse Link: https://lkml.kernel.org/r/20240318200404.448346-10-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/arm64: merge pXd_huge() and pXd_leaf() definitionsPeter Xu2-6/+6
Unlike most archs, aarch64 defines pXd_huge() and pXd_leaf() slightly differently. Redefine the pXd_huge() with pXd_leaf(). There used to be two traps for old aarch64 definitions over these APIs that I found when reading the code around, they're: (1) 4797ec2dc83a ("arm64: fix pud_huge() for 2-level pagetables") (2) 23bc8f69f0ec ("arm64: mm: fix p?d_leaf()") Define pXd_huge() with the current pXd_leaf() will make sure (2) isn't a problem (on PROT_NONE checks). To make sure it also works for (1), we move over the __PAGETABLE_PMD_FOLDED check to pud_leaf(), allowing it to constantly returning "false" for 2-level pgtables, which looks even safer to cover both now. Link: https://lkml.kernel.org/r/20240318200404.448346-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Mark Salter <msalter@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/arm: redefine pmd_huge() with pmd_leaf()Peter Xu1-6/+1
Most of the archs already define these two APIs the same way. ARM is more complicated in two aspects: - For pXd_huge() it's always checking against !PXD_TABLE_BIT, while for pXd_leaf() it's always checking against PXD_TYPE_SECT. - SECT/TABLE bits are defined differently on 2-level v.s. 3-level ARM pgtables, which makes the whole thing even harder to follow. Luckily, the second complexity should be hidden by the pmd_leaf() implementation against 2-level v.s. 3-level headers. Invoke pmd_leaf() directly for pmd_huge(), to remove the first part of complexity. This prepares to drop pXd_huge() API globally. When at it, drop the obsolete comments - it's outdated. Link: https://lkml.kernel.org/r/20240318200404.448346-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/arm: use macros to define pmd/pud helpersPeter Xu3-4/+5
It's already confusing that ARM 2-level v.s. 3-level defines SECT bit differently on pmd/puds. Always use a macro which is much clearer. Link: https://lkml.kernel.org/r/20240318200404.448346-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/sparc: change pXd_huge() behavior to exclude swap entriesPeter Xu1-4/+2
Please refer to the previous patch on the reasoning for x86. Now sparc is the only architecture that will allow swap entries to be reported as pXd_huge(). After this patch, all architectures should forbid swap entries in pXd_huge(). [akpm@linux-foundation.org: s/;;/;/, per Muchun] Link: https://lkml.kernel.org/r/20240318200404.448346-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/x86: change pXd_huge() behavior to exclude swap entriesPeter Xu1-14/+4
This patch partly reverts below commits: 3a194f3f8ad0 ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry") cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage") Right now, pXd_huge() definition across kernel is unclear. We have two groups that think differently on swap entries: - x86/sparc: Allow pXd_huge() to accept swap entries - all the rest: Doesn't allow pXd_huge() to accept swap entries This is so confusing. Since the sparc helpers seem to be added in 2016, which is after x86's (2015), so sparc could have followed a trend. x86 proposed such swap handling in 2015 to resolve hugetlb swap entries hit in GUP, but now GUP guards swap entries with !pXd_present() in all layers so we should be safe. We should define this API properly, one way or another, rather than keep them defined differently across archs. Gut feeling tells me that pXd_huge() shouldn't include swap entries, and it turns out that I am not the only one thinking so, the question was raised when the current pmd_huge() for x86 was proposed by Ville Syrjälä: https://lore.kernel.org/all/Y2WQ7I4LXh8iUIRd@intel.com/ I might also be missing something obvious, but why is it even necessary to treat PRESENT==0+PSE==0 as a huge entry? It is also questioned when Jason Gunthorpe reviewed the other patchset on swap entry handlings: https://lore.kernel.org/all/20240221125753.GQ13330@nvidia.com/ Revert its meaning back to original. It shouldn't have any functional change as we should be ready with guards on !pXd_present() explicitly everywhere. Note that I also dropped the "#if CONFIG_PGTABLE_LEVELS > 2", it was there probably because it was breaking things when 3a194f3f8ad0 was proposed, according to the report here: https://lore.kernel.org/all/Y2LYXItKQyaJTv8j@intel.com/ Now we shouldn't need that. Instead of reverting to _PAGE_PSE raw check, leverage pXd_leaf(). Link: https://lkml.kernel.org/r/20240318200404.448346-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/gup: check p4d presence before going onPeter Xu1-2/+2
Currently there should have no p4d swap entries so it may not matter much, however this may help us to rule out swap entries in pXd_huge() API, which will include p4d_huge(). The p4d_present() checks make it 100% clear that we won't rely on p4d_huge() for swap entries. Link: https://lkml.kernel.org/r/20240318200404.448346-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/gup: cache p4d in follow_p4d_mask()Peter Xu1-6/+7
Add a variable to cache p4d in follow_p4d_mask(). It's a good practise to make sure all the following checks will have a consistent view of the entry. Link: https://lkml.kernel.org/r/20240318200404.448346-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/hmm: process pud swap entry without pud_huge()Peter Xu1-6/+1
Swap pud entries do not always return true for pud_huge() for all archs. x86 and sparc (so far) allow it, but all the rest do not accept a swap entry to be reported as pud_huge(). So it's not safe to check swap entries within pud_huge(). Check swap entries before pud_huge(), so it should be always safe. This is the only place in the kernel that (IMHO, wrongly) relies on pud_huge() to return true on pud swap entries. The plan is to cleanup pXd_huge() to only report non-swap mappings for all archs. Link: https://lkml.kernel.org/r/20240318200404.448346-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: page_alloc: control latency caused by zone PCP drainingLucas Stach1-4/+7
Patch series "mm/treewide: Remove pXd_huge() API", v2. In previous work [1], we removed the pXd_large() API, which is arch specific. This patchset further removes the hugetlb pXd_huge() API. Hugetlb was never special on creating huge mappings when compared with other huge mappings. Having a standalone API just to detect such pgtable entries is more or less redundant, especially after the pXd_leaf() API set is introduced with/without CONFIG_HUGETLB_PAGE. When looking at this problem, a few issues are also exposed that we don't have a clear definition of the *_huge() variance API. This patchset started by cleaning these issues first, then replace all *_huge() users to use *_leaf(), then drop all *_huge() code. On x86/sparc, swap entries will be reported "true" in pXd_huge(), while for all the rest archs they're reported "false" instead. This part is done in patch 1-5, in which I suspect patch 1 can be seen as a bug fix, but I'll leave that to hmm experts to decide. Besides, there are three archs (arm, arm64, powerpc) that have slightly different definitions between the *_huge() v.s. *_leaf() variances. I tackled them separately so that it'll be easier for arch experts to chim in when necessary. This part is done in patch 6-9. The final patches 10-14 do the rest on the final removal, since *_leaf() will be the ultimate API in the future, and we seem to have quite some confusions on how *_huge() APIs can be defined, provide a rich comment for *_leaf() API set to define them properly to avoid future misuse, and hopefully that'll also help new archs to start support huge mappings and avoid traps (like either swap entries, or PROT_NONE entry checks). [1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@redhat.com This patch (of 14): When the complete PCP is drained a much larger number of pages than the usual batch size might be freed at once, causing large IRQ and preemption latency spikes, as they are all freed while holding the pcp and zone spinlocks. To avoid those latency spikes, limit the number of pages freed in a single bulk operation to common batch limits. Link: https://lkml.kernel.org/r/20240318200404.448346-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240318200736.2835502-1-l.stach@pengutronix.de Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Mark Salter <msalter@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>