summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-02-24Docs/admin-guide/mm/damon/usage: document quota goal metric fileSeongJae Park1-6/+6
Update DAMON usage document for the quota goal target_metric file. [sj@kernel.org: fix a typo on the auto-tuning design reference link] Link: https://lkml.kernel.org/r/20240221170852.55529-3-sj@kernel.org Link: https://lkml.kernel.org/r/20240219194431.159606-18-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24Docs/ABI/damon: document quota goal metric fileSeongJae Park1-0/+6
Update DAMON ABI document for the quota goal target_metric file. Link: https://lkml.kernel.org/r/20240219194431.159606-17-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24Docs/mm/damon/design: document quota goal self-tuningSeongJae Park1-2/+18
Update DAMON design doc to explain the quota goal self-tuning, which can be used by setting the goal's metric to metrics that kernel can self-retrieve. Link: https://lkml.kernel.org/r/20240219194431.159606-16-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon/sysfs-schemes: support PSI-based quota auto-tuneSeongJae Park1-2/+40
Extend DAMON sysfs interface to support the PSI-based quota auto-tuning by adding a new file, 'target_metric' under the quota goal directory. Old users don't get any behavioral changes since the default value of the metric is 'user input'. Link: https://lkml.kernel.org/r/20240219194431.159606-15-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon/core: implement PSI metric DAMOS quota goalSeongJae Park2-0/+32
Extend DAMOS quota goal metric with system wide memory pressure stall time. Specifically, the system level 'some' PSI for memory is used. The target value can be set in microseconds. DAMOS measures the increased amount of the PSI metric in last quota_reset_interval and use the ratio of it versus the user-specified target PSI value as the score for the auto-tuning feedback loop. Link: https://lkml.kernel.org/r/20240219194431.159606-14-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon/core: support multiple metrics for quota goalSeongJae Park3-8/+45
DAMOS quota auto-tuning asks users to assess the current tuned quota and provide the feedback in a manual and repeated way. It allows users generate the feedback from a source that the kernel cannot access, and writing a script or a function for doing the manual and repeated feeding is not a big deal. However, additional works are additional works, and it could be more efficient if DAMOS could do the fetch itself, especially in case of DAMON sysfs interface use case, since it can avoid the context switches between the user-space and the kernel-space, though the overhead would be only trivial in most cases. Also in many cases, feedbacks could be made from kernel-accessible sources, such as PSI, CPU usage, etc. Make the quota goal to support multiple types of metrics including such ones. Link: https://lkml.kernel.org/r/20240219194431.159606-13-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon/core: let goal specified with only target and current valuesSeongJae Park3-22/+16
DAMOS quota auto-tuning feature let users to set the goal by providing a function for getting the current score of the tuned quota. It allows flexible goal setup, but only simple user-set quota is currently being used. As a result, the only user of the DAMOS quota auto-tuning is using a silly void pointer casting based score value passing function. Simplify the interface and the user code by letting user directly set the target and the current value. Link: https://lkml.kernel.org/r/20240219194431.159606-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon/core: remove ->goal field of damos_quotaSeongJae Park2-20/+9
DAMOS quota auto-tuning feature supports static signle goal and dynamic multiple goals via DAMON kernel API, specifically via ->goal and ->goals fields of damos_quota struct, respectively. All in-tree DAMOS kernel API users are using only the dynamic multiple goals now. Remove the unsued static single goal interface. Link: https://lkml.kernel.org/r/20240219194431.159606-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon/sysfs: use only quota->goalsSeongJae Park3-19/+35
DAMON sysfs interface implements multiple quota auto-tuning goals on its level since the DAMOS core logic was supporting only single goal. Now the core logic supports multiple goals on its level. Update DAMON sysfs interface to reuse the core logic and drop unnecessary duplicated multiple goals implementation. Link: https://lkml.kernel.org/r/20240219194431.159606-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon/core: add multiple goals per damos_quota and helpers for thoseSeongJae Park2-7/+88
The feedback-driven DAMOS quota auto-tuning feature allows only single goal to the DAMON kernel API users. The API users could implement multiple goals for the end-users on their level, and that's what DAMON sysfs interface is doing. More DAMON kernel API users such as DAMON_RECLAIM would need to do similar work. To reduce unnecessary future duplciated efforts, support multiple goals from DAMOS core layer. To make the support in minimum non-destructive change, keep the old single goal setup interface, and add multiple goals setup. The single goal will treated as one of the multiple goals, so old API users are not required to make any change. Link: https://lkml.kernel.org/r/20240219194431.159606-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon/core: split out quota goal related fields to a structSeongJae Park3-25/+34
'struct damos_quota' is not small now. Split out fields for quota goal to a separate struct for easier reading. Link: https://lkml.kernel.org/r/20240219194431.159606-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon: move comments and fields for damos-quota-prioritization to the endSeongJae Park1-16/+14
The comments and definition of 'struct damos_quota' lists a few fields for effective quota generation first, fields for regions prioritization under the quota, and then remaining fields for effective quota generation. Readers' should unnecesssarily switch their context in the middle. List all the fields for the effective quota first, and then fields for the prioritization for making it easier to read. Link: https://lkml.kernel.org/r/20240219194431.159606-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24Docs/admin-guide/mm/damon/usage: document effective_bytes fileSeongJae Park1-3/+16
Update DAMON usage document for the effective quota file of the DAMON sysfs interface. Link: https://lkml.kernel.org/r/20240219194431.159606-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24Docs/ABI/damon: document effective_bytes sysfs fileSeongJae Park1-1/+9
Update the DAMON ABI doc for the effective_bytes sysfs file and the kdamond state file input command for updating the content of the file. Link: https://lkml.kernel.org/r/20240219194431.159606-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon/sysfs: implement a kdamond command for updating schemes' effective ↵SeongJae Park3-0/+56
quotas Implement yet another kdamond 'state' file input command, namely 'update_schemes_effective_quotas'. If it is written, the 'effective_bytes' files of the kdamond will be updated to provide the current effective size quota of each scheme in bytes. Link: https://lkml.kernel.org/r/20240219194431.159606-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon/sysfs-schemes: implement quota effective_bytes fileSeongJae Park1-0/+14
DAMON sysfs interface allows users to set two types of quotas, namely time quota and size quota. DAMOS converts time quota to a size quota and use smaller one among the resulting two size quotas. The resulting effective size quota can be helpful for debugging and analysis, but not exposed to the user. The recently added feedback-driven quota auto-tuning is making it even more mysterious. Implement a DAMON sysfs interface read-only empty file, namely 'effective_bytes', under the quota goal DAMON sysfs directory. It will be extended to expose the effective quota to the end user. Link: https://lkml.kernel.org/r/20240219194431.159606-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/damon/core: set damos_quota->esz as public field and documentSeongJae Park2-6/+8
Patch series "mm/damon: let DAMOS feeds and tame/auto-tune itself". The Aim-oriented Feedback-driven DAMOS Aggressiveness Auto-tuning patchset[1] which has merged since commit 9294a037c015 ("mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning") made the mechanism and the policy separated. That is, users can set a part of DAMOS control policies without a deep understanding of the mechanism but just their demands such as SLA. However, users are still required to do some additional work of manually collecting their target metric and feeding it to DAMOS. In the case of end-users who use DAMON sysfs interface, the context switches between user-space and kernel-space could also make it inefficient. The overhead is supposed to be only trivial in common cases, though. Meanwhile, in simple use cases, the target metric could be common system metrics that the kernel can efficiently self-retrieve, such as memory pressure stall time (PSI). Extend DAMOS quota auto-tuning to support multiple types of metrics including the DAMOS self-retrievable ones, and add support for memory pressure stall time metric. Different types of metrics can be supported in future. The auto-tuning capability is currently supported for only users of DAMOS kernel API and DAMON sysfs interface. Extend the support to DAMON_RECLAIM. Patches Sequence ================ First five patches are for helping debugging and fine-tuning existing quota control features. The first one (patch 1) exposes the effective quota that is made with given user inputs to DAMOS kernel API users and kernel-doc documents. Following four patches implement (patches 1, 2 and 3) and document (patches 4 and 5) a new DAMON sysfs file that exposes the value. Following six patches cleanup and simplify the existing DAMOS quota auto-tuning code by improving layout of comments and data structures (patches 6 and 7), supporting common use cases, namely multiple goals (patches 8, 9 and 10), and simplifying the interface (patch 11). Then six patches for the main purpose of this patchset follow. The first three changes extend the core logic for various target metrics (patch 12), implement memory pressure stall time-based target metric support (patch 13), and update DAMON sysfs interface to support the new target metric (patch 14). Then, documentation updates for the features on design (patch 15), ABI (patch 16), and usage (patch 17) follow. Last three patches add auto-tuning support on DAMON_RECLAIM. The patches implement DAMON_RECLAIM parameters for user-feedback driven quota auto-tuning (patch 18), memory pressure stall time-driven quota self-tuning (patch 19), and finally update the DAMON_RECLAIM usage document for the new parameters (patch 20). [1] https://lore.kernel.org/all/20231130023652.50284-1-sj@kernel.org/ This patch (of 20): DAMOS allow users to specify the quota as they want in multiple ways including time quota, size quota, and feedback-based auto-tuning. DAMOS makes one effective quota out of the inputs and use it at the end. Knowing the current effective quota helps understanding DAMOS' internal mechanism and fine-tuning quotas. DAMON kernel API users can get the information from ->esz field of damos_quota struct, but the field is marked as private purpose, and not kernel-doc documented. Make it public and document. Link: https://lkml.kernel.org/r/20240219194431.159606-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240219194431.159606-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP checkLance Yang1-6/+12
khugepaged scans the entire address space in the background for each given mm, looking for opportunities to merge sequences of basic pages into huge pages. However, when an mm is inserted to the mm_slots list, and the MMF_DISABLE_THP flag is set later, this scanning process becomes unnecessary for that mm and can be skipped to avoid redundant operations, especially in scenarios with a large address space. On an Intel Core i5 CPU, the time taken by khugepaged to scan the address space of the process, which has been set with the MMF_DISABLE_THP flag after being added to the mm_slots list, is as follows (shorter is better): VMA Count | Old | New | Change --------------------------------------- 50 | 23us | 9us | -60.9% 100 | 32us | 9us | -71.9% 200 | 44us | 9us | -79.5% 400 | 75us | 9us | -88.0% 800 | 98us | 9us | -90.8% Once the count of VMAs for the process exceeds page_to_scan, khugepaged needs to wait for scan_sleep_millisecs ms before scanning the next process. IMO, unnecessary scans could actually be skipped with a very inexpensive mm->flags check in this case. This commit introduces a check before each scanning process to test the MMF_DISABLE_THP flag for the given mm; if the flag is set, the scanning process is bypassed, thereby improving the efficiency of khugepaged. This optimization is not a correctness issue but rather an enhancement to save expensive checks on each VMA when userspace cannot prctl itself before spawning into the new process. On some servers within our company, we deploy a daemon responsible for monitoring and updating local applications. Some applications prefer not to use THP, so the daemon calls prctl to disable THP before fork/exec. Conversely, for other applications, the daemon calls prctl to enable THP before fork/exec. Ideally, the daemon should invoke prctl after the fork, but its current implementation follows the described approach. In the Go standard library, there is no direct encapsulation of the fork system call; instead, fork and execve are combined into one through syscall.ForkExec. Link: https://lkml.kernel.org/r/20240129054551.57728-1-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24MAINTAINERS: update mm and memcg entriesMike Rapoport (IBM)1-0/+10
Add F: lines for memory management and memory cgroup include files. Link: https://lkml.kernel.org/r/20240208055727.142387-1-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24arch, crash: move arch_crash_save_vmcoreinfo() out to file vmcore_info.cBaoquan He12-61/+82
Nathan reported below building error: ===== $ curl -LSso .config https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.armv7 $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- olddefconfig all .. arm-linux-gnueabi-ld: arch/arm/kernel/machine_kexec.o: in function `arch_crash_save_vmcoreinfo': machine_kexec.c:(.text+0x488): undefined reference to `vmcoreinfo_append_str' ==== On architecutres, like arm, s390, ppc, sh, function arch_crash_save_vmcoreinfo() is located in machine_kexec.c and it can only be compiled in when CONFIG_KEXEC_CORE=y. That's not right because arch_crash_save_vmcoreinfo() is used to export arch specific vmcoreinfo. CONFIG_VMCORE_INFO is supposed to control its compiling in. However, CONFIG_VMVCORE_INFO could be independent of CONFIG_KEXEC_CORE, e.g CONFIG_PROC_KCORE=y will select CONFIG_VMVCORE_INFO. Or CONFIG_KEXEC/CONFIG_KEXEC_FILE is set while CONFIG_CRASH_DUMP is not set, it will report linking error. So, on arm, s390, ppc and sh, move arch_crash_save_vmcoreinfo out to a new file vmcore_info.c. Let CONFIG_VMCORE_INFO decide if compiling in arch_crash_save_vmcoreinfo(). [akpm@linux-foundation.org: remove stray newlines at eof] Link: https://lkml.kernel.org/r/20240129135033.157195-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24loongarch, crash: wrap crash dumping code into crash related ifdefsBaoquan He1-1/+1
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on loongarch with some adjustments. Here use IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide if compiling in the crashkernel reservation code. Link: https://lkml.kernel.org/r/20240124051254.67105-15-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24arm, crash: wrap crash dumping code into crash related ifdefsBaoquan He1-2/+2
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on arm with some adjustments. Here use CONFIG_CRASH_RESERVE ifdef to replace CONFIG_KEXEC ifdef. Link: https://lkml.kernel.org/r/20240124051254.67105-14-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24riscv, crash: wrap crash dumping code into crash related ifdefsBaoquan He2-3/+8
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on risc-v with some adjustments. Here wrap up crash dumping codes with CONFIG_CRASH_DUMP ifdeffery, and use IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide if compiling in the crashkernel reservation code. Link: https://lkml.kernel.org/r/20240124051254.67105-13-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mips, crash: wrap crash dumping code into crash related ifdefsBaoquan He1-11/+6
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on mips with some adjustments. Here use IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide if compiling in the crashkernel reservation code. Link: https://lkml.kernel.org/r/20240124051254.67105-12-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24sh, crash: wrap crash dumping code into crash related ifdefsBaoquan He2-1/+4
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on SuperH with some adjustments. Wrap up crash dumping codes with CONFIG_CRASH_DUMP ifdeffery, and use IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide if compiling in the crashkernel reservation code. Link: https://lkml.kernel.org/r/20240124051254.67105-11-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24s390, crash: wrap crash dumping code into crash related ifdefsBaoquan He3-0/+14
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on s390 with some adjustments. Here wrap up crash dumping codes with CONFIG_CRASH_DUMP ifdeffery. Link: https://lkml.kernel.org/r/20240124051254.67105-10-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24ppc, crash: enforce KEXEC and KEXEC_FILE to select CRASH_DUMPBaoquan He1-0/+6
In PowerPC, the crash dumping and kexec reboot share code in arch_kexec_locate_mem_hole(), in which struct crash_mem is used. Here enfoce enforce KEXEC and KEXEC_FILE to select CRASH_DUMP for now. [bhe@redhat.com: fix allnoconfig on ppc] Link: https://lkml.kernel.org/r/ZbJwMyCpz4HDySoo@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20240124051254.67105-9-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24arm64, crash: wrap crash dumping code into crash related ifdefsBaoquan He6-5/+15
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on arm64 with some adjustments. Here wrap up crash dumping codes with CONFIG_CRASH_DUMP ifdeffery. [bhe@redhat.com: fix building error in generic codes] Link: https://lkml.kernel.org/r/20240129135033.157195-2-bhe@redhat.com Link: https://lkml.kernel.org/r/20240124051254.67105-8-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24x86, crash: wrap crash dumping code into crash related ifdefsBaoquan He10-11/+28
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on x86 with some adjustments. Here, also change some ifdefs or IS_ENABLED() check to more appropriate ones, e,g - #ifdef CONFIG_KEXEC_CORE -> #ifdef CONFIG_CRASH_DUMP - (!IS_ENABLED(CONFIG_KEXEC_CORE)) - > (!IS_ENABLED(CONFIG_CRASH_RESERVE)) [bhe@redhat.com: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope] Link: https://lore.kernel.org/all/SN6PR02MB4157931105FA68D72E3D3DB8D47B2@SN6PR02MB4157.namprd02.prod.outlook.com/T/#u Link: https://lkml.kernel.org/r/20240124051254.67105-7-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24crash: clean up kdump related config itemsBaoquan He1-3/+4
By splitting CRASH_RESERVE and VMCORE_INFO out from CRASH_CORE, cleaning up the dependency of FA_DMUMP on CRASH_DUMP, and moving crash codes from kexec_core.c to crash_core.c, now we can rearrange CRASH_DUMP to depend on KEXEC_CORE, and make CRASH_DUMP select CRASH_RESERVE and VMCORE_INFO. KEXEC_CORE won't select CRASH_RESERVE and VMCORE_INFO any more because KEXEC_CORE enables codes which allocate control pages, copy kexec/kdump segments, and prepare for switching. These codes are shared by both kexec reboot and crash dumping. Doing this makes codes and the corresponding config items more logical (the right item depends on or is selected by the left item). PROC_KCORE -----------> VMCORE_INFO |----------> VMCORE_INFO FA_DUMP----| |----------> CRASH_RESERVE ---->VMCORE_INFO / |---->CRASH_RESERVE KEXEC --| /| |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE KEXEC_FILE --| \ | \---->CRASH_HOTPLUG KEXEC --| |--> KEXEC_CORE--> kexec reboot KEXEC_FILE --| Link: https://lkml.kernel.org/r/20240124051254.67105-6-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24crash: split crash dumping code out from kexec_core.cBaoquan He10-292/+359
Currently, KEXEC_CORE select CRASH_CORE automatically because crash codes need be built in to avoid compiling error when building kexec code even though the crash dumping functionality is not enabled. E.g -------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC=y CONFIG_KEXEC_FILE=y --------------------- After splitting out crashkernel reservation code and vmcoreinfo exporting code, there's only crash related code left in kernel/crash_core.c. Now move crash related codes from kexec_core.c to crash_core.c and only build it in when CONFIG_CRASH_DUMP=y. And also wrap up crash codes inside CONFIG_CRASH_DUMP ifdeffery scope, or replace inappropriate CONFIG_KEXEC_CORE ifdef with CONFIG_CRASH_DUMP ifdef in generic kernel files. With these changes, crash_core codes are abstracted from kexec codes and can be disabled at all if only kexec reboot feature is wanted. Link: https://lkml.kernel.org/r/20240124051254.67105-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24crash: remove dependency of FA_DUMP on CRASH_DUMPBaoquan He4-4/+3
In kdump kernel, /proc/vmcore is an elf file mapping the crashed kernel's old memory content. Its elf header is constructed in 1st kernel and passed to kdump kernel via elfcorehdr_addr. Config CRASH_DUMP enables the code of 1st kernel's old memory accessing in different architectures. Currently, config FA_DUMP has dependency on CRASH_DUMP because fadump needs access global variable 'elfcorehdr_addr' to judge if it's in kdump kernel within function is_kdump_kernel(). In the current kernel/crash_dump.c, variable 'elfcorehdr_addr' is defined, and function setup_elfcorehdr() used to parse kernel parameter to fetch the passed value of elfcorehdr_addr. Only for accessing elfcorehdr_addr, FA_DUMP really doesn't have to depends on CRASH_DUMP. To remove the dependency of FA_DUMP on CRASH_DUMP to avoid confusion, rename kernel/crash_dump.c to kernel/elfcorehdr.c, and build it when CONFIG_VMCORE_INFO is ebabled. With this, FA_DUMP doesn't need to depend on CRASH_DUMP. [bhe@redhat.com: power/fadump: make FA_DUMP select CRASH_DUMP] Link: https://lkml.kernel.org/r/Zb8D1ASrgX0qVm9z@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20240124051254.67105-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24crash: split vmcoreinfo exporting code out from crash_core.cBaoquan He24-309/+342
Now move the relevant codes into separate files: kernel/crash_reserve.c, include/linux/crash_reserve.h. And add config item CRASH_RESERVE to control its enabling. And also update the old ifdeffery of CONFIG_CRASH_CORE, including of <linux/crash_core.h> and config item dependency on CRASH_CORE accordingly. And also do renaming as follows: - arch/xxx/kernel/{crash_core.c => vmcore_info.c} because they are only related to vmcoreinfo exporting on x86, arm64, riscv. And also Remove config item CRASH_CORE, and rely on CONFIG_KEXEC_CORE to decide if build in crash_core.c. [yang.lee@linux.alibaba.com: remove duplicated include in vmcore_info.c] Link: https://lkml.kernel.org/r/20240126005744.16561-1-yang.lee@linux.alibaba.com Link: https://lkml.kernel.org/r/20240124051254.67105-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24kexec: split crashkernel reservation code out from crash_core.cBaoquan He15-491/+531
Patch series "Split crash out from kexec and clean up related config items", v3. Motivation: ============= Previously, LKP reported a building error. When investigating, it can't be resolved reasonablly with the present messy kdump config items. https://lore.kernel.org/oe-kbuild-all/202312182200.Ka7MzifQ-lkp@intel.com/ The kdump (crash dumping) related config items could causes confusions: Firstly, CRASH_CORE enables codes including - crashkernel reservation; - elfcorehdr updating; - vmcoreinfo exporting; - crash hotplug handling; Now fadump of powerpc, kcore dynamic debugging and kdump all selects CRASH_CORE, while fadump - fadump needs crashkernel parsing, vmcoreinfo exporting, and accessing global variable 'elfcorehdr_addr'; - kcore only needs vmcoreinfo exporting; - kdump needs all of the current kernel/crash_core.c. So only enabling PROC_CORE or FA_DUMP will enable CRASH_CORE, this mislead people that we enable crash dumping, actual it's not. Secondly, It's not reasonable to allow KEXEC_CORE select CRASH_CORE. Because KEXEC_CORE enables codes which allocate control pages, copy kexec/kdump segments, and prepare for switching. These codes are shared by both kexec reboot and kdump. We could want kexec reboot, but disable kdump. In that case, CRASH_CORE should not be selected. -------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC=y CONFIG_KEXEC_FILE=y --------------------- Thirdly, It's not reasonable to allow CRASH_DUMP select KEXEC_CORE. That could make KEXEC_CORE, CRASH_DUMP are enabled independently from KEXEC or KEXEC_FILE. However, w/o KEXEC or KEXEC_FILE, the KEXEC_CORE code built in doesn't make any sense because no kernel loading or switching will happen to utilize the KEXEC_CORE code. --------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_CRASH_DUMP=y --------------------- In this case, what is worse, on arch sh and arm, KEXEC relies on MMU, while CRASH_DUMP can still be enabled when !MMU, then compiling error is seen as the lkp test robot reported in above link. ------arch/sh/Kconfig------ config ARCH_SUPPORTS_KEXEC def_bool MMU config ARCH_SUPPORTS_CRASH_DUMP def_bool BROKEN_ON_SMP --------------------------- Changes: =========== 1, split out crash_reserve.c from crash_core.c; 2, split out vmcore_infoc. from crash_core.c; 3, move crash related codes in kexec_core.c into crash_core.c; 4, remove dependency of FA_DUMP on CRASH_DUMP; 5, clean up kdump related config items; 6, wrap up crash codes in crash related ifdefs on all 8 arch-es which support crash dumping, except of ppc; Achievement: =========== With above changes, I can rearrange the config item logic as below (the right item depends on or is selected by the left item): PROC_KCORE -----------> VMCORE_INFO |----------> VMCORE_INFO FA_DUMP----| |----------> CRASH_RESERVE ---->VMCORE_INFO / |---->CRASH_RESERVE KEXEC --| /| |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE KEXEC_FILE --| \ | \---->CRASH_HOTPLUG KEXEC --| |--> KEXEC_CORE (for kexec reboot only) KEXEC_FILE --| Test ======== On all 8 architectures, including x86_64, arm64, s390x, sh, arm, mips, riscv, loongarch, I did below three cases of config item setting and building all passed. Take configs on x86_64 as exampmle here: (1) Both CONFIG_KEXEC and KEXEC_FILE is unset, then all kexec/kdump items are unset automatically: # Kexec and crash features # CONFIG_KEXEC is not set # CONFIG_KEXEC_FILE is not set # end of Kexec and crash features (2) set CONFIG_KEXEC_FILE and 'make olddefconfig': --------------- # Kexec and crash features CONFIG_CRASH_RESERVE=y CONFIG_VMCORE_INFO=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC_FILE=y CONFIG_CRASH_DUMP=y CONFIG_CRASH_HOTPLUG=y CONFIG_CRASH_MAX_MEMORY_RANGES=8192 # end of Kexec and crash features --------------- (3) unset CONFIG_CRASH_DUMP in case 2 and execute 'make olddefconfig': ------------------------ # Kexec and crash features CONFIG_KEXEC_CORE=y CONFIG_KEXEC_FILE=y # end of Kexec and crash features ------------------------ Note: For ppc, it needs investigation to make clear how to split out crash code in arch folder. Hope Hari and Pingfan can help have a look, see if it's doable. Now, I make it either have both kexec and crash enabled, or disable both of them altogether. This patch (of 14): Both kdump and fa_dump of ppc rely on crashkernel reservation. Move the relevant codes into separate files: crash_reserve.c, include/linux/crash_reserve.h. And also add config item CRASH_RESERVE to control its enabling of the codes. And update config items which has relationship with crashkernel reservation. And also change ifdeffery from CONFIG_CRASH_CORE to CONFIG_CRASH_RESERVE when those scopes are only crashkernel reservation related. And also rename arch/XXX/include/asm/{crash_core.h => crash_reserve.h} on arm64, x86 and risc-v because those architectures' crash_core.h is only related to crashkernel reservation. [akpm@linux-foundation.org: s/CRASH_RESEERVE/CRASH_RESERVE/, per Klara Modin] Link: https://lkml.kernel.org/r/20240124051254.67105-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20240124051254.67105-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: refactor vmalloc_dump_obj() functionUladzislau Rezki (Sony)1-16/+17
This patch tends to simplify the function in question, by removing an extra stack "objp" variable, returning back to an early exit approach if spin_trylock() fails or VA was not found. Link: https://lkml.kernel.org/r/20240124180920.50725-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: improve description of vmap node layerUladzislau Rezki (Sony)1-14/+46
This patch adds extra explanation of recently added vmap node layer based on community feedback. No functional change. Link: https://lkml.kernel.org/r/20240124180920.50725-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: add a shrinker to drain vmap poolsUladzislau Rezki (Sony)1-0/+39
The added shrinker is used to return back current cached VAs into a global vmap space, when a system enters into a low memory mode. Link: https://lkml.kernel.org/r/20240102184633.748113-12-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: set nr_nodes based on CPUs in a systemUladzislau Rezki (Sony)1-6/+23
A number of nodes which are used in the alloc/free paths is set based on num_possible_cpus() in a system. Please note a high limit threshold though is fixed and corresponds to 128 nodes. For 32-bit or single core systems an access to a global vmap heap is not balanced. Such small systems do not suffer from lock contentions due to low number of CPUs. In such case the nr_nodes is equal to 1. Test on AMD Ryzen Threadripper 3970X 32-Core Processor: sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 <default perf> 94.41% 0.89% [kernel] [k] _raw_spin_lock 93.35% 93.07% [kernel] [k] native_queued_spin_lock_slowpath 76.13% 0.28% [kernel] [k] __vmalloc_node_range 72.96% 0.81% [kernel] [k] alloc_vmap_area 56.94% 0.00% [kernel] [k] __get_vm_area_node 41.95% 0.00% [kernel] [k] vmalloc 37.15% 0.01% [test_vmalloc] [k] full_fit_alloc_test 35.17% 0.00% [kernel] [k] ret_from_fork_asm 35.17% 0.00% [kernel] [k] ret_from_fork 35.17% 0.00% [kernel] [k] kthread 35.08% 0.00% [test_vmalloc] [k] test_func 34.45% 0.00% [test_vmalloc] [k] fix_size_alloc_test 28.09% 0.01% [test_vmalloc] [k] long_busy_list_alloc_test 23.53% 0.25% [kernel] [k] vfree.part.0 21.72% 0.00% [kernel] [k] remove_vm_area 20.08% 0.21% [kernel] [k] find_unlink_vmap_area 2.34% 0.61% [kernel] [k] free_vmap_area_noflush <default perf> vs <patch-series perf> 82.32% 0.22% [test_vmalloc] [k] long_busy_list_alloc_test 63.36% 0.02% [kernel] [k] vmalloc 63.34% 2.64% [kernel] [k] __vmalloc_node_range 30.42% 4.46% [kernel] [k] vfree.part.0 28.98% 2.51% [kernel] [k] __alloc_pages_bulk 27.28% 0.19% [kernel] [k] __get_vm_area_node 26.13% 1.50% [kernel] [k] alloc_vmap_area 21.72% 21.67% [kernel] [k] clear_page_rep 19.51% 2.43% [kernel] [k] _raw_spin_lock 16.61% 16.51% [kernel] [k] native_queued_spin_lock_slowpath 13.40% 2.07% [kernel] [k] free_unref_page 10.62% 0.01% [kernel] [k] remove_vm_area 9.02% 8.73% [kernel] [k] insert_vmap_area 8.94% 0.00% [kernel] [k] ret_from_fork_asm 8.94% 0.00% [kernel] [k] ret_from_fork 8.94% 0.00% [kernel] [k] kthread 8.29% 0.00% [test_vmalloc] [k] test_func 7.81% 0.05% [test_vmalloc] [k] full_fit_alloc_test 5.30% 4.73% [kernel] [k] purge_vmap_node 4.47% 2.65% [kernel] [k] free_vmap_area_noflush <patch-series perf> confirms that a native_queued_spin_lock_slowpath goes down to 16.51% percent from 93.07%. The throughput is ~12x higher: urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 Run the test with following parameters: run_test_mask=7 nr_threads=64 Done. Check the kernel ring buffer to see the summary. real 10m51.271s user 0m0.013s sys 0m0.187s urezki@pc638:~$ urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 Run the test with following parameters: run_test_mask=7 nr_threads=64 Done. Check the kernel ring buffer to see the summary. real 0m51.301s user 0m0.015s sys 0m0.040s urezki@pc638:~$ Link: https://lkml.kernel.org/r/20240102184633.748113-11-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: support multiple nodes in vmallocinfoUladzislau Rezki (Sony)1-73/+47
Allocated areas are spread among nodes, it implies that the scanning has to be performed individually of each node in order to dump all existing VAs. Link: https://lkml.kernel.org/r/20240102184633.748113-10-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: support multiple nodes in vread_iterUladzislau Rezki (Sony)1-14/+53
Extend the vread_iter() to be able to perform a sequential reading of VAs which are spread among multiple nodes. So a data read over the /dev/kmem correctly reflects a vmalloc memory layout. Link: https://lkml.kernel.org/r/20240102184633.748113-9-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: add a scan area of VA only onceUladzislau Rezki (Sony)1-6/+6
Invoke a kmemleak_scan_area() function only for newly allocated objects to add a scan area within that object. There is no reason to add a same scan area(pointer to beginning or inside the object) several times. If a VA is obtained from the cache its scan area has already been associated. Link: https://lkml.kernel.org/r/20240202190628.47806-1-urezki@gmail.com Fixes: 7db166b4aa0d ("mm: vmalloc: offload free_vmap_area_lock lock") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: offload free_vmap_area_lock lockUladzislau Rezki (Sony)1-45/+342
Concurrent access to a global vmap space is a bottle-neck. We can simulate a high contention by running a vmalloc test suite. To address it, introduce an effective vmap node logic. Each node behaves as independent entity. When a node is accessed it serves a request directly(if possible) from its pool. This model has a size based pool for requests, i.e. pools are serialized and populated based on object size and real demand. A maximum object size that pool can handle is set to 256 pages. This technique reduces a pressure on the global vmap lock. Link: https://lkml.kernel.org/r/20240102184633.748113-8-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: remove global purge_vmap_area_root rb-treeUladzislau Rezki (Sony)1-53/+82
Similar to busy VA, lazily-freed area is stored to a node it belongs to. Such approach does not require any global locking primitive, instead an access becomes scalable what mitigates a contention. This patch removes a global purge-lock, global purge-tree and global purge list. Link: https://lkml.kernel.org/r/20240102184633.748113-7-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm/vmalloc: remove vmap_area_listBaoquan He8-15/+5
Earlier, vmap_area_list is exported to vmcoreinfo so that makedumpfile get the base address of vmalloc area. Now, vmap_area_list is empty, so export VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list. [urezki@gmail.com: fix a warning in the crash_save_vmcoreinfo_init()] Link: https://lkml.kernel.org/r/20240111192329.449189-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20240102184633.748113-6-urezki@gmail.com Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: remove global vmap_area_root rb-treeUladzislau Rezki (Sony)1-68/+174
Store allocated objects in a separate nodes. A va->va_start address is converted into a correct node where it should be placed and resided. An addr_to_node() function is used to do a proper address conversion to determine a node that contains a VA. Such approach balances VAs across nodes as a result an access becomes scalable. Number of nodes in a system depends on number of CPUs. Please note: 1. As of now allocated VAs are bound to a node-0. It means the patch does not give any difference comparing with a current behavior; 2. The global vmap_area_lock, vmap_area_root are removed as there is no need in it anymore. The vmap_area_list is still kept and is _empty_. It is exported for a kexec only; 3. The vmallocinfo and vread() have to be reworked to be able to handle multiple nodes. [urezki@gmail.com: mark vmap_init_free_space() with __init tag] Link: https://lkml.kernel.org/r/20240111132628.299644-1-urezki@gmail.com [urezki@gmail.com: fix a wrong value passed to __find_vmap_area()] Link: https://lkml.kernel.org/r/20240111121104.180993-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20240102184633.748113-5-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: move vmap_init_free_space() down in vmalloc.cUladzislau Rezki (Sony)1-41/+41
A vmap_init_free_space() is a function that setups a vmap space and is considered as part of initialization phase. Since a main entry which is vmalloc_init(), has been moved down in vmalloc.c it makes sense to follow the pattern. There is no a functional change as a result of this patch. Link: https://lkml.kernel.org/r/20240102184633.748113-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: rename adjust_va_to_fit_type() functionUladzislau Rezki (Sony)1-7/+6
This patch renames the adjust_va_to_fit_type() function to va_clip() which is shorter and more expressive. There is no a functional change as a result of this patch. Link: https://lkml.kernel.org/r/20240102184633.748113-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm: vmalloc: add va_alloc() helperUladzislau Rezki (Sony)1-13/+28
Patch series "Mitigate a vmap lock contention", v3. 1. Motivation - Offload global vmap locks making it scaled to number of CPUS; - If possible and there is an agreement, we can remove the "Per cpu kva allocator" to make the vmap code to be more simple; - There were complaints from XFS folk that a vmalloc might be contented on their workloads. 2. Design(high level overview) We introduce an effective vmap node logic. A node behaves as independent entity to serve an allocation request directly(if possible) from its pool. That way it bypasses a global vmap space that is protected by its own lock. An access to pools are serialized by CPUs. Number of nodes are equal to number of CPUs in a system. Please note the high threshold is bound to 128 nodes. Pools are size segregated and populated based on system demand. The maximum alloc request that can be stored into a segregated storage is 256 pages. The lazily drain path decays a pool by 25% as a first step and as second populates it by fresh freed VAs for reuse instead of returning them into a global space. When a VA is obtained(alloc path), it is stored in separate nodes. A va->va_start address is converted into a correct node where it should be placed and resided. Doing so we balance VAs across the nodes as a result an access becomes scalable. The addr_to_node() function does a proper address conversion to a correct node. A vmap space is divided on segments with fixed size, it is 16 pages. That way any address can be associated with a segment number. Number of segments are equal to num_possible_cpus() but not grater then 128. The numeration starts from 0. See below how it is converted: static inline unsigned int addr_to_node_id(unsigned long addr) { return (addr / zone_size) % nr_nodes; } On a free path, a VA can be easily found by converting its "va_start" address to a certain node it resides. It is moved from "busy" data to "lazy" data structure. Later on, as noted earlier, the lazy kworker decays each node pool and populates it by fresh incoming VAs. Please note, a VA is returned to a node that did an alloc request. 3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 <default perf> 94.41% 0.89% [kernel] [k] _raw_spin_lock 93.35% 93.07% [kernel] [k] native_queued_spin_lock_slowpath 76.13% 0.28% [kernel] [k] __vmalloc_node_range 72.96% 0.81% [kernel] [k] alloc_vmap_area 56.94% 0.00% [kernel] [k] __get_vm_area_node 41.95% 0.00% [kernel] [k] vmalloc 37.15% 0.01% [test_vmalloc] [k] full_fit_alloc_test 35.17% 0.00% [kernel] [k] ret_from_fork_asm 35.17% 0.00% [kernel] [k] ret_from_fork 35.17% 0.00% [kernel] [k] kthread 35.08% 0.00% [test_vmalloc] [k] test_func 34.45% 0.00% [test_vmalloc] [k] fix_size_alloc_test 28.09% 0.01% [test_vmalloc] [k] long_busy_list_alloc_test 23.53% 0.25% [kernel] [k] vfree.part.0 21.72% 0.00% [kernel] [k] remove_vm_area 20.08% 0.21% [kernel] [k] find_unlink_vmap_area 2.34% 0.61% [kernel] [k] free_vmap_area_noflush <default perf> vs <patch-series perf> 82.32% 0.22% [test_vmalloc] [k] long_busy_list_alloc_test 63.36% 0.02% [kernel] [k] vmalloc 63.34% 2.64% [kernel] [k] __vmalloc_node_range 30.42% 4.46% [kernel] [k] vfree.part.0 28.98% 2.51% [kernel] [k] __alloc_pages_bulk 27.28% 0.19% [kernel] [k] __get_vm_area_node 26.13% 1.50% [kernel] [k] alloc_vmap_area 21.72% 21.67% [kernel] [k] clear_page_rep 19.51% 2.43% [kernel] [k] _raw_spin_lock 16.61% 16.51% [kernel] [k] native_queued_spin_lock_slowpath 13.40% 2.07% [kernel] [k] free_unref_page 10.62% 0.01% [kernel] [k] remove_vm_area 9.02% 8.73% [kernel] [k] insert_vmap_area 8.94% 0.00% [kernel] [k] ret_from_fork_asm 8.94% 0.00% [kernel] [k] ret_from_fork 8.94% 0.00% [kernel] [k] kthread 8.29% 0.00% [test_vmalloc] [k] test_func 7.81% 0.05% [test_vmalloc] [k] full_fit_alloc_test 5.30% 4.73% [kernel] [k] purge_vmap_node 4.47% 2.65% [kernel] [k] free_vmap_area_noflush <patch-series perf> confirms that a native_queued_spin_lock_slowpath goes down to 16.51% percent from 93.07%. The throughput is ~12x higher: urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 Run the test with following parameters: run_test_mask=7 nr_threads=64 Done. Check the kernel ring buffer to see the summary. real 10m51.271s user 0m0.013s sys 0m0.187s urezki@pc638:~$ urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 Run the test with following parameters: run_test_mask=7 nr_threads=64 Done. Check the kernel ring buffer to see the summary. real 0m51.301s user 0m0.015s sys 0m0.040s urezki@pc638:~$ This patch (of 11): Currently __alloc_vmap_area() function contains an open codded logic that finds and adjusts a VA based on allocation request. Introduce a va_alloc() helper that adjusts found VA only. There is no a functional change as a result of this patch. Link: https://lkml.kernel.org/r/20240102184633.748113-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20240102184633.748113-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm,page_owner: update Documentation regarding page_owner_stacksOscar Salvador1-0/+45
Update page_owner documentation including the new page_owner_stacks feature to show how it can be used. Link: https://lkml.kernel.org/r/20240215215907.20121-8-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-24mm,page_owner: filter out stacks by a thresholdOscar Salvador1-1/+22
We want to be able to filter out the stacks based on a threshold we can can tune. By writing to 'count_threshold' file, we can adjust the threshold value. Link: https://lkml.kernel.org/r/20240215215907.20121-7-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>