path: root/arch/arm64/mm
Age | Commit message | Author | Files | Lines
2024-05-19 | Merge tag 'mm-stable-2024-05-17-19-19' of ↵ | Linus Torvalds | 4 | -51/+97
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". 
- Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". 
- DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-18 | Merge tag 'iommu-updates-v6.10' of ↵ | Linus Torvalds | 1 | -12/+1
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: "Core: - IOMMU memory usage observability - This will make the memory used for IO page tables explicitly visible. - Simplify arch_setup_dma_ops() Intel VT-d: - Consolidate domain cache invalidation - Remove private data from page fault message - Allocate DMAR fault interrupts locally - Cleanup and refactoring ARM-SMMUv2: - Support for fault debugging hardware on Qualcomm implementations - Re-land support for the ->domain_alloc_paging() callback ARM-SMMUv3: - Improve handling of MSI allocation failure - Drop support for the "disable_bypass" cmdline option - Major rework of the CD creation code, following on directly from the STE rework merged last time around. - Add unit tests for the new STE/CD manipulation logic AMD-Vi: - Final part of SVA changes with generic IO page fault handling Renesas IPMMU: - Add support for R8A779H0 hardware ... and a couple smaller fixes and updates across the sub-tree" * tag 'iommu-updates-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (80 commits) iommu/arm-smmu-v3: Make the kunit into a module arm64: Properly clean up iommu-dma remnants iommu/amd: Enable Guest Translation after reading IOMMU feature register iommu/vt-d: Decouple igfx_off from graphic identity mapping iommu/amd: Fix compilation error iommu/arm-smmu-v3: Add unit tests for arm_smmu_write_entry iommu/arm-smmu-v3: Build the whole CD in arm_smmu_make_s1_cd() iommu/arm-smmu-v3: Move the CD generation for SVA into a function iommu/arm-smmu-v3: Allocate the CD table entry in advance iommu/arm-smmu-v3: Make arm_smmu_alloc_cd_ptr() iommu/arm-smmu-v3: Consolidate clearing a CD table entry iommu/arm-smmu-v3: Move the CD generation for S1 domains into a function iommu/arm-smmu-v3: Make CD programming use arm_smmu_write_entry() iommu/arm-smmu-v3: Add an ops indirection to the STE code iommu/arm-smmu-qcom: Don't build debug features as a kernel module iommu/amd: Add SVA domain support iommu: Add ops->domain_alloc_sva() iommu/amd: Initial SVA support for AMD IOMMU iommu/amd: Add support for enable/disable IOPF iommu/amd: Add IO page fault notifier handler ...
2024-05-16 | Merge tag 'modules-6.10-rc1' of ↵ | Linus Torvalds | 1 | -0/+140
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull modules updates from Luis Chamberlain:
 "Finally something fun. Mike Rapoport does some cleanup to allow us to
  move module_alloc() out of modules into a new paint-shedded
  execmem_alloc() and execmem_free(), to emphasise that these helpers
  are actually used outside of modules.

  It starts with non-functional changes (API rename / placeholders) to
  then allow architectures to define their requirements into a new
  shiny struct execmem_info with ranges, and requirements for those
  ranges. Archs can now initialize this execmem_info as the last part
  of mm_core_init() if they have to diverge from the norm. Each range
  is a known type clearly articulated and spelled out in enum
  execmem_type.

  Although a lot of this is major cleanup and prep work for future
  enhancements, an immediate clear gain is that we get to enable
  KPROBES without MODULES now. That is ultimately what motivated
  picking this work up again, now with a smaller goal as a concrete
  stepping stone"

* tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of
  kprobes: remove dependency on CONFIG_MODULES
  powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate
  x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  arch: make execmem setup available regardless of CONFIG_MODULES
  powerpc: extend execmem_params for kprobes allocations
  arm64: extend execmem_info for generated code allocations
  riscv: extend execmem_params for generated code allocations
  mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  mm/execmem, arch: convert simple overrides of module_alloc to execmem
  mm: introduce execmem_alloc() and execmem_free()
  module: make module_memory_{alloc,free} more self-contained
  sparc: simplify module_alloc()
  nios2: define virtual address space for modules
  mips: module: rename MODULE_START to MODULES_VADDR
  arm64: module: remove unneeded call to kasan_alloc_module_shadow()
  kallsyms: replace deprecated strncpy with strscpy
  module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree
2024-05-14 | Merge tag 'arm64-upstream' of ↵ | Linus Torvalds | 2 | -54/+57
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "The most interesting parts are probably the mm changes from Ryan which optimise the creation of the linear mapping at boot and (separately) implement write-protect support for userfaultfd. Outside of our usual directories, the Kbuild-related changes under scripts/ have been acked by Masahiro whilst the drivers/acpi/ parts have been acked by Rafael and the addition of cpumask_any_and_but() has been acked by Yury. ACPI: - Support for the Firmware ACPI Control Structure (FACS) signature feature which is used to reboot out of hibernation on some systems Kbuild: - Support for building Flat Image Tree (FIT) images, where the kernel Image is compressed alongside a set of devicetree blobs Memory management: - Optimisation of our early page-table manipulation for creation of the linear mapping - Support for userfaultfd write protection, which brings along some nice cleanups to our handling of invalid but present ptes - Extend our use of range TLBI invalidation at EL1 Perf and PMUs: - Ensure that the 'pmu->parent' pointer is correctly initialised by PMU drivers - Avoid allocating 'cpumask_t' types on the stack in some PMU drivers - Fix parsing of the CPU PMU "version" field in assembly code, as it doesn't follow the usual architectural rules - Add best-effort unwinding support for USER_STACKTRACE - Minor driver fixes and cleanups Selftests: - Minor cleanups to the arm64 selftests (missing NULL check, unused variable) Miscellaneous: - Add a command-line alias for disabling 32-bit application support - Add part number for Neoverse-V2 CPUs - Minor fixes and cleanups" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits) arm64/mm: Fix pud_user_accessible_page() for PGTABLE_LEVELS <= 2 arm64/mm: Add uffd write-protect support arm64/mm: Move PTE_PRESENT_INVALID to overlay PTE_NG arm64/mm: Remove PTE_PROT_NONE bit arm64/mm: generalize PMD_PRESENT_INVALID for all levels arm64: simplify arch_static_branch/_jump function arm64: Add USER_STACKTRACE support arm64: Add the arm64.no32bit_el0 command line option drivers/perf: hisi: hns3: Actually use devm_add_action_or_reset() drivers/perf: hisi: hns3: Fix out-of-bound access when valid event group drivers/perf: hisi_pcie: Fix out-of-bound access when valid event group kselftest: arm64: Add a null pointer check arm64: defer clearing DAIF.D arm64: assembler: update stale comment for disable_step_tsk arm64/sysreg: Update PIE permission encodings kselftest/arm64: Remove unused parameters in abi test perf/arm-spe: Assign parents for event_source device perf/arm-smmuv3: Assign parents for event_source device perf/arm-dsu: Assign parents for event_source device perf/arm-dmc620: Assign parents for event_source device ...
2024-05-14 | arch: make execmem setup available regardless of CONFIG_MODULES | Mike Rapoport (IBM) | 1 | -0/+140
execmem does not depend on modules; on the contrary, modules use
execmem. To make execmem available when CONFIG_MODULES=n, for instance
for kprobes, split execmem_params initialization out from
arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
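For reference, the per-arch hook this introduces has roughly the shape
below. This is a simplified, illustrative sketch of the
execmem_arch_setup() pattern; the range boundaries and pgprot choices
shown here are assumptions, not the exact arm64 hunk.

    /* Sketch only: an arch describing its executable-memory ranges to the
     * core execmem allocator. Field names follow struct execmem_info /
     * struct execmem_range from the series; values are illustrative. */
    static struct execmem_info execmem_info __ro_after_init;

    struct execmem_info __init *execmem_arch_setup(void)
    {
            execmem_info = (struct execmem_info){
                    .ranges = {
                            [EXECMEM_DEFAULT] = {   /* module text and friends */
                                    .start          = MODULES_VADDR,
                                    .end            = MODULES_END,
                                    .pgprot         = PAGE_KERNEL,
                                    .alignment      = 1,
                            },
                            [EXECMEM_KPROBES] = {   /* generated code, usable without modules */
                                    .start          = VMALLOC_START,
                                    .end            = VMALLOC_END,
                                    .pgprot         = PAGE_KERNEL_ROX,
                                    .alignment      = 1,
                            },
                    },
            };
            return &execmem_info;
    }

The point of the split is that this setup now lives in arch mm code and
builds whenever CONFIG_EXECMEM=y, so kprobes can allocate generated code
even with CONFIG_MODULES=n.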
2024-05-13 | Merge branches 'arm/renesas', 'arm/smmu', 'x86/amd', 'core' and 'x86/vt-d' into next | Joerg Roedel | 1 | -12/+1
2024-05-10 | arm64: Properly clean up iommu-dma remnants | Robin Murphy | 1 | -8/+0
Thanks to the somewhat asymmetrical nature, while removing iommu_setup_dma_ops() from the arch_setup_dma_ops() flow, I managed to forget that arm64's teardown path was also specific to iommu-dma. Clean that up to match, otherwise probe deferral will lead to the arch code erroneously removing DMA ops set elsewhere. Reported-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Link: https://lore.kernel.org/linux-iommu/Zi_LV28TR-P-PzXi@eriador.lumag.spb.ru/ Fixes: b67483b3c44e ("iommu/dma: Centralise iommu_setup_dma_ops()") Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Acked-by: Will Deacon <will@kernel.org> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Link: https://lore.kernel.org/r/d4cc20cbb0c45175e98dd76bf187e2ad6421296d.1714472573.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-05-09 | Merge branch 'for-next/mm' into for-next/core | Will Deacon | 1 | -44/+57
* for-next/mm:
  arm64/mm: Fix pud_user_accessible_page() for PGTABLE_LEVELS <= 2
  arm64/mm: Add uffd write-protect support
  arm64/mm: Move PTE_PRESENT_INVALID to overlay PTE_NG
  arm64/mm: Remove PTE_PROT_NONE bit
  arm64/mm: generalize PMD_PRESENT_INVALID for all levels
  arm64: mm: Don't remap pgtables for allocate vs populate
  arm64: mm: Batch dsb and isb when populating pgtables
  arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
2024-05-06 | mm/arm64: override clear_young_dirty_ptes() batch helper | Lance Yang | 1 | -0/+29
The per-pte get_and_clear/modify/set approach would result in unfolding/refolding for contpte mappings on arm64. So we need to override clear_young_dirty_ptes() for arm64 to avoid it. Link: https://lkml.kernel.org/r/20240418134435.6092-3-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: Barry Song <21cnbao@gmail.com> Suggested-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
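The override follows the same pattern as the other arm64 contpte batch
helpers; roughly as below. This is a sketch based on the description
above, simplified from the actual arch/arm64 code, and the helper/type
names are taken from the surrounding series rather than quoted verbatim.

    /* Sketch: take the batched contpte path whenever the range spans more
     * than one pte or the pte is part of a contiguous (contpte) block, so
     * the block is cleared as a whole instead of being unfolded and
     * refolded for every individual pte. */
    static inline void clear_young_dirty_ptes(struct vm_area_struct *vma,
                                              unsigned long addr, pte_t *ptep,
                                              unsigned int nr, cydp_t flags)
    {
            if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
                    __clear_young_dirty_ptes(vma, addr, ptep, nr, flags);
            else
                    contpte_clear_young_dirty_ptes(vma, addr, ptep, nr, flags);
    }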
2024-05-06 | arm64: mm: drop VM_FAULT_BADMAP/VM_FAULT_BADACCESS | Kefeng Wang | 1 | -23/+20
Patch series "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS", v2. Directly set SEGV_MAPRR or SEGV_ACCERR for arm/arm64 to remove the last two arch's private vm_fault reasons. This patch (of 2): If bad map or access, directly set si_code to SEGV_MAPRR or SEGV_ACCERR, also set fault to 0 and goto error handling, which make us to drop the arch's special vm fault reason. Link: https://lkml.kernel.org/r/20240411130925.73281-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240411130925.73281-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Aishwarya TCV <aishwarya.tcv@arm.com> Cc: Cristian Marussi <cristian.marussi@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-28 | arm64: defer clearing DAIF.D | Mark Rutland | 1 | -10/+0
For historical reasons we unmask debug exceptions in __cpu_setup(), but it's not necessary to unmask debug exceptions this early in the boot/idle entry paths. It would be better to unmask debug exceptions later in C code as this simplifies the current code and will make it easier to rework exception masking logic to handle non-DAIF bits in future (e.g. PSTATE.{ALLINT,PM}). We started clearing DAIF.D in __cpu_setup() in commit: 2ce39ad15182604b ("arm64: debug: unmask PSTATE.D earlier") At the time, we needed to ensure that DAIF.D was clear on the primary CPU before scheduling and preemption were possible, and chose to do this in __cpu_setup() so that this occurred in the same place for primary and secondary CPUs. As we cannot handle debug exceptions this early, we placed an ISB between initializing MDSCR_EL1 and clearing DAIF.D so that no exceptions should be triggered. Subsequently we rewrote the return-from-{idle,suspend} paths to use __cpu_setup() in commit: cabe1c81ea5be983 ("arm64: Change cpu_resume() to enable mmu early then access sleep_sp by va") ... which allowed for earlier use of the MMU and had the desirable property of using the same code to reset the CPU in the cold and warm boot paths. This introduced a bug: DAIF.D was clear while cpu_do_resume() restored MDSCR_EL1 and other control registers (e.g. breakpoint/watchpoint control/value registers), and so we could unexpectedly take debug exceptions. We fixed that in commit: 744c6c37cc18705d ("arm64: kernel: Fix unmasked debug exceptions when restoring mdscr_el1") ... by having cpu_do_resume() use the `disable_dbg` macro to set DAIF.D before restoring MDSCR_EL1 and other control registers. This relies on DAIF.D being subsequently cleared again in cpu_resume(). Subsequently we reworked DAIF masking in commit: 0fbeb318754860b3 ("arm64: explicitly mask all exceptions") ... where we began enforcing a policy that DAIF.D being set implies all other DAIF bits are set, and so e.g. we cannot take an IRQ while DAIF.D is set. As part of this the use of `disable_dbg` in cpu_resume() was replaced with `disable_daif` for consistency with the rest of the kernel. These days, there's no need to clear DAIF.D early within __cpu_setup(): * setup_arch() clears DAIF.DA before scheduling and preemption are possible on the primary CPU, avoiding the problem we we originally trying to work around. Note: DAIF.IF get cleared later when interrupts are enabled for the first time. * secondary_start_kernel() clears all DAIF bits before scheduling and preemption are possible on secondary CPUs. Note: with pseudo-NMI, the PMR is initialized here before any DAIF bits are cleared. Similar will be necessary for the architectural NMI. * cpu_suspend() restores all DAIF bits when returning from idle, ensuring that we don't unexpectedly leave DAIF.D clear or set. Note: with pseudo-NMI, the PMR is initialized here before DAIF is cleared. Similar will be necessary for the architectural NMI. This patch removes the unmasking of debug exceptions from __cpu_setup(), relying on the above locations to initialize DAIF. This allows some other cleanups: * It is no longer necessary for cpu_resume() to explicitly mask debug (or other) exceptions, as it is always called with all DAIF bits set. Thus we drop the use of `disable_daif`. * The `enable_dbg` macro is no longer used, and so is dropped. 
* It is no longer necessary to have an ISB immediately after initializing MDSCR_EL1 in __cpu_setup(), and we can revert to relying on the context synchronization that occurs when the MMU is enabled between __cpu_setup() and code which clears DAIF.D Comments are added to setup_arch() and secondary_start_kernel() to explain the initial unmasking of the DAIF bits. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20240422113523.4070414-3-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-04-26 | dma-mapping: Simplify arch_setup_dma_ops() | Robin Murphy | 1 | -2/+1
The dma_base, size and iommu arguments are only used by ARM, and can now easily be deduced from the device itself, so there's no need to pass them through the callchain as well. Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Michael Kelley <mhklinux@outlook.com> # For Hyper-V Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/5291c2326eab405b1aa7693aa964e8d3cb7193de.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
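In terms of the arch interface, the simplification boils down to roughly
the following change of prototype. This is a sketch of the before/after
shape inferred from the description above, not a verbatim diff; the
exact type of the old iommu argument is an assumption.

    /* Before (sketch): range and IOMMU details threaded through every caller. */
    void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
                            const struct iommu_ops *iommu, bool coherent);

    /* After (sketch): only the device and its coherency; the rest is
     * deduced from the device itself. */
    void arch_setup_dma_ops(struct device *dev, bool coherent);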
2024-04-26 | iommu/dma: Centralise iommu_setup_dma_ops() | Robin Murphy | 1 | -2/+0
It's somewhat hard to see, but arm64's arch_setup_dma_ops() should only ever call iommu_setup_dma_ops() after a successful iommu_probe_device(), which means there should be no harm in achieving the same order of operations by running it off the back of iommu_probe_device() itself. This then puts it in line with the x86 and s390 .probe_finalize bodges, letting us pull it all into the main flow properly. As a bonus this lets us fold in and de-scope the PCI workaround setup as well. At this point we can also then pull the call up inside the group mutex, and avoid having to think about whether iommu_group_store_type() could theoretically race and free the domain if iommu_setup_dma_ops() ran just *before* iommu_device_use_default_domain() claims it... Furthermore we replace one .probe_finalize call completely, since the only remaining implementations are now one which only needs to run once for the initial boot-time probe, and two which themselves render that path unreachable. This leaves us a big step closer to realistically being able to unpick the variety of different things that iommu_setup_dma_ops() has been muddling together, and further streamline iommu-dma into core API flows in future. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> # For Intel IOMMU Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/bebea331c1d688b34d9862eefd5ede47503961b8.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-04-26 | arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS | Kefeng Wang | 1 | -1/+3
The vm_flags of the VMA are already checked under the per-VMA lock; if
it is a bad access, directly set fault to VM_FAULT_BADACCESS and handle
the error, with no need to retry with the mmap_lock again. The latency
of 'lat_sig -P 1 prot lat_sig' from the lmbench testcase is reduced by
34%. Since the page fault is handled under the per-VMA lock, count it
as a vma lock event with VMA_LOCK_SUCCESS.

Link: https://lkml.kernel.org/r/20240403083805.1818160-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
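The change amounts to completing the bad-access handling without leaving
the per-VMA lock path; roughly as in the sketch below (illustrative,
condensed from the description of this patch; the later 2024-05-06
commit above then replaces VM_FAULT_BADACCESS with a direct SEGV_ACCERR
si_code assignment).

    /* Sketch of the do_page_fault() hunk: on a permission mismatch found
     * under the per-VMA lock, fail the fault immediately instead of
     * falling back to the mmap_lock path and re-finding the VMA. */
    vma = lock_vma_under_rcu(mm, addr);
    if (!vma)
            goto lock_mmap;

    if (!(vma->vm_flags & vm_flags)) {
            vma_end_read(vma);
            fault = VM_FAULT_BADACCESS;
            count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
            goto done;
    }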
2024-04-26 | arm64: mm: cleanup __do_page_fault() | Kefeng Wang | 1 | -20/+7
Patch series "arch/mm/fault: accelerate pagefault when badaccess", v2. After VMA lock-based page fault handling enabled, if bad access met under per-vma lock, it will fallback to mmap_lock-based handling, so it leads to unnessary mmap lock and vma find again. A test from lmbench shows 34% improve after this changes on arm64, lat_sig -P 1 prot lat_sig 0.29194 -> 0.19198 This patch (of 7): The __do_page_fault() only calls handle_mm_fault() after vm_flags checked, and it is only called by do_page_fault(), let's squash it into do_page_fault() to cleanup code. Link: https://lkml.kernel.org/r/20240403083805.1818160-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240403083805.1818160-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26 | arm64: mm: swap: support THP_SWAP on hardware with MTE | Barry Song | 1 | -0/+45
Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with MTE as the MTE code works with the assumption tags save/restore is always handling a folio with only one page. The limitation should be removed as more and more ARM64 SoCs have this feature. Co-existence of MTE and THP_SWAP becomes more and more important. This patch makes MTE tags saving support large folios, then we don't need to split large folios into base pages for swapping out on ARM64 SoCs with MTE any more. arch_prepare_to_swap() should take folio rather than page as parameter because we support THP swap-out as a whole. It saves tags for all pages in a large folio. As now we are restoring tags based-on folio, in arch_swap_restore(), we may increase some extra loops and early-exitings while refaulting a large folio which is still in swapcache in do_swap_page(). In case a large folio has nr pages, do_swap_page() will only set the PTE of the particular page which is causing the page fault. Thus do_swap_page() runs nr times, and each time, arch_swap_restore() will loop nr times for those subpages in the folio. So right now the algorithmic complexity becomes O(nr^2). Once we support mapping large folios in do_swap_page(), extra loops and early-exitings will decrease while not being completely removed as a large folio might get partially tagged in corner cases such as, 1. a large folio in swapcache can be partially unmapped, thus, MTE tags for the unmapped pages will be invalidated; 2. users might use mprotect() to set MTEs on a part of a large folio. arch_thp_swp_supported() is dropped since ARM64 MTE was the only one who needed it. Link: https://lkml.kernel.org/r/20240322114136.61386-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Steven Price <steven.price@arm.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
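The core of the change is that tag saving now walks every page of the
folio; a simplified sketch of the arch_prepare_to_swap() behaviour
described above is shown below, with the error-unwind path trimmed (the
real code also invalidates tags already saved when a later page fails).

    /* Sketch: save MTE tags for each page of a (possibly large) folio
     * before it is swapped out, instead of refusing large folios. */
    int arch_prepare_to_swap(struct folio *folio)
    {
            long i, nr;
            int err;

            if (!system_supports_mte())
                    return 0;

            nr = folio_nr_pages(folio);
            for (i = 0; i < nr; i++) {
                    err = mte_save_tags(folio_page(folio, i));
                    if (err)
                            return err;     /* real code unwinds already-saved tags here */
            }
            return 0;
    }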
2024-04-26 | mm/treewide: remove pXd_huge() | Peter Xu | 1 | -10/+0
This API is not used anymore, drop it for the whole tree. Link: https://lkml.kernel.org/r/20240318200404.448346-13-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26 | mm/treewide: replace pXd_huge() with pXd_leaf() | Peter Xu | 1 | -2/+2
Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(), reuse it. Luckily, pXd_huge() isn't widely used. Link: https://lkml.kernel.org/r/20240318200404.448346-12-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26 | mm/arm64: merge pXd_huge() and pXd_leaf() definitions | Peter Xu | 1 | -6/+2
Unlike most archs, aarch64 defines pXd_huge() and pXd_leaf() slightly differently. Redefine the pXd_huge() with pXd_leaf(). There used to be two traps for old aarch64 definitions over these APIs that I found when reading the code around, they're: (1) 4797ec2dc83a ("arm64: fix pud_huge() for 2-level pagetables") (2) 23bc8f69f0ec ("arm64: mm: fix p?d_leaf()") Define pXd_huge() with the current pXd_leaf() will make sure (2) isn't a problem (on PROT_NONE checks). To make sure it also works for (1), we move over the __PAGETABLE_PMD_FOLDED check to pud_leaf(), allowing it to constantly returning "false" for 2-level pgtables, which looks even safer to cover both now. Link: https://lkml.kernel.org/r/20240318200404.448346-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Mark Salter <msalter@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
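Conceptually the arm64 change reduces to something like the sketch
below (an illustration of the idea, not the literal hunks):

    /* Sketch: the "huge" predicates simply reuse the "leaf" predicates... */
    int pmd_huge(pmd_t pmd)
    {
            return pmd_leaf(pmd);
    }

    int pud_huge(pud_t pud)
    {
            return pud_leaf(pud);
    }

    /* ...while pud_leaf() itself now carries the __PAGETABLE_PMD_FOLDED
     * check, so 2-level configurations constantly report false and both
     * of the historical traps (2-level pud_huge() and PROT_NONE leaf
     * checks) are covered by one definition. */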
2024-04-19 | arm64: hibernate: Fix level3 translation fault in swsusp_save() | Yaxiong Tian | 1 | -3/+0
On arm64 machines, swsusp_save() faults if it attempts to access MEMBLOCK_NOMAP memory ranges. This can be reproduced in QEMU using UEFI when booting with rodata=off debug_pagealloc=off and CONFIG_KFENCE=n: Unable to handle kernel paging request at virtual address ffffff8000000000 Mem abort info: ESR = 0x0000000096000007 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x07: level 3 translation fault Data abort info: ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000eeb0b000 [ffffff8000000000] pgd=180000217fff9803, p4d=180000217fff9803, pud=180000217fff9803, pmd=180000217fff8803, pte=0000000000000000 Internal error: Oops: 0000000096000007 [#1] SMP Internal error: Oops: 0000000096000007 [#1] SMP Modules linked in: xt_multiport ipt_REJECT nf_reject_ipv4 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter bpfilter rfkill at803x snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg dwmac_generic stmmac_platform snd_hda_codec stmmac joydev pcs_xpcs snd_hda_core phylink ppdev lp parport ramoops reed_solomon ip_tables x_tables nls_iso8859_1 vfat multipath linear amdgpu amdxcp drm_exec gpu_sched drm_buddy hid_generic usbhid hid radeon video drm_suballoc_helper drm_ttm_helper ttm i2c_algo_bit drm_display_helper cec drm_kms_helper drm CPU: 0 PID: 3663 Comm: systemd-sleep Not tainted 6.6.2+ #76 Source Version: 4e22ed63a0a48e7a7cff9b98b7806d8d4add7dc0 Hardware name: Greatwall GW-XXXXXX-XXX/GW-XXXXXX-XXX, BIOS KunLun BIOS V4.0 01/19/2021 pstate: 600003c5 (nZCv DAIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : swsusp_save+0x280/0x538 lr : swsusp_save+0x280/0x538 sp : ffffffa034a3fa40 x29: ffffffa034a3fa40 x28: ffffff8000001000 x27: 0000000000000000 x26: ffffff8001400000 x25: ffffffc08113e248 x24: 0000000000000000 x23: 0000000000080000 x22: ffffffc08113e280 x21: 00000000000c69f2 x20: ffffff8000000000 x19: ffffffc081ae2500 x18: 0000000000000000 x17: 6666662074736420 x16: 3030303030303030 x15: 3038666666666666 x14: 0000000000000b69 x13: ffffff9f89088530 x12: 00000000ffffffea x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffffffc08193f0d0 x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 0000000000000001 x5 : ffffffa0fff09dc8 x4 : 0000000000000000 x3 : 0000000000000027 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000004e Call trace: swsusp_save+0x280/0x538 swsusp_arch_suspend+0x148/0x190 hibernation_snapshot+0x240/0x39c hibernate+0xc4/0x378 state_store+0xf0/0x10c kobj_attr_store+0x14/0x24 The reason is swsusp_save() -> copy_data_pages() -> page_is_saveable() -> kernel_page_present() assuming that a page is always present when can_set_direct_map() is false (all of rodata_full, debug_pagealloc_enabled() and arm64_kfence_can_set_direct_map() false), irrespective of the MEMBLOCK_NOMAP ranges. Such MEMBLOCK_NOMAP regions should not be saved during hibernation. This problem was introduced by changes to the pfn_valid() logic in commit a7d9f306ba70 ("arm64: drop pfn_valid_within() and simplify pfn_valid()"). Similar to other architectures, drop the !can_set_direct_map() check in kernel_page_present() so that page_is_savable() skips such pages. 
Fixes: a7d9f306ba70 ("arm64: drop pfn_valid_within() and simplify pfn_valid()") Cc: <stable@vger.kernel.org> # 5.14.x Suggested-by: Mike Rapoport <rppt@kernel.org> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Co-developed-by: xiongxin <xiongxin@kylinos.cn> Signed-off-by: xiongxin <xiongxin@kylinos.cn> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Link: https://lore.kernel.org/r/20240417025248.386622-1-tianyaxiong@kylinos.cn [catalin.marinas@arm.com: rework commit message] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-04-15 | arm64/hugetlb: Fix page table walk in huge_pte_alloc() | Anshuman Khandual | 1 | -1/+4
Currently normal HugeTLB fault ends up crashing the kernel, as p4dp derived from p4d_offset() is an invalid address when PGTABLE_LEVEL = 5. A p4d level entry needs to be allocated when not available while walking the page table during HugeTLB faults. Let's call p4d_alloc() to allocate such entries when required instead of current p4d_offset(). Unable to handle kernel paging request at virtual address ffffffff80000000 Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 swapper pgtable: 4k pages, 52-bit VAs, pgdp=0000000081da9000 [ffffffff80000000] pgd=1000000082cec003, p4d=0000000082c32003, pud=0000000000000000 Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 108 Comm: high_addr_hugep Not tainted 6.9.0-rc4 #48 Hardware name: Foundation-v8A (DT) pstate: 01402005 (nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : huge_pte_alloc+0xd4/0x334 lr : hugetlb_fault+0x1b8/0xc68 sp : ffff8000833bbc20 x29: ffff8000833bbc20 x28: fff000080080cb58 x27: ffff800082a7cc58 x26: 0000000000000000 x25: fff0000800378e40 x24: fff00008008d6c60 x23: 00000000de9dbf07 x22: fff0000800378e40 x21: 0004000000000000 x20: 0004000000000000 x19: ffffffff80000000 x18: 1ffe00010011d7a1 x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000001 x14: 0000000000000000 x13: ffff8000816120d0 x12: ffffffffffffffff x11: 0000000000000000 x10: fff00008008ebd0c x9 : 0004000000000000 x8 : 0000000000001255 x7 : fff00008003e2000 x6 : 00000000061d54b0 x5 : 0000000000001000 x4 : ffffffff80000000 x3 : 0000000000200000 x2 : 0000000000000004 x1 : 0000000080000000 x0 : 0000000000000000 Call trace: huge_pte_alloc+0xd4/0x334 hugetlb_fault+0x1b8/0xc68 handle_mm_fault+0x260/0x29c do_page_fault+0xfc/0x47c do_translation_fault+0x68/0x74 do_mem_abort+0x44/0x94 el0_da+0x2c/0x9c el0t_64_sync_handler+0x70/0xc4 el0t_64_sync+0x190/0x194 Code: aa000084 cb010084 b24c2c84 8b130c93 (f9400260) ---[ end trace 0000000000000000 ]--- Cc: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Fixes: a6bbf5d4d9d1 ("arm64: mm: Add definitions to support 5 levels of paging") Reported-by: Dev Jain <dev.jain@arm.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/r/20240415094003.1812018-1-anshuman.khandual@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
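The fix itself is small; the relevant part of huge_pte_alloc() becomes
roughly the following (illustrative sketch of the change described
above):

    /* Sketch: allocate, rather than merely locate, the p4d entry. With
     * 5 levels of paging the p4d may not exist yet for a new mapping,
     * and p4d_offset() performs no allocation. */
    pgd_t *pgdp = pgd_offset(mm, addr);
    p4d_t *p4dp = p4d_alloc(mm, pgdp, addr);
    if (!p4dp)
            return NULL;
    pud_t *pudp = pud_alloc(mm, p4dp, addr);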
2024-04-12 | arm64: mm: Don't remap pgtables for allocate vs populate | Ryan Roberts | 1 | -32/+35
During linear map pgtable creation, each pgtable is fixmapped /
fixunmapped twice; once during allocation to zero the memory, and again
during population to write the entries. This means each table has 2 TLB
invalidations issued against it. Let's fix this so that each table is
only fixmapped/fixunmapped once, halving the number of TLBIs, and
improving performance.

Achieve this by separating allocation and initialization (zeroing) of
the page. The allocated page is now fixmapped directly by the walker
and initialized, before being populated and finally fixunmapped.

This approach keeps the change small, but has the side effect that late
allocations (using __get_free_page()) must also go through the generic
memory clearing routine. So let's tell __get_free_page() not to zero
the memory to avoid duplication.

Additionally this approach means that fixmap/fixunmap is still used for
late pgtable modifications. That's not technically needed since the
memory is all mapped in the linear map by that point. That's left as a
possible future optimization if found to be needed.

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               | ms (%)      | ms (%)      | ms (%)      | ms (%)
---------------|-------------|-------------|-------------|-------------
before         | 11 (0%)     | 161 (0%)    | 656 (0%)    | 1654 (0%)
after          | 10 (-11%)   | 104 (-35%)  | 438 (-33%)  | 1223 (-26%)

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240412131908.433043-4-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-04-12 | arm64: mm: Batch dsb and isb when populating pgtables | Ryan Roberts | 1 | -1/+10
After removing unnecessary TLBIs, the next bottleneck when creating the
page tables for the linear map is DSB and ISB, which were previously
issued per-pte in __set_pte(). Since we are writing multiple ptes in a
given pte table, we can elide these barriers and insert them once we
have finished writing to the table.

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               | ms (%)      | ms (%)      | ms (%)      | ms (%)
---------------|-------------|-------------|-------------|-------------
before         | 78 (0%)     | 435 (0%)    | 1723 (0%)   | 3779 (0%)
after          | 11 (-86%)   | 161 (-63%)  | 656 (-62%)  | 1654 (-56%)

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240412131908.433043-3-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
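The resulting pattern in the pte-populating loop is roughly the
following sketch; the non-syncing setter is named here purely for
illustration and other details are simplified.

    /* Sketch: write a whole run of ptes with no per-entry barriers, then
     * publish the table once with a single DSB/ISB pair. */
    do {
            __set_pte_nosync(ptep, pfn_pte(pfn, prot));  /* plain store, no dsb/isb */
            pfn++;
    } while (ptep++, addr += PAGE_SIZE, addr != end);

    dsb(ishst);     /* make the new entries visible to the table walker */
    isb();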
2024-04-12 | arm64: mm: Don't remap pgtables per-cont(pte|pmd) block | Ryan Roberts | 1 | -13/+14
A large part of the kernel boot time is creating the kernel linear map
page tables. When rodata=full, all memory is mapped by pte. And when
there is lots of physical ram, there are lots of pte tables to populate.

The primary cost associated with this is mapping and unmapping the pte
table memory in the fixmap; at unmap time, the TLB entry must be
invalidated and this is expensive.

Previously, each pmd and pte table was fixmapped/fixunmapped for each
cont(pte|pmd) block of mappings (16 entries with 4K granule). This
means we ended up issuing 32 TLBIs per (pmd|pte) table during the
population phase.

Let's fix that, and fixmap/fixunmap each page once per population, for
a saving of 31 TLBIs per (pmd|pte) table. This gives a significant boot
speedup.

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               | ms (%)      | ms (%)      | ms (%)      | ms (%)
---------------|-------------|-------------|-------------|-------------
before         | 168 (0%)    | 2198 (0%)   | 8644 (0%)   | 17447 (0%)
after          | 78 (-53%)   | 435 (-80%)  | 1723 (-80%) | 3779 (-78%)

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240412131908.433043-2-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-03-15 | Merge tag 'mm-stable-2024-03-13-20-04' of ↵ | Linus Torvalds | 11 | -58/+463
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. 
- In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. 
* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ...
2024-03-13 | Revert "arm64: mm: add support for WXN memory translation attribute" | Catalin Marinas | 1 | -6/+0
This reverts commit 50e3ed0f93f4f62ed2aa83de5db6cb84ecdd5707. The SCTLR_EL1.WXN control forces execute-never when a page has write permissions. While the idea of hardening such write/exec combinations is good, with permissions indirection enabled (FEAT_PIE) this control becomes RES0. FEAT_PIE introduces a slightly different form of WXN which only has an effect when the base permission is RWX and the write is toggled by the permission overlay (FEAT_POE, not yet supported by the arm64 kernel). Revert the patch for now. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/ZfGESD3a91lxH367@arm.com
2024-03-05arm64/mm: improve comment in contpte_ptep_get_lockless()Ryan Roberts1-10/+14
Make clear the atomicity/consistency requirements of the API and how we achieve them. Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/ Link: https://lkml.kernel.org/r/20240226120321.1055731-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05arm64/mm: export contpte symbols only to GPL usersRyan Roberts1-11/+11
Patch series "Address some contpte nits". These 2 patches address some nits raised by Catalin late in the review cycle for my contpte series [1]. [1] https://lore.kernel.org/linux-mm/20240215103205.2607016-1-ryan.roberts@arm.com/ This patch (of 2): The contpte symbols must be exported since some of the public inline ptep_* APIs are called from modules and these inlines now call the contpte functions. Originally they were exported as EXPORT_SYMBOL() for fear of breaking out-of-tree modules. But we subsequently concluded that EXPORT_SYMBOL_GPL() should be safe since these functions are deeply core mm routines, and any module operating at this level is not going to be able to survive on EXPORT_SYMBOL alone. Link: https://lkml.kernel.org/r/20240226120321.1055731-1-ryan.roberts@arm.com Link: https://lore.kernel.org/linux-mm/f9fc2b31-11cb-4969-8961-9c89fea41b74@nvidia.com/ Link: https://lkml.kernel.org/r/20240226120321.1055731-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-01arm64/mm: Avoid ID mapping of kpti flag if it is no longer neededArd Biesheuvel1-1/+1
arm64_use_ng_mappings will be set to 'true' by the early boot code if it decides to use non-global (nG) attributes for all kernel mappings, typically when enabling KASLR on a system that does not implement E0PD. In this case, the G-to-nG update routines are never called, and so there is no reason to create the writable mapping of the associated status flag in the ID map. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240301104046.1234309-6-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-24arm64, crash: wrap crash dumping code into crash related ifdefsBaoquan He1-1/+1
Now that the crash code under the kernel/ folder has been split out from the kexec code, crash dumping can be separated from kexec reboot in the config items on arm64 with some adjustments. Wrap the crash dumping code in CONFIG_CRASH_DUMP ifdeffery. [bhe@redhat.com: fix building error in generic codes] Link: https://lkml.kernel.org/r/20240129135033.157195-2-bhe@redhat.com Link: https://lkml.kernel.org/r/20240124051254.67105-8-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
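A minimal sketch of the ifdeffery pattern described above; reserve_crashkernel_sketch() and bootmem_init_sketch() are hypothetical stand-ins, not the actual arm64 hunks touched by this commit.

#include <linux/init.h>

static void __init reserve_crashkernel_sketch(void) { /* placeholder */ }

static void __init bootmem_init_sketch(void)
{
	/* Non-crash boot-time reservations happen unconditionally... */

#ifdef CONFIG_CRASH_DUMP
	/*
	 * ...while crash-dump-only work is now compiled out entirely
	 * when CONFIG_CRASH_DUMP is disabled.
	 */
	reserve_crashkernel_sketch();
#endif
}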
2024-02-23arm64/mm: automatically fold contpte mappingsRyan Roberts1-0/+64
There are situations where a change to a single PTE could cause the contpte block in which it resides to become foldable (i.e. could be repainted with the contiguous bit). Such situations arise, for example, when user space temporarily changes protections, via mprotect, for individual pages, as can be the case for certain garbage collectors. We would like to detect when such a PTE change occurs. However, this can be expensive due to the amount of checking required. Therefore, only perform the checks when an individual PTE is modified via mprotect (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only when we are setting the final PTE in a contpte-aligned block. Link: https://lkml.kernel.org/r/20240215103205.2607016-19-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Barry Song <21cnbao@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
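Roughly, the gate described above looks like the following; worth_trying_contpte_fold() is a hypothetical helper written for illustration, while CONT_PTES and PAGE_SHIFT are the existing arm64/kernel constants.

#include <asm/page.h>		/* PAGE_SHIFT */
#include <asm/pgtable-hwdef.h>	/* CONT_PTES */

/*
 * Only bother with the (relatively expensive) fold check when a
 * single-PTE update lands on the last entry of its contpte block.
 */
static inline bool worth_trying_contpte_fold(unsigned long addr,
					     unsigned int nr)
{
	if (nr != 1)
		return false;

	return ((addr >> PAGE_SHIFT) % CONT_PTES) == CONT_PTES - 1;
}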
2024-02-23arm64/mm: implement new [get_and_]clear_full_ptes() batch APIsRyan Roberts1-0/+17
Optimize the contpte implementation to fix some of the exit/munmap/dontneed performance regression introduced by the initial contpte commit. Subsequent patches will solve it entirely. During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be cleared. Previously this was done 1 PTE at a time. But the core-mm supports batched clear via the new [get_and_]clear_full_ptes() APIs. So let's implement those APIs and for fully covered contpte mappings, we no longer need to unfold the contpte. This significantly reduces unfolding operations, reducing the number of tlbis that must be issued. Link: https://lkml.kernel.org/r/20240215103205.2607016-15-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Barry Song <21cnbao@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
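A loose sketch of what such a batched clear amounts to; the _sketch suffix is deliberate, since the real arm64 implementation also handles contpte blocks and TLB maintenance differently, and the 'full' hint is ignored here.

#include <linux/mm.h>

static inline pte_t get_and_clear_full_ptes_sketch(struct mm_struct *mm,
		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
{
	/*
	 * Clear the first entry, then fold the access/dirty state of the
	 * remaining entries into the returned pte.
	 */
	pte_t pte = __ptep_get_and_clear(mm, addr, ptep);

	while (--nr) {
		pte_t tmp;

		addr += PAGE_SIZE;
		ptep++;
		tmp = __ptep_get_and_clear(mm, addr, ptep);
		if (pte_dirty(tmp))
			pte = pte_mkdirty(pte);
		if (pte_young(tmp))
			pte = pte_mkyoung(pte);
	}
	return pte;
}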
2024-02-23arm64/mm: implement new wrprotect_ptes() batch APIRyan Roberts1-0/+38
Optimize the contpte implementation to fix some of the fork performance regression introduced by the initial contpte commit. Subsequent patches will solve it entirely. During fork(), any private memory in the parent must be write-protected. Previously this was done 1 PTE at a time. But the core-mm supports batched wrprotect via the new wrprotect_ptes() API. So let's implement that API and for fully covered contpte mappings, we no longer need to unfold the contpte. This has 2 benefits: - reduced unfolding lowers the number of tlbis that must be issued. - The memory remains contpte-mapped ("folded") in the parent, so it continues to benefit from the more efficient use of the TLB after the fork. The optimization to wrprotect a whole contpte block without unfolding is possible thanks to the tightening of the Arm ARM with respect to the definition and behaviour when 'Misprogramming the Contiguous bit'. See section D21194 at https://developer.arm.com/documentation/102105/ja-07/ Link: https://lkml.kernel.org/r/20240215103205.2607016-14-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Barry Song <21cnbao@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
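Sketched shape of a batched wrprotect over nr contiguous entries; the point of the patch is that the real arm64 version additionally keeps fully covered contpte blocks folded, which this simplified loop does not show.

#include <linux/mm.h>

static inline void wrprotect_ptes_sketch(struct mm_struct *mm,
					 unsigned long addr,
					 pte_t *ptep, unsigned int nr)
{
	/* Write-protect nr consecutive entries in one pass. */
	for (; nr; nr--, addr += PAGE_SIZE, ptep++)
		__ptep_set_wrprotect(mm, addr, ptep);
}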
2024-02-23arm64/mm: wire up PTE_CONT for user mappingsRyan Roberts2-0/+286
With the ptep API sufficiently refactored, we can now introduce a new "contpte" API layer, which transparently manages the PTE_CONT bit for user mappings. In this initial implementation, only suitable batches of PTEs, set via set_ptes(), are mapped with the PTE_CONT bit. Any subsequent modification of individual PTEs will cause an "unfold" operation to repaint the contpte block as individual PTEs before performing the requested operation. While a modification of a single PTE could cause the block of PTEs to which it belongs to become eligible for "folding" into a contpte entry, "folding" is not performed in this initial implementation due to the cost of checking that the requirements are met. Due to this, contpte mappings will degrade back to normal pte mappings over time if/when protections are changed. This will be solved in a future patch. Since a contpte block only has a single access and dirty bit, the semantic here changes slightly; when getting a pte (e.g. ptep_get()) that is part of a contpte mapping, the access and dirty information are pulled from the block (so all ptes in the block return the same access/dirty info). When changing the access/dirty info on a pte (e.g. ptep_set_access_flags()) that is part of a contpte mapping, this change will affect the whole contpte block. This works fine in practice since we guarantee that only a single folio is mapped by a contpte block, and the core-mm tracks access/dirty information per folio. In order for the public functions, which used to be pure inline, to continue to be callable by modules, export all the contpte_* symbols that are now called by those public inline functions. The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter at build time. It defaults to enabled as long as its dependency, TRANSPARENT_HUGEPAGE, is also enabled. The core-mm depends upon TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not enabled, then there is no chance of meeting the physical contiguity requirement for contpte mappings. Link: https://lkml.kernel.org/r/20240215103205.2607016-13-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
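A rough sketch of the access/dirty gathering that a contpte-aware ptep_get() performs, per the description above. The _sketch helpers are illustrative; CONT_PTES, PTR_ALIGN_DOWN and the pte_* accessors are existing kernel definitions, and the alignment trick assumes the block's PTEs are naturally aligned within the page table page.

#include <linux/align.h>
#include <linux/mm.h>

static inline pte_t *contpte_align_down_sketch(pte_t *ptep)
{
	/* First entry of the naturally aligned contpte block. */
	return PTR_ALIGN_DOWN(ptep, CONT_PTES * sizeof(*ptep));
}

static inline pte_t contpte_ptep_get_sketch(pte_t *ptep)
{
	pte_t *first = contpte_align_down_sketch(ptep);
	pte_t pte = __ptep_get(ptep);
	unsigned int i;

	/*
	 * The block shares one access/dirty state, so report the union
	 * of every entry's bits for whichever PTE was asked about.
	 */
	for (i = 0; i < CONT_PTES; i++) {
		pte_t entry = __ptep_get(first + i);

		if (pte_dirty(entry))
			pte = pte_mkdirty(pte);
		if (pte_young(entry))
			pte = pte_mkyoung(pte);
	}
	return pte;
}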
2024-02-23arm64/mm: new ptep layer to manage contig bitRyan Roberts7-44/+44
Create a new layer for the in-table PTE manipulation APIs. For now, the existing API is prefixed with a double underscore to become the arch-private API, and the public API is just a simple wrapper that calls the private API. The public API implementation will subsequently be used to transparently manipulate the contiguous bit where appropriate. But since there are already some contig-aware users (e.g. hugetlb, kernel mapper), we must first ensure those users use the private API directly so that the future contig-bit manipulations in the public API do not interfere with those existing uses. The following APIs are treated this way: - ptep_get - set_pte - set_ptes - pte_clear - ptep_get_and_clear - ptep_test_and_clear_young - ptep_clear_flush_young - ptep_set_wrprotect - ptep_set_access_flags Link: https://lkml.kernel.org/r/20240215103205.2607016-11-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Barry Song <21cnbao@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
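The wrapper pattern described above looks roughly like this at this stage of the series (contpte handling is layered into the public side by later patches):

static inline pte_t ptep_get(pte_t *ptep)
{
	/* Public helper simply forwards to the arch-private version. */
	return __ptep_get(ptep);
}

static inline void ptep_set_wrprotect(struct mm_struct *mm,
				      unsigned long addr, pte_t *ptep)
{
	__ptep_set_wrprotect(mm, addr, ptep);
}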
2024-02-23arm64/mm: convert ptep_clear() to ptep_get_and_clear()Ryan Roberts1-1/+1
ptep_clear() is a generic wrapper around the arch-implemented ptep_get_and_clear(). We are about to convert ptep_get_and_clear() into a public version and private version (__ptep_get_and_clear()) to support the transparent contpte work. We won't have a private version of ptep_clear() so let's convert it to directly call ptep_get_and_clear(). Link: https://lkml.kernel.org/r/20240215103205.2607016-10-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Barry Song <21cnbao@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23arm64/mm: convert set_pte_at() to set_ptes(..., 1)Ryan Roberts2-6/+6
Since set_ptes() was introduced, set_pte_at() has been implemented as a generic macro around set_ptes(..., 1). So this change should continue to generate the same code. However, making this change prepares us for the transparent contpte support. It means we can reroute set_ptes() to __set_ptes(). Since set_pte_at() is a generic macro, there will be no equivalent __set_pte_at() to reroute to. Note that a couple of calls to set_pte_at() remain in the arch code. This is intentional, since those call sites are acting on behalf of core-mm and should continue to call into the public set_ptes() rather than the arch-private __set_ptes(). Link: https://lkml.kernel.org/r/20240215103205.2607016-9-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Barry Song <21cnbao@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
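For reference, the generic definition this conversion relies on has roughly this shape (paraphrased from the description above rather than quoted from the tree):

/* set_pte_at() is a thin generic macro over the batched API. */
#define set_pte_at(mm, addr, ptep, pte)	set_ptes(mm, addr, ptep, pte, 1)

With that in place, arm64 only needs to reroute set_ptes() to __set_ptes(); no separate __set_pte_at() is required.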
2024-02-23arm64/mm: convert READ_ONCE(*ptep) to ptep_get(ptep)Ryan Roberts6-15/+15
There are a number of places in the arch code that read a pte by using the READ_ONCE() macro. Refactor these call sites to instead use the ptep_get() helper, which itself is a READ_ONCE(). Generated code should be the same. This will benefit us when we shortly introduce the transparent contpte support. In this case, ptep_get() will become more complex so we now have all the code abstracted through it. Link: https://lkml.kernel.org/r/20240215103205.2607016-8-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Barry Song <21cnbao@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/hugetlb: move page order check inside hugetlb_cma_reserve()Anshuman Khandual1-7/+0
All platforms could benefit from page order check against MAX_PAGE_ORDER before allocating a CMA area for gigantic hugetlb pages. Let's move this check from individual platforms to generic hugetlb. Link: https://lkml.kernel.org/r/20240209054221.1403364-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
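A hedged sketch of the centralised check; the warn-and-return body and function name are illustrative, with only the order-versus-MAX_PAGE_ORDER comparison inferred from the description above (gigantic pages are larger than the largest buddy allocation).

#include <linux/bug.h>
#include <linux/init.h>
#include <linux/mmzone.h>	/* MAX_PAGE_ORDER */

void __init hugetlb_cma_reserve_sketch(int order)
{
	/*
	 * Gigantic pages must exceed MAX_PAGE_ORDER; anything smaller
	 * indicates a misconfigured caller, so warn and bail out.
	 */
	if (WARN_ON_ONCE(order <= MAX_PAGE_ORDER))
		return;

	/* ... carve out the per-node CMA areas ... */
}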
2024-02-22mm: ptdump: have ptdump_check_wx() return boolChristophe Leroy1-3/+8
Have ptdump_check_wx() return true when the check is successful or false otherwise. [akpm@linux-foundation.org: fix a couple of build issues (x86_64 allmodconfig)] Link: https://lkml.kernel.org/r/7943149fe955458cb7b57cd483bf41a3aad94684.1706610398.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Phong Tran <tranmanphong@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Price <steven.price@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22arm64, powerpc, riscv, s390, x86: ptdump: refactor CONFIG_DEBUG_WXChristophe Leroy1-2/+0
All architectures using the core ptdump functionality also implement CONFIG_DEBUG_WX, and they all do it more or less the same way, with a function called debug_checkwx() that is called by mark_rodata_ro(), which is a substitute for ptdump_check_wx() when CONFIG_DEBUG_WX is set and a no-op otherwise. Refactor by centrally defining debug_checkwx() in linux/ptdump.h and call debug_checkwx() immediately after calling mark_rodata_ro() instead of calling it at the end of every mark_rodata_ro(). On x86_32, mark_rodata_ro() first checks __supported_pte_mask has _PAGE_NX before calling debug_checkwx(). Now the check is inside the callee ptdump_walk_pgd_level_checkwx(). On powerpc_64, mark_rodata_ro() bails out early before calling ptdump_check_wx() when the MMU doesn't have KERNEL_RO feature. The check is now also done in ptdump_check_wx() as it is called outside mark_rodata_ro(). Link: https://lkml.kernel.org/r/a59b102d7964261d31ead0316a9f18628e4e7a8e.1706610398.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Phong Tran <tranmanphong@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Price <steven.price@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
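One plausible shape for the centralised helper in linux/ptdump.h, assuming a static inline rather than a macro; the exact form in the tree may differ.

#ifdef CONFIG_DEBUG_WX
static inline void debug_checkwx(void)
{
	/* Walk the kernel page tables and warn about W+X mappings. */
	ptdump_check_wx();
}
#else
static inline void debug_checkwx(void)
{
}
#endif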
2024-02-16arm64: mm: add support for WXN memory translation attributeArd Biesheuvel1-0/+6
The AArch64 virtual memory system supports a global WXN control, which can be enabled to make all writable mappings implicitly no-exec. This is a useful hardening feature, as it prevents mistakes in managing page table permissions from being exploited to attack the system. When enabled at EL1, the restrictions apply to both EL1 and EL0. EL1 is completely under our control, and has been cleaned up to allow WXN to be enabled from boot onwards. EL0 is not under our control, but given that widely deployed security features such as selinux or PaX already limit the ability of user space to create mappings that are writable and executable at the same time, the impact of enabling this for EL0 is expected to be limited. (For this reason, common user space libraries that have a legitimate need for manipulating executable code already carry fallbacks such as [0].) If enabled at compile time, the feature can still be disabled at boot if needed, by passing arm64.nowxn on the kernel command line. [0] https://github.com/libffi/libffi/blob/master/src/closures.c#L440 Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240214122845.2033971-88-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16arm64: ptdump: Deal with translation levels folded at runtimeArd Biesheuvel1-5/+12
Currently, the ptdump code deals with folded PMD or PUD levels at build time, by omitting those levels when invoking note_page. IOW, note_page() is never invoked with level == 1 if P4Ds are folded in the build configuration. With the introduction of LPA2 support, we will defer some of these folding decisions to runtime, so let's take care of this by overriding the 'level' argument when this condition triggers. Substituting the PUD or PMD strings for "PGD" when the level in question is folded at build time is no longer necessary, and so the conditional expressions can be simplified. This also makes the indirection of the 'name' field unnecessary, so change that into a char[] array, and make the whole thing __ro_after_init. Note that the mm_p?d_folded() functions currently ignore their mm pointer arguments, but let's wire them up correctly anyway. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240214122845.2033971-83-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16arm64: ptdump: Disregard unaddressable VA spaceArd Biesheuvel1-2/+2
Configurations built with support for 52-bit virtual addressing can also run on CPUs that only support 48 bits of VA space, in which case only that part of swapper_pg_dir that represents the 48-bit addressable region is relevant, and everything else is ignored by the hardware. Our software pagetable walker has little in the way of input address validation, and so it will happily start a walk from an address that is not representable by the number of paging levels that are actually active, resulting in lots of bogus output from the page table dumper unless we take care to start at a valid address. So define the start address at runtime based on vabits_actual. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240214122845.2033971-82-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16arm64: mm: Add support for folding PUDs at runtimeArd Biesheuvel2-1/+3
In order to support LPA2 on 16k pages in a way that permits non-LPA2 systems to run the same kernel image, we have to be able to fall back to at most 48 bits of virtual addressing. Falling back to 48 bits would result in a level 0 with only 2 entries, which is suboptimal in terms of TLB utilization. So instead, let's fall back to 47 bits in that case. This means we need to be able to fold PUDs dynamically, similar to how we fold P4Ds for 48 bit virtual addressing on LPA2 with 4k pages. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240214122845.2033971-81-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16arm64: kasan: Reduce minimum shadow alignment and enable 5 level pagingArd Biesheuvel1-19/+129
Allow the KASAN init code to deal with 5 levels of paging, and relax the requirement that the shadow region is aligned to the top level pgd_t size. This is necessary for LPA2 based 52-bit virtual addressing, where the KASAN shadow will never be aligned to the pgd_t size. Allowing this also enables the 16k/48-bit case for KASAN, which is a nice bonus. This involves some hackery to manipulate the root and next level page tables without having to distinguish all the various configurations, including 16k/48-bits (which has a two entry pgd_t level), and LPA2 configurations running with one translation level less on non-LPA2 hardware. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240214122845.2033971-80-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16arm64: mm: Add 5 level paging support to fixmap and swapper handlingArd Biesheuvel2-5/+44
Add support for using 5 levels of paging in the fixmap, as well as in the kernel page table handling code which uses fixmaps internally. This also handles the case where a 5 level build runs on hardware that only supports 4 levels of paging. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240214122845.2033971-79-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16arm64: Enable LPA2 at boot if supported by the systemArd Biesheuvel3-2/+9
Update the early kernel mapping code to take 52-bit virtual addressing into account based on the LPA2 feature. This is a bit more involved than LVA (which is supported with 64k pages only), given that some page table descriptor bits change meaning in this case. To keep the handling in asm to a minimum, the initial ID map is still created with 48-bit virtual addressing, which implies that the kernel image must be loaded into 48-bit addressable physical memory. This is currently required by the boot protocol, even though we happen to support placement outside of that for LVA/64k based configurations. Enabling LPA2 involves more than setting TCR.T1SZ to a lower value, there is also a DS bit in TCR that needs to be set, and which changes the meaning of bits [9:8] in all page table descriptors. Since we cannot enable DS and every live page table descriptor at the same time, let's pivot through another temporary mapping. This avoids the need to reintroduce manipulations of the page tables with the MMU and caches disabled. To permit the LPA2 feature to be overridden on the kernel command line, which may be necessary to work around silicon errata, or to deal with mismatched features on heterogeneous SoC designs, test for CPU feature overrides first, and only then enable LPA2. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240214122845.2033971-78-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16arm64: mm: add LPA2 and 5 level paging support to G-to-nG conversionArd Biesheuvel1-10/+60
Add support for 5 level paging in the G-to-nG routine that creates its own temporary page tables to traverse the swapper page tables. Also add support for running the 5 level configuration with the top level folded at runtime, to support CPUs that do not implement the LPA2 extension. While at it, wire up the level skipping logic so it will also trigger on 4 level configurations with LPA2 enabled at build time but not active at runtime, as we'll fall back to 3 level paging in that case. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240214122845.2033971-77-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16arm64: mm: Add definitions to support 5 levels of pagingArd Biesheuvel2-5/+41
Add the required types and descriptor accessors to support 5 levels of paging in the common code. This is one of the prerequisites for supporting 52-bit virtual addressing with 4k pages. Note that this does not cover the code that handles kernel mappings or the fixmap. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240214122845.2033971-76-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>