From 2f440b72e852be428540579b5813ba2b8236578d Mon Sep 17 00:00:00 2001 From: Ricardo Koller Date: Wed, 26 Apr 2023 17:23:23 +0000 Subject: KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE Add a capability for userspace to specify the eager split chunk size. The chunk size specifies how many pages to break at a time, using a single allocation. Bigger the chunk size, more pages need to be allocated ahead of time. Suggested-by: Oliver Upton Signed-off-by: Ricardo Koller Reviewed-by: Gavin Shan Link: https://lore.kernel.org/r/20230426172330.1439644-6-ricarkol@google.com Signed-off-by: Oliver Upton --- Documentation/virt/kvm/api.rst | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) (limited to 'Documentation') diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index add067793b90..656bd293c8f4 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -8445,6 +8445,33 @@ structure. When getting the Modified Change Topology Report value, the attr->addr must point to a byte where the value will be stored or retrieved from. +8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE +--------------------------------------- + +:Capability: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE +:Architectures: arm64 +:Type: vm +:Parameters: arg[0] is the new split chunk size. +:Returns: 0 on success, -EINVAL if any memslot was already created. + +This capability sets the chunk size used in Eager Page Splitting. + +Eager Page Splitting improves the performance of dirty-logging (used +in live migrations) when guest memory is backed by huge-pages. It +avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing +it eagerly when enabling dirty logging (with the +KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using +KVM_CLEAR_DIRTY_LOG. + +The chunk size specifies how many pages to break at a time, using a +single allocation for each chunk. Bigger the chunk size, more pages +need to be allocated ahead of time. + +The chunk size needs to be a valid block size. The list of acceptable +block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a +64-bit bitmap (each bit describing a block size). The default value is +0, to disable the eager page splitting. + 9. Known KVM API problems ========================= -- cgit v1.2.3 From 3e35d303ab7d22c4b6597e56ba46ee7cc61f3a5a Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Tue, 30 May 2023 12:03:28 +0100 Subject: arm64: module: rework module VA range selection Currently, the modules region is 128M in size, which is a problem for some large modules. Shanker reports [1] that the NVIDIA GPU driver alone can consume 110M of module space in some configurations. We'd like to make the modules region a full 2G such that we can always make use of a 2G range. It's possible to build kernel images which are larger than 128M in some configurations, such as when many debug options are selected and many drivers are built in. In these configurations, we can't legitimately select a base for a 128M module region, though we currently select a value for which allocation will fail. It would be nicer to have a diagnostic message in this case. Similarly, in theory it's possible to build a kernel image which is larger than 2G and which cannot support modules. While this isn't likely to be the case for any realistic kernel deplyed in the field, it would be nice if we could print a diagnostic in this case. This patch reworks the module VA range selection to use a 2G range, and improves handling of cases where we cannot select legitimate module regions. We now attempt to select a 128M region and a 2G region: * The 128M region is selected such that modules can use direct branches (with JUMP26/CALL26 relocations) to branch to kernel code and other modules, and so that modules can reference data and text (using PREL32 relocations) anywhere in the kernel image and other modules. This region covers the entire kernel image (rather than just the text) to ensure that all PREL32 relocations are in range even when the kernel data section is absurdly large. Where we cannot allocate from this region, we'll fall back to the full 2G region. * The 2G region is selected such that modules can use direct branches with PLTs to branch to kernel code and other modules, and so that modules can use reference data and text (with PREL32 relocations) in the kernel image and other modules. This region covers the entire kernel image, and the 128M region (if one is selected). The two module regions are randomized independently while ensuring the constraints described above. [1] https://lore.kernel.org/linux-arm-kernel/159ceeab-09af-3174-5058-445bc8dcf85b@nvidia.com/ Signed-off-by: Mark Rutland Reviewed-by: Ard Biesheuvel Cc: Shanker Donthineni Cc: Will Deacon Tested-by: Shanker Donthineni Link: https://lore.kernel.org/r/20230530110328.2213762-7-mark.rutland@arm.com Signed-off-by: Catalin Marinas --- Documentation/arm64/memory.rst | 8 +-- arch/arm64/include/asm/memory.h | 2 +- arch/arm64/kernel/module.c | 139 +++++++++++++++++++++++++++------------- 3 files changed, 99 insertions(+), 50 deletions(-) (limited to 'Documentation') diff --git a/Documentation/arm64/memory.rst b/Documentation/arm64/memory.rst index 2a641ba7be3b..55a55f30eed8 100644 --- a/Documentation/arm64/memory.rst +++ b/Documentation/arm64/memory.rst @@ -33,8 +33,8 @@ AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit):: 0000000000000000 0000ffffffffffff 256TB user ffff000000000000 ffff7fffffffffff 128TB kernel logical memory map [ffff600000000000 ffff7fffffffffff] 32TB [kasan shadow region] - ffff800000000000 ffff800007ffffff 128MB modules - ffff800008000000 fffffbffefffffff 124TB vmalloc + ffff800000000000 ffff80007fffffff 2GB modules + ffff800080000000 fffffbffefffffff 124TB vmalloc fffffbfff0000000 fffffbfffdffffff 224MB fixed mappings (top down) fffffbfffe000000 fffffbfffe7fffff 8MB [guard region] fffffbfffe800000 fffffbffff7fffff 16MB PCI I/O space @@ -50,8 +50,8 @@ AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support): 0000000000000000 000fffffffffffff 4PB user fff0000000000000 ffff7fffffffffff ~4PB kernel logical memory map [fffd800000000000 ffff7fffffffffff] 512TB [kasan shadow region] - ffff800000000000 ffff800007ffffff 128MB modules - ffff800008000000 fffffbffefffffff 124TB vmalloc + ffff800000000000 ffff80007fffffff 2GB modules + ffff800080000000 fffffbffefffffff 124TB vmalloc fffffbfff0000000 fffffbfffdffffff 224MB fixed mappings (top down) fffffbfffe000000 fffffbfffe7fffff 8MB [guard region] fffffbfffe800000 fffffbffff7fffff 16MB PCI I/O space diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h index 215efc3bbbcf..6e0e5722f229 100644 --- a/arch/arm64/include/asm/memory.h +++ b/arch/arm64/include/asm/memory.h @@ -46,7 +46,7 @@ #define KIMAGE_VADDR (MODULES_END) #define MODULES_END (MODULES_VADDR + MODULES_VSIZE) #define MODULES_VADDR (_PAGE_END(VA_BITS_MIN)) -#define MODULES_VSIZE (SZ_128M) +#define MODULES_VSIZE (SZ_2G) #define VMEMMAP_START (-(UL(1) << (VA_BITS - VMEMMAP_SHIFT))) #define VMEMMAP_END (VMEMMAP_START + VMEMMAP_SIZE) #define PCI_IO_END (VMEMMAP_START - SZ_8M) diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index f64636c2fd05..dd851297596e 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -7,6 +7,8 @@ * Author: Will Deacon */ +#define pr_fmt(fmt) "Modules: " fmt + #include #include #include @@ -24,72 +26,119 @@ #include #include -static u64 __ro_after_init module_alloc_base = (u64)_etext - MODULES_VSIZE; +static u64 module_direct_base __ro_after_init = 0; +static u64 module_plt_base __ro_after_init = 0; -#ifdef CONFIG_RANDOMIZE_BASE -static int __init kaslr_module_init(void) +/* + * Choose a random page-aligned base address for a window of 'size' bytes which + * entirely contains the interval [start, end - 1]. + */ +static u64 __init random_bounding_box(u64 size, u64 start, u64 end) { - u64 module_range; - u32 seed; + u64 max_pgoff, pgoff; - if (!kaslr_enabled()) + if ((end - start) >= size) return 0; - seed = get_random_u32(); + max_pgoff = (size - (end - start)) / PAGE_SIZE; + pgoff = get_random_u32_inclusive(0, max_pgoff); - if (IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) { - /* - * Randomize the module region over a 2 GB window covering the - * kernel. This reduces the risk of modules leaking information - * about the address of the kernel itself, but results in - * branches between modules and the core kernel that are - * resolved via PLTs. (Branches between modules will be - * resolved normally.) - */ - module_range = SZ_2G - (u64)(_end - _stext); - module_alloc_base = max((u64)_end - SZ_2G, (u64)MODULES_VADDR); + return start - pgoff * PAGE_SIZE; +} + +/* + * Modules may directly reference data and text anywhere within the kernel + * image and other modules. References using PREL32 relocations have a +/-2G + * range, and so we need to ensure that the entire kernel image and all modules + * fall within a 2G window such that these are always within range. + * + * Modules may directly branch to functions and code within the kernel text, + * and to functions and code within other modules. These branches will use + * CALL26/JUMP26 relocations with a +/-128M range. Without PLTs, we must ensure + * that the entire kernel text and all module text falls within a 128M window + * such that these are always within range. With PLTs, we can expand this to a + * 2G window. + * + * We chose the 128M region to surround the entire kernel image (rather than + * just the text) as using the same bounds for the 128M and 2G regions ensures + * by construction that we never select a 128M region that is not a subset of + * the 2G region. For very large and unusual kernel configurations this means + * we may fall back to PLTs where they could have been avoided, but this keeps + * the logic significantly simpler. + */ +static int __init module_init_limits(void) +{ + u64 kernel_end = (u64)_end; + u64 kernel_start = (u64)_text; + u64 kernel_size = kernel_end - kernel_start; + + /* + * The default modules region is placed immediately below the kernel + * image, and is large enough to use the full 2G relocation range. + */ + BUILD_BUG_ON(KIMAGE_VADDR != MODULES_END); + BUILD_BUG_ON(MODULES_VSIZE < SZ_2G); + + if (!kaslr_enabled()) { + if (kernel_size < SZ_128M) + module_direct_base = kernel_end - SZ_128M; + if (kernel_size < SZ_2G) + module_plt_base = kernel_end - SZ_2G; } else { - /* - * Randomize the module region by setting module_alloc_base to - * a PAGE_SIZE multiple in the range [_etext - MODULES_VSIZE, - * _stext) . This guarantees that the resulting region still - * covers [_stext, _etext], and that all relative branches can - * be resolved without veneers unless this region is exhausted - * and we fall back to a larger 2GB window in module_alloc() - * when ARM64_MODULE_PLTS is enabled. - */ - module_range = MODULES_VSIZE - (u64)(_etext - _stext); + u64 min = kernel_start; + u64 max = kernel_end; + + if (IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) { + pr_info("2G module region forced by RANDOMIZE_MODULE_REGION_FULL\n"); + } else { + module_direct_base = random_bounding_box(SZ_128M, min, max); + if (module_direct_base) { + min = module_direct_base; + max = module_direct_base + SZ_128M; + } + } + + module_plt_base = random_bounding_box(SZ_2G, min, max); } - /* use the lower 21 bits to randomize the base of the module region */ - module_alloc_base += (module_range * (seed & ((1 << 21) - 1))) >> 21; - module_alloc_base &= PAGE_MASK; + pr_info("%llu pages in range for non-PLT usage", + module_direct_base ? (SZ_128M - kernel_size) / PAGE_SIZE : 0); + pr_info("%llu pages in range for PLT usage", + module_plt_base ? (SZ_2G - kernel_size) / PAGE_SIZE : 0); return 0; } -subsys_initcall(kaslr_module_init) -#endif +subsys_initcall(module_init_limits); void *module_alloc(unsigned long size) { - u64 module_alloc_end = module_alloc_base + MODULES_VSIZE; - void *p; + void *p = NULL; /* * Where possible, prefer to allocate within direct branch range of the - * kernel such that no PLTs are necessary. This may fail, so we pass - * __GFP_NOWARN to silence the resulting warning. + * kernel such that no PLTs are necessary. */ - p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base, - module_alloc_end, GFP_KERNEL | __GFP_NOWARN, - PAGE_KERNEL, 0, NUMA_NO_NODE, - __builtin_return_address(0)); + if (module_direct_base) { + p = __vmalloc_node_range(size, MODULE_ALIGN, + module_direct_base, + module_direct_base + SZ_128M, + GFP_KERNEL | __GFP_NOWARN, + PAGE_KERNEL, 0, NUMA_NO_NODE, + __builtin_return_address(0)); + } + + if (!p && module_plt_base) { + p = __vmalloc_node_range(size, MODULE_ALIGN, + module_plt_base, + module_plt_base + SZ_2G, + GFP_KERNEL | __GFP_NOWARN, + PAGE_KERNEL, 0, NUMA_NO_NODE, + __builtin_return_address(0)); + } if (!p) { - p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base, - module_alloc_base + SZ_2G, GFP_KERNEL, - PAGE_KERNEL, 0, NUMA_NO_NODE, - __builtin_return_address(0)); + pr_warn_ratelimited("%s: unable to allocate memory\n", + __func__); } if (p && (kasan_alloc_module_shadow(p, size, GFP_KERNEL) < 0)) { -- cgit v1.2.3 From 6df696cd9bc1ceed0e92e36908f88bbd16d18255 Mon Sep 17 00:00:00 2001 From: Oliver Upton Date: Fri, 9 Jun 2023 22:01:02 +0000 Subject: arm64: errata: Mitigate Ampere1 erratum AC03_CPU_38 at stage-2 AmpereOne has an erratum in its implementation of FEAT_HAFDBS that required disabling the feature on the design. This was done by reporting the feature as not implemented in the ID register, although the corresponding control bits were not actually RES0. This does not align well with the requirements of the architecture, which mandates these bits be RES0 if HAFDBS isn't implemented. The kernel's use of stage-1 is unaffected, as the HA and HD bits are only set if HAFDBS is detected in the ID register. KVM, on the other hand, relies on the RES0 behavior at stage-2 to use the same value for VTCR_EL2 on any cpu in the system. Mitigate the non-RES0 behavior by leaving VTCR_EL2.HA clear on affected systems. Cc: stable@vger.kernel.org Cc: D Scott Phillips Cc: Darren Hart Acked-by: D Scott Phillips Acked-by: Catalin Marinas Link: https://lore.kernel.org/r/20230609220104.1836988-2-oliver.upton@linux.dev Signed-off-by: Oliver Upton --- Documentation/arm64/silicon-errata.rst | 3 +++ arch/arm64/Kconfig | 19 +++++++++++++++++++ arch/arm64/kernel/cpu_errata.c | 7 +++++++ arch/arm64/kvm/hyp/pgtable.c | 14 +++++++++++--- arch/arm64/tools/cpucaps | 1 + 5 files changed, 41 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst index 9e311bc43e05..cd46e2b20a81 100644 --- a/Documentation/arm64/silicon-errata.rst +++ b/Documentation/arm64/silicon-errata.rst @@ -52,6 +52,9 @@ stable kernels. | Allwinner | A64/R18 | UNKNOWN1 | SUN50I_ERRATUM_UNKNOWN1 | +----------------+-----------------+-----------------+-----------------------------+ +----------------+-----------------+-----------------+-----------------------------+ +| Ampere | AmpereOne | AC03_CPU_38 | AMPERE_ERRATUM_AC03_CPU_38 | ++----------------+-----------------+-----------------+-----------------------------+ ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A510 | #2457168 | ARM64_ERRATUM_2457168 | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A510 | #2064142 | ARM64_ERRATUM_2064142 | diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index b1201d25a8a4..0987c637fbf2 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -406,6 +406,25 @@ menu "Kernel Features" menu "ARM errata workarounds via the alternatives framework" +config AMPERE_ERRATUM_AC03_CPU_38 + bool "AmpereOne: AC03_CPU_38: Certain bits in the Virtualization Translation Control Register and Translation Control Registers do not follow RES0 semantics" + default y + help + This option adds an alternative code sequence to work around Ampere + erratum AC03_CPU_38 on AmpereOne. + + The affected design reports FEAT_HAFDBS as not implemented in + ID_AA64MMFR1_EL1.HAFDBS, but (V)TCR_ELx.{HA,HD} are not RES0 + as required by the architecture. The unadvertised HAFDBS + implementation suffers from an additional erratum where hardware + A/D updates can occur after a PTE has been marked invalid. + + The workaround forces KVM to explicitly set VTCR_EL2.HA to 0, + which avoids enabling unadvertised hardware Access Flag management + at stage-2. + + If unsure, say Y. + config ARM64_WORKAROUND_CLEAN_CACHE bool diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c index 307faa2b4395..be66e94a21bd 100644 --- a/arch/arm64/kernel/cpu_errata.c +++ b/arch/arm64/kernel/cpu_errata.c @@ -729,6 +729,13 @@ const struct arm64_cpu_capabilities arm64_errata[] = { MIDR_FIXED(MIDR_CPU_VAR_REV(1,1), BIT(25)), .cpu_enable = cpu_clear_bf16_from_user_emulation, }, +#endif +#ifdef CONFIG_AMPERE_ERRATUM_AC03_CPU_38 + { + .desc = "AmpereOne erratum AC03_CPU_38", + .capability = ARM64_WORKAROUND_AMPERE_AC03_CPU_38, + ERRATA_MIDR_ALL_VERSIONS(MIDR_AMPERE1), + }, #endif { } diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c index 5282cb9ca4cf..32d92ce4bae5 100644 --- a/arch/arm64/kvm/hyp/pgtable.c +++ b/arch/arm64/kvm/hyp/pgtable.c @@ -611,10 +611,18 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift) #ifdef CONFIG_ARM64_HW_AFDBM /* * Enable the Hardware Access Flag management, unconditionally - * on all CPUs. The features is RES0 on CPUs without the support - * and must be ignored by the CPUs. + * on all CPUs. In systems that have asymmetric support for the feature + * this allows KVM to leverage hardware support on the subset of cores + * that implement the feature. + * + * The architecture requires VTCR_EL2.HA to be RES0 (thus ignored by + * hardware) on implementations that do not advertise support for the + * feature. As such, setting HA unconditionally is safe, unless you + * happen to be running on a design that has unadvertised support for + * HAFDBS. Here be dragons. */ - vtcr |= VTCR_EL2_HA; + if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38)) + vtcr |= VTCR_EL2_HA; #endif /* CONFIG_ARM64_HW_AFDBM */ /* Set the vmid bits */ diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps index 40ba95472594..9f9a2d6652eb 100644 --- a/arch/arm64/tools/cpucaps +++ b/arch/arm64/tools/cpucaps @@ -77,6 +77,7 @@ WORKAROUND_2077057 WORKAROUND_2457168 WORKAROUND_2645198 WORKAROUND_2658417 +WORKAROUND_AMPERE_AC03_CPU_38 WORKAROUND_TRBE_OVERWRITE_FILL_MODE WORKAROUND_TSB_FLUSH_FAILURE WORKAROUND_TRBE_WRITE_OUT_OF_RANGE -- cgit v1.2.3