summaryrefslogtreecommitdiff
path: root/arch/s390/boot/vmem.c
AgeCommit message (Collapse)AuthorFilesLines
2024-06-11s390/mm: Restore mapping of kernel image using large pagesAlexander Gordeev1-1/+1
Since physical and virtual kernel address spaces are uncoupled the kernel image is not mapped using large segment pages anymore, which is a regression. Put the kernel image at the same large segment page offset in physical memory as in virtual memory. Such approach preserves the existing number of bits of entropy used for randomization of the kernel location in virtual memory when KASLR is on. As result, the kernel is mapped using large segment pages. Fixes: c98d2ecae08f ("s390/mm: Uncouple physical vs virtual address spaces") Reported-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-06-11s390/mm: Allow large pages only for aligned physical addressesAlexander Gordeev1-2/+8
Do not allow creation of large pages against physical addresses, which itself are not aligned on the correct boundary. Failure to do so might lead to referencing wrong memory as result of the way DAT works. Fixes: c98d2ecae08f ("s390/mm: Uncouple physical vs virtual address spaces") Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-04-17s390/boot: Rework deployment of the kernel imageAlexander Gordeev1-5/+2
Rework deployment of kernel image for both compressed and uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel configuration variable. In case CONFIG_KERNEL_UNCOMPRESSED is disabled avoid uncompressing the kernel to a temporary buffer and copying it to the target address. Instead, uncompress it directly to the target destination. In case CONFIG_KERNEL_UNCOMPRESSED is enabled avoid moving the kernel to default 0x100000 location when KASLR is disabled or failed. Instead, use the uncompressed kernel image directly. In case KASLR is disabled or failed .amode31 section location in memory is not randomized and precedes the kernel image. In case CONFIG_KERNEL_UNCOMPRESSED is disabled that location overlaps the area used by the decompression algorithm. That is fine, since that area is not used after the decompression finished and the size of .amode31 section is not expected to exceed BOOT_HEAP_SIZE ever. There is no decompression in case CONFIG_KERNEL_UNCOMPRESSED is enabled. Therefore, rename decompress_kernel() to deploy_kernel(), which better describes both uncompressed and compressed cases. Introduce AMODE31_SIZE macro to avoid immediate value of 0x3000 (the size of .amode31 section) in the decompressor linker script. Modify the vmlinux linker script to force the size of .amode31 section to AMODE31_SIZE (the value of (_eamode31 - _samode31) could otherwise differ as result of compiler options used). Introduce __START_KERNEL macro that defines the kernel ELF image entry point and set it to the currrent value of 0x100000. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17s390/mm: Uncouple physical vs virtual address spacesAlexander Gordeev1-50/+61
The uncoupling physical vs virtual address spaces brings the following benefits to s390: - virtual memory layout flexibility; - closes the address gap between kernel and modules, it caused s390-only problems in the past (e.g. 'perf' bugs); - allows getting rid of trampolines used for module calls into kernel; - allows simplifying BPF trampoline; - minor performance improvement in branch prediction; - kernel randomization entropy is magnitude bigger, as it is derived from the amount of available virtual, not physical memory; The whole change could be described in two pictures below: before and after the change. Some aspects of the virtual memory layout setup are not clarified (number of page levels, alignment, DMA memory), since these are not a part of this change or secondary with regard to how the uncoupling itself is implemented. The focus of the pictures is to explain why __va() and __pa() macros are implemented the way they are. Memory layout in V==R mode: | Physical | Virtual | +- 0 --------------+- 0 --------------+ identity mapping start | | S390_lowcore | Low-address memory | +- 8 KB -----------+ | | | | | identity | phys == virt | | mapping | virt == phys | | | +- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start |.amode31 text/data|.amode31 text/data| +- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt start | | | | | | +- __kaslr_offset, __kaslr_offset_phys| kernel rand. phys/virt start | | | | kernel text/data | kernel text/data | phys == kvirt | | | +------------------+------------------+ kernel phys/virt end | | | | | | | | | | | | +- ident_map_size -+- ident_map_size -+ identity mapping end | | | ... unused gap | | | +---- vmemmap -----+ 'struct page' array start | | | virtually mapped | | memory map | | | +- __abs_lowcore --+ | | | Absolute Lowcore | | | +- __memcpy_real_area | | | Real Memory Copy| | | +- VMALLOC_START --+ vmalloc area start | | | vmalloc area | | | +- MODULES_VADDR --+ modules area start | | | modules area | | | +------------------+ UltraVisor Secure Storage limit | | | ... unused gap | | | +KASAN_SHADOW_START+ KASAN shadow memory start | | | KASAN shadow | | | +------------------+ ASCE limit Memory layout in V!=R mode: | Physical | Virtual | +- 0 --------------+- 0 --------------+ | | S390_lowcore | Low-address memory | +- 8 KB -----------+ | | | | | | | | ... unused gap | | | | +- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start |.amode31 text/data|.amode31 text/data| +- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end (<2GB) | | | | | | +- __kaslr_offset_phys | kernel rand. phys start | | | | kernel text/data | | | | | +------------------+ | kernel phys end | | | | | | | | | | | | +- ident_map_size -+ | | | | ... unused gap | | | +- __identity_base + identity mapping start (>= 2GB) | | | identity | phys == virt - __identity_base | mapping | virt == phys + __identity_base | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | +---- vmemmap -----+ 'struct page' array start | | | virtually mapped | | memory map | | | +- __abs_lowcore --+ | | | Absolute Lowcore | | | +- __memcpy_real_area | | | Real Memory Copy| | | +- VMALLOC_START --+ vmalloc area start | | | vmalloc area | | | +- MODULES_VADDR --+ modules area start | | | modules area | | | +- __kaslr_offset -+ kernel rand. virt start | | | kernel text/data | phys == (kvirt - __kaslr_offset) + | | __kaslr_offset_phys +- kernel .bss end + kernel rand. virt end | | | ... unused gap | | | +------------------+ UltraVisor Secure Storage limit | | | ... unused gap | | | +KASAN_SHADOW_START+ KASAN shadow memory start | | | KASAN shadow | | | +------------------+ ASCE limit Unused gaps in the virtual memory layout could be present or not - depending on how partucular system is configured. No page tables are created for the unused gaps. The relative order of vmalloc, modules and kernel image in virtual memory is defined by following considerations: - start of the modules area and end of the kernel should reside within 4GB to accommodate relative 32-bit jumps. The best way to achieve that is to place kernel next to modules; - vmalloc and module areas should locate next to each other to prevent failures and extra reworks in user level tools (makedumpfile, crash, etc.) which treat vmalloc and module addresses similarily; - kernel needs to be the last area in the virtual memory layout to easily distinguish between kernel and non-kernel virtual addresses. That is needed to (again) simplify handling of addresses in user level tools and make __pa() macro faster (see below); Concluding the above, the relative order of the considered virtual areas in memory is: vmalloc - modules - kernel. Therefore, the only change to the current memory layout is moving kernel to the end of virtual address space. With that approach the implementation of __pa() macro is straightforward - all linear virtual addresses less than kernel base are considered identity mapping: phys == virt - __identity_base All addresses greater than kernel base are kernel ones: phys == (kvirt - __kaslr_offset) + __kaslr_offset_phys By contrast, __va() macro deals only with identity mapping addresses: virt == phys + __identity_base .amode31 section is mapped separately and is not covered by __pa() macro. In fact, it could have been handled easily by checking whether a virtual address is within the section or not, but there is no need for that. Thus, let __pa() code do as little machine cycles as possible. The KASAN shadow memory is located at the very end of the virtual memory layout, at addresses higher than the kernel. However, that is not a linear mapping and no code other than KASAN instrumentation or API is expected to access it. When KASLR mode is enabled the kernel base address randomized within a memory window that spans whole unused virtual address space. The size of that window depends from the amount of physical memory available to the system, the limit imposed by UltraVisor (if present) and the vmalloc area size as provided by vmalloc= kernel command line parameter. In case the virtual memory is exhausted the minimum size of the randomization window is forcefully set to 2GB, which amounts to in 15 bits of entropy if KASAN is enabled or 17 bits of entropy in default configuration. The default kernel offset 0x100000 is used as a magic value both in the decompressor code and vmlinux linker script, but it will be removed with a follow-up change. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-03-07mm/treewide: replace pud_large() with pud_leaf()Peter Xu1-1/+1
pud_large() is always defined as pud_leaf(). Merge their usages. Chose pud_leaf() because pud_leaf() is a global API, while pud_large() is not. Link: https://lkml.kernel.org/r/20240305043750.93762-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-07mm/treewide: replace pmd_large() with pmd_leaf()Peter Xu1-1/+1
pmd_large() is always defined as pmd_leaf(). Merge their usages. Chose pmd_leaf() because pmd_leaf() is a global API, while pmd_large() is not. Link: https://lkml.kernel.org/r/20240305043750.93762-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-06s390/cmma: rework no-dat handlingHeiko Carstens1-0/+17
Rework the way physical pages are set no-dat / dat: The old way is: - Rely on that all pages are initially marked "dat" - Allocate page tables for the kernel mapping - Enable dat - Walk the whole kernel mapping and set PG_arch_1 bit in all struct pages that belong to pages of kernel page tables - Walk all struct pages and test and clear the PG_arch_1 bit. If the bit is not set, set the page state to no-dat - For all subsequent page table allocations, set the page state to dat (remove the no-dat state) on allocation time Change this rather complex logic to a simpler approach: - Set the whole physical memory (all pages) to "no-dat" - Explicitly set those page table pages to "dat" which are part of the kernel image (e.g. swapper_pg_dir) - For all subsequent page table allocations, set the page state to dat (remove the no-dat state) on allocation time In result the code is simpler, and this also allows to get rid of one odd usage of the PG_arch_1 bit. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-11-03Merge tag 's390-6.7-1' of ↵Linus Torvalds1-9/+8
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Vasily Gorbik: - Get rid of private VM_FAULT flags - Add word-at-a-time implementation - Add DCACHE_WORD_ACCESS support - Cleanup control register handling - Disallow CPU hotplug of CPU 0 to simplify its handling complexity, following a similar restriction in x86 - Optimize pai crypto map allocation - Update the list of crypto express EP11 coprocessor operation modes - Fixes and improvements for secure guests AP pass-through - Several fixes to address incorrect page marking for address translation with the "cmma no-dat" feature, preventing potential incorrect guest TLB flushes - Fix early IPI handling - Several virtual vs physical address confusion fixes - Various small fixes and improvements all over the code * tag 's390-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (74 commits) s390/cio: replace deprecated strncpy with strscpy s390/sclp: replace deprecated strncpy with strtomem s390/cio: fix virtual vs physical address confusion s390/cio: export CMG value as decimal s390: delete the unused store_prefix() function s390/cmma: fix handling of swapper_pg_dir and invalid_pg_dir s390/cmma: fix detection of DAT pages s390/sclp: handle default case in sclp memory notifier s390/pai_crypto: remove per-cpu variable assignement in event initialization s390/pai: initialize event count once at initialization s390/pai_crypto: use PERF_ATTACH_TASK define for per task detection s390/mm: add missing arch_set_page_dat() call to gmap allocations s390/mm: add missing arch_set_page_dat() call to vmem_crst_alloc() s390/cmma: fix initial kernel address space page table walk s390/diag: add missing virt_to_phys() translation to diag224() s390/mm,fault: move VM_FAULT_ERROR handling to do_exception() s390/mm,fault: remove VM_FAULT_BADMAP and VM_FAULT_BADACCESS s390/mm,fault: remove VM_FAULT_SIGNAL s390/mm,fault: remove VM_FAULT_BADCONTEXT s390/mm,fault: simplify kfence fault handling ...
2023-10-16s390/vmem: remove unused variableVasily Gorbik1-2/+0
Fix the follow warning reported by sparse: arch/s390/boot/vmem.c:170:15: warning: unused variable ‘entry’ [-Wunused-variable] 170 | pte_t entry; | ^~~~~ Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-10-16s390/kasan: handle DCSS mapping in memory holesVasily Gorbik1-1/+6
When physical memory is defined under z/VM using DEF STOR CONFIG, there may be memory holes that are not hotpluggable memory. In such cases, DCSS mapping could be placed in one of these memory holes. Subsequently, attempting memory access to such DCSS mapping would result in a kasan failure because there is no shadow memory mapping for it. To maintain consistency with cases where DCSS mapping is positioned after the kernel identity mapping, which is then covered by kasan zero shadow mapping, handle the scenario above by populating zero shadow mapping for memory holes where DCSS mapping could potentially be placed. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-09-19s390/ctlreg: add struct ctlregHeiko Carstens1-4/+4
Add struct ctlreg to enforce strict type checking / usage for control register functions. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-09-19s390/ctlreg: use local_ctl_load() and local_ctl_store() where possibleHeiko Carstens1-3/+3
Convert all single control register usages of __local_ctl_load() and __local_ctl_store() to local_ctl_load() and local_ctl_store(). This also requires to change the type of some struct lowcore members from __u64 to unsigned long. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-09-19s390/ctlreg: add local and system prefix to some functionsHeiko Carstens1-3/+3
Add local and system prefix to some functions to clarify they change control register contents on either the local CPU or the on all CPUs. This results in the following API: Two defines which load and save multiple control registers. The defines correlate with the following C prototypes: void __local_ctl_load(unsigned long *, unsigned int cr_low, unsigned int cr_high); void __local_ctl_store(unsigned long *, unsigned int cr_low, unsigned int cr_high); Two functions which locally set or clear one bit for a specified control register: void local_ctl_set_bit(unsigned int cr, unsigned int bit); void local_ctl_clear_bit(unsigned int cr, unsigned int bit); Two functions which set or clear one bit for a specified control register on all CPUs: void system_ctl_set_bit(unsigned int cr, unsigned int bit); void system_ctl_clear_bit(unsigend int cr, unsigned int bit); Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-09-19s390/ctlreg: rename ctl_reg.h to ctlreg.hHeiko Carstens1-1/+1
Rename ctl_reg.h to ctlreg.h so it matches not only ctlreg.c but also other control register related function, union, and structure names, which all come with a ctlreg prefix. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-09-19s390/ctlreg: move control register code to separate fileHeiko Carstens1-0/+1
Control register handling has nothing to do with low level SMP code. Move it to a separate file. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-08-30s390/mm: simplify kernel mapping setupHeiko Carstens1-3/+9
The kernel mapping is setup in two stages: in the decompressor map all pages with RWX permissions, and within the kernel change all mappings to their final permissions, where most of the mappings are changed from RWX to RWNX. Change this and map all pages RWNX from the beginning, however without enabling noexec via control register modification. This means that effectively all pages are used with RWX permissions like before. When the final permissions have been applied to the kernel mapping enable noexec via control register modification. This allows to remove quite a bit of non-obvious code. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-06-20s390/kasan: avoid short by one page shadow memoryAlexander Gordeev1-4/+11
Kernel Address Sanitizer uses 3 bits per byte to encode memory. That is the number of bits the start and end address of a memory range is shifted right when the corresponding shadow memory is created for that memory range. The used memory mapping routine expects page-aligned addresses, while the above described 3-bit shift might turn the shadow memory range start and end boundaries into non-page-aligned in case the size of the original memory range is less than (PAGE_SIZE << 3). As result, the resulting shadow memory range could be short on one page. Align on page boundary the start and end addresses when mapping a shadow memory range and avoid the described issue in the future. Note, that does not fix a real problem, since currently no virtual regions of size less than (PAGE_SIZE << 3) exist. Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-04-13s390/mm: fix direct map accountingHeiko Carstens1-2/+16
Commit bb1520d581a3 ("s390/mm: start kernel with DAT enabled") did not implement direct map accounting in the early page table setup code. In result the reported values are bogus now: $cat /proc/meminfo ... DirectMap4k: 5120 kB DirectMap1M: 18446744073709546496 kB DirectMap2G: 0 kB Fix this by adding the missing accounting. The result looks sane again: $cat /proc/meminfo ... DirectMap4k: 6156 kB DirectMap1M: 2091008 kB DirectMap2G: 6291456 kB Fixes: bb1520d581a3 ("s390/mm: start kernel with DAT enabled") Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-04-13s390/mm: rename POPULATE_ONE2ONE to POPULATE_DIRECTHeiko Carstens1-4/+4
Architectures generally use the "direct map" wording for mapping the whole physical memory. Use that wording as well in arch/s390/boot/vmem.c, instead of "one to one" in order to avoid confusion. This also matches what is already done in arch/s390/mm/vmem.c. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-03-20s390/kasan: move shadow mapping to decompressorVasily Gorbik1-11/+216
Since regular paging structs are initialized in decompressor already move KASAN shadow mapping to decompressor as well. This helps to avoid allocating KASAN required memory in 1 large chunk, de-duplicate paging structs creation code and start the uncompressed kernel with KASAN instrumentation right away. This also allows to avoid all pitfalls accidentally calling KASAN instrumented code during KASAN initialization. Acked-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-03-20s390/boot: rework decompressor reserved trackingVasily Gorbik1-63/+6
Currently several approaches for finding unused memory in decompressor are utilized. While "safe_addr" grows towards higher addresses, vmem code allocates paging structures top down. The former requires careful ordering. In addition to that ipl report handling code verifies potential intersections with secure boot certificates on its own. Neither of two approaches are memory holes aware and consistent with each other in low memory conditions. To solve that, existing approaches are generalized and combined together, as well as online memory ranges are now taken into consideration. physmem_info has been extended to contain reserved memory ranges. New set of functions allow to handle reserves and find unused memory. All reserves and memory allocations are "typed". In case of out of memory condition decompressor fails with detailed info on current reserved ranges and usable online memory. Linux version 6.2.0 ... Kernel command line: ... mem=100M Our of memory allocating 100000 bytes 100000 aligned in range 0:5800000 Reserved memory ranges: 0000000000000000 0000000003e33000 DECOMPRESSOR 0000000003f00000 00000000057648a3 INITRD 00000000063e0000 00000000063e8000 VMEM 00000000063eb000 00000000063f4000 VMEM 00000000063f7800 0000000006400000 VMEM 0000000005800000 0000000006300000 KASAN Usable online memory ranges (info source: sclp read info [3]): 0000000000000000 0000000006400000 Usable online memory total: 6400000 Reserved: 61b10a3 Free: 24ef5d Call Trace: (sp:000000000002bd58 [<0000000000012a70>] physmem_alloc_top_down+0x60/0x14c) sp:000000000002bdc8 [<0000000000013756>] _pa+0x56/0x6a sp:000000000002bdf0 [<0000000000013bcc>] pgtable_populate+0x45c/0x65e sp:000000000002be90 [<00000000000140aa>] setup_vmem+0x2da/0x424 sp:000000000002bec8 [<0000000000011c20>] startup_kernel+0x428/0x8b4 sp:000000000002bf60 [<00000000000100f4>] startup_normal+0xd4/0xd4 physmem_alloc_range allows to find free memory in specified range. It should be used for one time allocations only like finding position for amode31 and vmlinux. physmem_alloc_top_down can be used just like physmem_alloc_range, but it also allows multiple allocations per type and tries to merge sequential allocations together. Which is useful for paging structures allocations. If sequential allocations cannot be merged together they are "chained", allowing easy per type reserved ranges enumeration and migration to memblock later. Extra "struct reserved_range" allocated for chaining are not tracked or reserved but rely on the fact that both physmem_alloc_range and physmem_alloc_top_down search for free memory only below current top down allocator position. All reserved ranges should be transferred to memblock before memblock allocations are enabled. The startup code has been reordered to delay any memory allocations until online memory ranges are detected and occupied memory ranges are marked as reserved to be excluded from follow-up allocations. Ipl report certificates are a special case, ipl report certificates list is checked together with other memory reserves until certificates are saved elsewhere. KASAN required memory for shadow memory allocation and mapping is reserved as 1 large chunk which is later passed to KASAN early initialization code. Acked-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-03-20s390/boot: rename mem_detect to physmem_infoVasily Gorbik1-3/+3
In preparation to extending mem_detect with additional information like reserved ranges rename it to more generic physmem_info. This new naming also help to avoid confusion by using more exact terms like "physmem online ranges", etc. Acked-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-02-14s390/mem_detect: do not truncate online memory ranges infoVasily Gorbik1-1/+1
Commit bf64f0517e5d ("s390/mem_detect: handle online memory limit just once") introduced truncation of mem_detect online ranges based on identity mapping size. For kdump case however the full set of online memory ranges has to be feed into memblock_physmem_add so that crashed system memory could be extracted. Instead of truncating introduce a "usable limit" which is respected by mem_detect api. Also add extra online memory ranges iterator which still provides full set of online memory ranges disregarding the "usable limit". Fixes: bf64f0517e5d ("s390/mem_detect: handle online memory limit just once") Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com> Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-02-06s390/boot: avoid page tables memory in kaslrVasily Gorbik1-0/+7
If kernel is build without KASAN support there is a chance that kernel image is going to be positioned by KASLR code to overlap with identity mapping page tables. When kernel is build with KASAN support enabled memory which is potentially going to be used for page tables and KASAN shadow mapping is accounted for in KASLR with the use of kasan_estimate_memory_needs(). Split this function and introduce vmem_estimate_memory_needs() to cover decompressor's vmem identity mapping page tables. Fixes: bb1520d581a3 ("s390/mm: start kernel with DAT enabled") Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-02-06s390/mem_detect: handle online memory limit just onceVasily Gorbik1-9/+6
Introduce mem_detect_truncate() to cut any online memory ranges above established identity mapping size, so that mem_detect users wouldn't have to do it over and over again. Suggested-by: Alexander Gordeev <agordeev@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-01-31s390/boot: remove pgtable_populate_endVasily Gorbik1-27/+2
setup_vmem() already calls populate for all online memory regions. pgtable_populate_end() could be removed. Also rename pgtable_populate_begin() to pgtable_populate_init(). Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-01-31s390/boot: avoid mapping standby memoryVasily Gorbik1-5/+11
Commit bb1520d581a3 ("s390/mm: start kernel with DAT enabled") doesn't consider online memory holes due to potential memory offlining and erroneously creates pgtables for stand-by memory, which bear RW+X attribute and trigger a warning: RANGE SIZE STATE REMOVABLE BLOCK 0x0000000000000000-0x0000000c3fffffff 49G online yes 0-48 0x0000000c40000000-0x0000000c7fffffff 1G offline 49 0x0000000c80000000-0x0000000fffffffff 14G online yes 50-63 0x0000001000000000-0x00000013ffffffff 16G offline 64-79 s390/mm: Found insecure W+X mapping at address 0xc40000000 WARNING: CPU: 14 PID: 1 at arch/s390/mm/dump_pagetables.c:142 note_page+0x2cc/0x2d8 Map only online memory ranges which fit within identity mapping limit. Fixes: bb1520d581a3 ("s390/mm: start kernel with DAT enabled") Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-01-13s390/mm: allocate Absolute Lowcore Area in decompressorAlexander Gordeev1-0/+6
Move Absolute Lowcore Area allocation to the decompressor. As result, get_abs_lowcore() and put_abs_lowcore() access brackets become really straight and do not require complex execution context analysis and LAP and interrupts tackling. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-01-13s390/mm: allocate Real Memory Copy Area in decompressorAlexander Gordeev1-0/+15
Move Real Memory Copy Area allocation to the decompressor. As result, memcpy_real() and memcpy_real_iter() movers become usable since the very moment the kernel starts. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-01-13s390/boot: allow setup of different virtual address typesAlexander Gordeev1-15/+33
Currently the decompressor sets up only identity mapping. Allow adding more address range types as a prerequisite for allocation of kernel fixed mappings. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-01-13s390/mm: start kernel with DAT enabledAlexander Gordeev1-0/+254
The setup of the kernel virtual address space is spread throughout the sources, boot stages and config options like this: 1. The available physical memory regions are queried and stored as mem_detect information for later use in the decompressor. 2. Based on the physical memory availability the virtual memory layout is established in the decompressor; 3. If CONFIG_KASAN is disabled the kernel paging setup code populates kernel pgtables and turns DAT mode on. It uses the information stored at step [1]. 4. If CONFIG_KASAN is enabled the kernel early boot kasan setup populates kernel pgtables and turns DAT mode on. It uses the information stored at step [1]. The kasan setup creates early_pg_dir directory and directly overwrites swapper_pg_dir entries to make shadow memory pages available. Move the kernel virtual memory setup to the decompressor and start the kernel with DAT turned on right from the very first istruction. That completely eliminates the boot phase when the kernel runs in DAT-off mode, simplies the overall design and consolidates pgtables setup. The identity mapping is created in the decompressor, while kasan shadow mappings are still created by the early boot kernel code. Share with decompressor the existing kasan memory allocator. It decreases the size of a newly requested memory block from pgalloc_pos and ensures that kernel image is not overwritten. pgalloc_low and pgalloc_pos pointers are made preserved boot variables for that. Use the bootdata infrastructure to setup swapper_pg_dir and invalid_pg_dir directories used by the kernel later. The interim early_pg_dir directory established by the kasan initialization code gets eliminated as result. As the kernel runs in DAT-on mode only the PSW_KERNEL_BITS define gets PSW_MASK_DAT bit by default. Additionally, the setup_lowcore_dat_off() and setup_lowcore_dat_on() routines get merged, since there is no DAT-off mode stage anymore. The memory mappings are created with RW+X protection that allows the early boot code setting up all necessary data and services for the kernel being booted. Just before the paging is enabled the memory protection is changed to RO+X for text, RO+NX for read-only data and RW+NX for kernel data and the identity mapping. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>