summaryrefslogtreecommitdiff
path: root/arch/x86/mm/init.c
AgeCommit message (Collapse)AuthorFilesLines
2023-05-17x86/mm: Avoid incomplete Global INVLPG flushesDave Hansen1-0/+25
The INVLPG instruction is used to invalidate TLB entries for a specified virtual address. When PCIDs are enabled, INVLPG is supposed to invalidate TLB entries for the specified address for both the current PCID *and* Global entries. (Note: Only kernel mappings set Global=1.) Unfortunately, some INVLPG implementations can leave Global translations unflushed when PCIDs are enabled. As a workaround, never enable PCIDs on affected processors. I expect there to eventually be microcode mitigations to replace this software workaround. However, the exact version numbers where that will happen are not known today. Once the version numbers are set in stone, the processor list can be tweaked to only disable PCIDs on affected processors with affected microcode. Note: if anyone wants a quick fix that doesn't require patching, just stick 'nopcid' on your kernel command-line. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
2023-04-28Merge tag 'x86_mm_for_6.4' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 LAM (Linear Address Masking) support from Dave Hansen: "Add support for the new Linear Address Masking CPU feature. This is similar to ARM's Top Byte Ignore and allows userspace to store metadata in some bits of pointers without masking it out before use" * tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA selftests/x86/lam: Add test cases for LAM vs thread creation selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking selftests/x86/lam: Add inherit test cases for linear-address masking selftests/x86/lam: Add io_uring test cases for linear-address masking selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking x86/mm/iommu/sva: Make LAM and SVA mutually exclusive iommu/sva: Replace pasid_valid() helper with mm_valid_pasid() mm: Expose untagging mask in /proc/$PID/status x86/mm: Provide arch_prctl() interface for LAM x86/mm: Reduce untagged_addr() overhead for systems without LAM x86/uaccess: Provide untagged_addr() and remove tags before address check mm: Introduce untagged_addr_remote() x86/mm: Handle LAM on context switch x86: CPUID and CR3/CR4 flags for Linear Address Masking x86: Allow atomic MM_CONTEXT flags setting x86/mm: Rework address range check in get_user() and put_user()
2023-03-16x86/uaccess: Provide untagged_addr() and remove tags before address checkKirill A. Shutemov1-0/+5
untagged_addr() is a helper used by the core-mm to strip tag bits and get the address to the canonical shape based on rules of the current thread. It only handles userspace addresses. The untagging mask is stored in per-CPU variable and set on context switching to the task. The tags must not be included into check whether it's okay to access the userspace address. Strip tags in access_ok(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-7-kirill.shutemov%40linux.intel.com
2023-03-06x86/paravirt: Merge activate_mm() and dup_mmap() callbacksJuergen Gross1-1/+1
The two paravirt callbacks .mmu.activate_mm() and .mmu.dup_mmap() are sharing the same implementations in all cases: for Xen PV guests they are pinning the PGD of the new mm_struct, and for all other cases they are a NOP. In the end, both callbacks are meant to register an address space with the underlying hypervisor, so there needs to be only a single callback for that purpose. So merge them to a common callback .mmu.enter_mmap() (in contrast to the corresponding already existing .mmu.exit_mmap()). As the first parameter of the old callbacks isn't used, drop it from the replacement one. [ bp: Remove last occurrence of paravirt_activate_mm() in asm/mmu_context.h ] Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Link: https://lore.kernel.org/r/20230207075902.7539-1-jgross@suse.com
2023-01-12x86/mm: fix poking_init() for Xen PV guestsJuergen Gross1-0/+4
Commit 3f4c8211d982 ("x86/mm: Use mm_alloc() in poking_init()") broke the kernel for running as Xen PV guest. It seems as if the new address space is never activated before being used, resulting in Xen rejecting to accept the new CR3 value (the PGD isn't pinned). Fix that by adding the now missing call of paravirt_arch_dup_mmap() to poking_init(). That call was previously done by dup_mm()->dup_mmap() and it is a NOP for all cases but for Xen PV, where it is just doing the pinning of the PGD. Fixes: 3f4c8211d982 ("x86/mm: Use mm_alloc() in poking_init()") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230109150922.10578-1-jgross@suse.com
2022-12-15x86/mm: Use mm_alloc() in poking_init()Peter Zijlstra1-1/+1
Instead of duplicating init_mm, allocate a fresh mm. The advantage is that mm_alloc() has much simpler dependencies. Additionally it makes more conceptual sense, init_mm has no (and must not have) user state to duplicate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201057.816175235@infradead.org
2022-09-27mm/swap: cache maximum swapfile size when init swapPeter Xu1-1/+1
We used to have swapfile_maximum_size() fetching a maximum value of swapfile size per-arch. As the caller of max_swapfile_size() grows, this patch introduce a variable "swapfile_maximum_size" and cache the value of old max_swapfile_size(), so that we don't need to calculate the value every time. Caching the value in swapfile_init() is safe because when reaching the phase we should have initialized all the relevant information. Here the major arch to take care of is x86, which defines the max swapfile size based on L1TF mitigation. Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly when reaching swapfile_init(). As a reference, the code path looks like this for x86: - start_kernel - setup_arch - early_cpu_init - early_identify_cpu --> setup X86_BUG_L1TF - parse_early_param - l1tf_cmdline --> set l1tf_mitigation - check_bugs - l1tf_select_mitigation --> set l1tf_mitigation - arch_call_rest_init - rest_init - kernel_init - kernel_init_freeable - do_basic_setup - do_initcalls --> calls swapfile_init() (initcall level 4) The swapfile size only depends on swp pte format on non-x86 archs, so caching it is safe too. Since at it, rename max_swapfile_size() to arch_max_swapfile_size() because arch can define its own function, so it's more straightforward to have "arch_" as its prefix. At the meantime, export swapfile_maximum_size to replace the old usages of max_swapfile_size(). [peterx@redhat.com: declare arch_max_swapfile_size) in swapfile.h] Link: https://lkml.kernel.org/r/YxTh1GuC6ro5fKL5@xz-m1.local Link: https://lkml.kernel.org/r/20220811161331.37055-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-01Merge tag 'x86_cleanups_for_v6.0_rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanup from Borislav Petkov: - A single CONFIG_ symbol correction in a comment * tag 'x86_cleanups_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Refer to the intended config STRICT_DEVMEM in a comment
2022-07-13x86/pat: Fix x86_has_pat_wp()Juergen Gross1-2/+12
x86_has_pat_wp() is using a wrong test, as it relies on the normal PAT configuration used by the kernel. In case the PAT MSR has been setup by another entity (e.g. Xen hypervisor) it might return false even if the PAT configuration is allowing WP mappings. This due to the fact that when running as Xen PV guest the PAT MSR is setup by the hypervisor and cannot be changed by the guest. This results in the WP related entry to be at a different position when running as Xen PV guest compared to the bare metal or fully virtualized case. The correct way to test for WP support is: 1. Get the PTE protection bits needed to select WP mode by reading __cachemode2pte_tbl[_PAGE_CACHE_MODE_WP] (depending on the PAT MSR setting this might return protection bits for a stronger mode, e.g. UC-) 2. Translate those bits back into the real cache mode selected by those PTE bits by reading __pte2cachemode_tbl[__pte2cm_idx(prot)] 3. Test for the cache mode to be _PAGE_CACHE_MODE_WP Fixes: f88a68facd9a ("x86/mm: Extend early_memremap() support with additional attrs") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> # 4.14 Link: https://lore.kernel.org/r/20220503132207.17234-1-jgross@suse.com
2022-07-07x86/mm: Refer to the intended config STRICT_DEVMEM in a commentLukas Bulwahn1-1/+1
Commit a4866aa81251 ("mm: Tighten x86 /dev/mem with zeroing reads") adds a comment to the function devmem_is_allowed() referring to a non-existing config STRICT_IOMEM, whereas the comment very likely intended to refer to the config STRICT_DEVMEM, as the commit adds some behavior for the config STRICT_DEVMEM. Most of the initial analysis was actually done by Dave Hansen in the email thread below (see Link). Refer to the intended and existing config STRICT_DEVMEM. Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/f9074e8d-9314-9d7d-7bf5-5b5538c8be8d@intel.com/ Link: https://lkml.kernel.org/r/20220707115442.21107-1-lukas.bulwahn@gmail.com
2022-03-25mm/migration: add trace events for base page and HugeTLB migrationsAnshuman Khandual1-1/+0
This adds two trace events for base page and HugeTLB page migrations. These events, closely follow the implementation details like setting and removing of PTE migration entries, which are essential operations for migration. The new CREATE_TRACE_POINTS in <mm/rmap.c> covers both <events/migration.h> and <events/tlb.h> based trace events. Hence drop redundant CREATE_TRACE_POINTS from other places which could have otherwise conflicted during build. Link: https://lkml.kernel.org/r/1643368182-9588-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-04x86/realmode: Add comment for Global bit usage in trampoline_pgdJoerg Roedel1-0/+5
Document the fact that using the trampoline_pgd will result in the creation of global TLB entries in the user range of the address space. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211202153226.22946-2-joro@8bytes.org
2021-11-06memblock: rename memblock_free to memblock_phys_freeMike Rapoport1-1/+1
Since memblock_free() operates on a physical range, make its name reflect it and rename it to memblock_phys_free(), so it will be a logical counterpart to memblock_phys_alloc(). The callers are updated with the below semantic patch: @@ expression addr; expression size; @@ - memblock_free(addr, size); + memblock_phys_free(addr, size); Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Juergen Gross <jgross@suse.com> Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03memblock: make memblock_find_in_range method privateMike Rapoport1-8/+15
There are a lot of uses of memblock_find_in_range() along with memblock_reserve() from the times memblock allocation APIs did not exist. memblock_find_in_range() is the very core of memblock allocations, so any future changes to its internal behaviour would mandate updates of all the users outside memblock. Replace the calls to memblock_find_in_range() with an equivalent calls to memblock_phys_alloc() and memblock_phys_alloc_range() and make memblock_find_in_range() private method of memblock. This simplifies the callers, ensures that (unlikely) errors in memblock_reserve() are handled and improves maintainability of memblock_find_in_range(). Link: https://lkml.kernel.org/r/20210816122622.30279-1-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Kirill A. Shutemov <kirill.shtuemov@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ACPI] Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Nick Kossifidis <mick@ics.forth.gr> [riscv] Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-29Merge tag 'x86-mm-2021-04-29' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 tlb updates from Ingo Molnar: "The x86 MM changes in this cycle were: - Implement concurrent TLB flushes, which overlaps the local TLB flush with the remote TLB flush. In testing this improved sysbench performance measurably by a couple of percentage points, especially if TLB-heavy security mitigations are active. - Further micro-optimizations to improve the performance of TLB flushes" * tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp: Micro-optimize smp_call_function_many_cond() smp: Inline on_each_cpu_cond() and on_each_cpu() x86/mm/tlb: Remove unnecessary uses of the inline keyword cpumask: Mark functions as pure x86/mm/tlb: Do not make is_lazy dirty for no reason x86/mm/tlb: Privatize cpu_tlbstate x86/mm/tlb: Flush remote and local TLBs concurrently x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote() smp: Run functions concurrently in smp_call_function_many_cond()
2021-03-22x86: Fix various typos in comments, take #2Ingo Molnar1-1/+1
Fix another ~42 single-word typos in arch/x86/ code comments, missed a few in the first pass, in particular in .S files. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org
2021-03-18x86: Fix various typos in commentsIngo Molnar1-2/+2
Fix ~144 single-word typos in arch/x86/ code comments. Doing this in a single commit should reduce the churn. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org
2021-03-06x86/mm/tlb: Privatize cpu_tlbstateNadav Amit1-1/+1
cpu_tlbstate is mostly private and only the variable is_lazy is shared. This causes some false-sharing when TLB flushes are performed. Break cpu_tlbstate intro cpu_tlbstate and cpu_tlbstate_shared, and mark each one accordingly. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-6-namit@vmware.com
2021-01-04x86/mm: Increase pgt_buf size for 5-level page tablesLorenzo Stoakes1-5/+14
pgt_buf is used to allocate page tables on initial direct page mapping which bootstraps the kernel into being able to allocate these before the direct mapping makes further pages available. INIT_PGD_PAGE_COUNT is set to 6 pages (doubled for KASLR) - 3 (PUD, PMD, PTE) for the 1 MiB ISA mapping and 3 more for the first direct mapping assignment in each case providing 2 MiB of address space. This has not been updated for 5-level page tables which has an additional P4D page table level above PUD. In most instances, this will not have a material impact as the first 4 page levels allocated for the ISA mapping will provide sufficient address space to encompass all further address mappings. If the first direct mapping is within 512 GiB of the ISA mapping, only a PMD and PTE needs to be added in the instance the kernel is using 4 KiB page tables (e.g. CONFIG_DEBUG_PAGEALLOC is enabled) and only a PMD if the kernel can use 2 MiB pages (the first allocation is limited to PMD_SIZE so a GiB page cannot be used there). However, if the machine has more than 512 GiB of RAM and the kernel is allocating 4 KiB page size, 3 further page tables are required. If the machine has more than 256 TiB of RAM at 4 KiB or 2 MiB page size, further 3 or 4 page tables are required respectively. Update INIT_PGD_PAGE_COUNT to reflect this. [ bp: Sanitize text into passive voice without ambiguous personal pronouns. ] Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lkml.kernel.org/r/20201215205641.34096-1-lstoakes@gmail.com
2020-11-20x86/mm: Declare 'start' variable where it is usedLukas Bulwahn1-2/+4
It is not required to initialize the local variable start in memory_map_top_down(), as the variable will be initialized in any path before it is used. make clang-analyzer on x86_64 tinyconfig reports: arch/x86/mm/init.c:612:15: warning: Although the value stored to 'start' \ is used in the enclosing expression, the value is never actually read \ from 'start' [clang-analyzer-deadcode.DeadStores] Move the variable declaration into the loop, where it is used. No code changed: # arch/x86/mm/init.o: text data bss dec hex filename 7105 1424 26768 35297 89e1 init.o.before 7105 1424 26768 35297 89e1 init.o.after md5: a8d76c1bb5fce9cae251780a7ee7730f init.o.before.asm a8d76c1bb5fce9cae251780a7ee7730f init.o.after.asm [ bp: Massage. ] Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20200928100004.25674-1-lukas.bulwahn@gmail.com
2020-06-17x86/mm: Fix -Wmissing-prototypes warnings for arch/x86/mm/init.cBenjamin Thiel1-2/+1
Fix -Wmissing-prototypes warnings: arch/x86/mm/init.c:81:6: warning: no previous prototype for ‘x86_has_pat_wp’ [-Wmissing-prototypes] bool x86_has_pat_wp(void) arch/x86/mm/init.c:86:22: warning: no previous prototype for ‘pgprot2cachemode’ [-Wmissing-prototypes] enum page_cache_mode pgprot2cachemode(pgprot_t pgprot) by including the respective header containing prototypes. Also fix: arch/x86/mm/init.c:893:13: warning: no previous prototype for ‘mem_encrypt_free_decrypted_mem’ [-Wmissing-prototypes] void __weak mem_encrypt_free_decrypted_mem(void) { } by making it static inline for the !CONFIG_AMD_MEM_ENCRYPT case. This warning happens when CONFIG_AMD_MEM_ENCRYPT is not enabled (defconfig for example): ./arch/x86/include/asm/mem_encrypt.h:80:27: warning: inline function ‘mem_encrypt_free_decrypted_mem’ declared weak [-Wattributes] static inline void __weak mem_encrypt_free_decrypted_mem(void) { } ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ It's ok to convert to static inline because the function is used only in x86. Is not shared with other architectures so drop the __weak too. [ bp: Massage and adjust __weak comments while at it. ] Signed-off-by: Benjamin Thiel <b.thiel@posteo.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200606122629.2720-1-b.thiel@posteo.de
2020-06-09x86/mm: simplify init_trampoline() and surrounding logicMike Rapoport1-0/+22
There are three cases for the trampoline initialization: * 32-bit does nothing * 64-bit with kaslr disabled simply copies a PGD entry from the direct map to the trampoline PGD * 64-bit with kaslr enabled maps the real mode trampoline at PUD level These cases are currently differentiated by a bunch of ifdefs inside asm/include/pgtable.h and the case of 64-bits with kaslr on uses pgd_index() helper. Replacing the ifdefs with a static function in arch/x86/mm/init.c gives clearer code and allows moving pgd_index() to the generic implementation in include/linux/pgtable.h [rppt@linux.ibm.com: take CONFIG_RANDOMIZE_MEMORY into account in kaslr_enabled()] Link: http://lkml.kernel.org/r/20200525104045.GB13212@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200514170327.31389-8-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05Merge tag 'x86-mm-2020-06-05' of ↵Linus Torvalds1-5/+39
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "Misc changes: - Unexport various PAT primitives - Unexport per-CPU tlbstate and uninline TLB helpers" * tag 'x86-mm-2020-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/tlb/uv: Add a forward declaration for struct flush_tlb_info x86/cpu: Export native_write_cr4() only when CONFIG_LKTDM=m x86/tlb: Restrict access to tlbstate xen/privcmd: Remove unneeded asm/tlb.h include x86/tlb: Move PCID helpers where they are used x86/tlb: Uninline nmi_uaccess_okay() x86/tlb: Move cr4_set_bits_and_update_boot() to the usage site x86/tlb: Move paravirt_tlb_remove_table() to the usage site x86/tlb: Move __flush_tlb_all() out of line x86/tlb: Move flush_tlb_others() out of line x86/tlb: Move __flush_tlb_one_kernel() out of line x86/tlb: Move __flush_tlb_one_user() out of line x86/tlb: Move __flush_tlb_global() out of line x86/tlb: Move __flush_tlb() out of line x86/alternatives: Move temporary_mm helpers into C x86/cr4: Sanitize CR4.PCE update x86/cpu: Uninline CR4 accessors x86/tlb: Uninline __get_current_cr3_fast() x86/mm: Use pgprotval_t in protval_4k_2_large() and protval_large_2_4k() x86/mm: Unexport __cachemode2pte_tbl ...
2020-06-04mm: use free_area_init() instead of free_area_init_nodes()Mike Rapoport1-1/+1
free_area_init() has effectively became a wrapper for free_area_init_nodes() and there is no point of keeping it. Still free_area_init() name is shorter and more general as it does not imply necessity to initialize multiple nodes. Rename free_area_init_nodes() to free_area_init(), update the callers and drop old version of free_area_init(). Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200412194859.12663-6-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-23x86/mm: Stop printing BRK addressesArvind Sankar1-2/+0
This currently leaks kernel physical addresses into userspace. Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lkml.kernel.org/r/20200229231120.1147527-1-nivedita@alum.mit.edu
2020-04-26x86/tlb: Restrict access to tlbstateThomas Gleixner1-1/+0
Hide tlbstate, flush_tlb_info and related helpers when tlbflush.h is included from a module. Modules have absolutely no business with these internals. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200421092600.328438734@linutronix.de
2020-04-26x86/tlb: Move cr4_set_bits_and_update_boot() to the usage siteThomas Gleixner1-0/+13
No point in having this exposed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200421092559.940978251@linutronix.de
2020-04-23x86/mm: Unexport __cachemode2pte_tblChristoph Hellwig1-2/+9
Exporting the raw data for a table is generally a bad idea. Move cachemode2protval() out of line given that it isn't really used in the fast path, and then mark __cachemode2pte_tbl static. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200408152745.1565832-5-hch@lst.de
2020-04-20x86/mm: Move pgprot2cachemode out of lineChristoph Hellwig1-2/+11
This helper is only used by x86 low-level MM code. Also remove the entirely pointless __pte2cachemode_tbl export as that symbol can be marked static now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200408152745.1565832-3-hch@lst.de
2020-04-20x86/mm: Add a x86_has_pat_wp() helperChristoph Hellwig1-0/+6
Abstract the ioremap code away from the caching mode internals. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200408152745.1565832-2-hch@lst.de
2020-04-11x86/mm: thread pgprot_t through init_memory_mapping()Logan Gunthorpe1-4/+5
In preparation to support a pgprot_t argument for arch_add_memory(). It's required to move the prototype of init_memory_mapping() seeing the original location came before the definition of pgprot_t. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Badger <ebadger@gigaio.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/20200306170846.9333-4-logang@deltatee.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-04x86/mm: Report which part of kernel image is freedKees Cook1-4/+4
The memory freeing report wasn't very useful for figuring out which parts of the kernel image were being freed. Add the details for clearer reporting in dmesg. Before: Freeing unused kernel image memory: 1348K Write protecting the kernel read-only data: 20480k Freeing unused kernel image memory: 2040K Freeing unused kernel image memory: 172K After: Freeing unused kernel image (initmem) memory: 1348K Write protecting the kernel read-only data: 20480k Freeing unused kernel image (text/rodata gap) memory: 2040K Freeing unused kernel image (rodata/data gap) memory: 172K Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: linux-ia64@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: x86-ml <x86@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20191029211351.13243-28-keescook@chromium.org
2019-04-30x86/alternatives: Initialize temporary mm for patchingNadav Amit1-0/+37
To prevent improper use of the PTEs that are used for text patching, the next patches will use a temporary mm struct. Initailize it by copying the init mm. The address that will be used for patching is taken from the lower area that is usually used for the task memory. Doing so prevents the need to frequently synchronize the temporary-mm (e.g., when BPF programs are installed), since different PGDs are used for the task memory. Finally, randomize the address of the PTEs to harden against exploits that use these PTEs. Suggested-by: Andy Lutomirski <luto@kernel.org> Tested-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: ard.biesheuvel@linaro.org Cc: deneen.t.dock@intel.com Cc: kernel-hardening@lists.openwall.com Cc: kristen@linux.intel.com Cc: linux_dti@icloud.com Cc: will.deacon@arm.com Link: https://lkml.kernel.org/r/20190426232303.28381-8-nadav.amit@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-24x86/mm: Fix a crash with kmemleak_scan()Qian Cai1-0/+6
The first kmemleak_scan() call after boot would trigger the crash below because this callpath: kernel_init free_initmem mem_encrypt_free_decrypted_mem free_init_pages unmaps memory inside the .bss when DEBUG_PAGEALLOC=y. kmemleak_init() will register the .data/.bss sections and then kmemleak_scan() will scan those addresses and dereference them looking for pointer references. If free_init_pages() frees and unmaps pages in those sections, kmemleak_scan() will crash if referencing one of those addresses: BUG: unable to handle kernel paging request at ffffffffbd402000 CPU: 12 PID: 325 Comm: kmemleak Not tainted 5.1.0-rc4+ #4 RIP: 0010:scan_block Call Trace: scan_gray_list kmemleak_scan kmemleak_scan_thread kthread ret_from_fork Since kmemleak_free_part() is tolerant to unknown objects (not tracked by kmemleak), it is fine to call it from free_init_pages() even if not all address ranges passed to this function are known to kmemleak. [ bp: Massage. ] Fixes: b3f0907c71e0 ("x86/mm: Add .bss..decrypted section to hold shared variables") Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190423165811.36699-1-cai@lca.pw
2018-12-28mm: make free_reserved_area() return "const char *"Alexey Dobriyan1-1/+1
and propagate through down the call stack. Link: http://lkml.kernel.org/r/20181124091411.GC10969@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-11x86/speculation/l1tf: Drop the swap storage limit restriction when l1tf=offMichal Hocko1-1/+1
Swap storage is restricted to max_swapfile_size (~16TB on x86_64) whenever the system is deemed affected by L1TF vulnerability. Even though the limit is quite high for most deployments it seems to be too restrictive for deployments which are willing to live with the mitigation disabled. We have a customer to deploy 8x 6,4TB PCIe/NVMe SSD swap devices which is clearly out of the limit. Drop the swap restriction when l1tf=off is specified. It also doesn't make much sense to warn about too much memory for the l1tf mitigation when it is forcefully disabled by the administrator. [ tglx: Folded the documentation delta change ] Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: <linux-mm@kvack.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181113184910.26697-1-mhocko@kernel.org
2018-10-31mm: remove include/linux/bootmem.hMike Rapoport1-1/+0
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-15x86/mm: Add .bss..decrypted section to hold shared variablesBrijesh Singh1-0/+4
kvmclock defines few static variables which are shared with the hypervisor during the kvmclock initialization. When SEV is active, memory is encrypted with a guest-specific key, and if the guest OS wants to share the memory region with the hypervisor then it must clear the C-bit before sharing it. Currently, we use kernel_physical_mapping_init() to split large pages before clearing the C-bit on shared pages. But it fails when called from the kvmclock initialization (mainly because the memblock allocator is not ready that early during boot). Add a __bss_decrypted section attribute which can be used when defining such shared variable. The so-defined variables will be placed in the .bss..decrypted section. This section will be mapped with C=0 early during boot. The .bss..decrypted section has a big chunk of memory that may be unused when memory encryption is not active, free it when memory encryption is not active. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Radim Krčmář<rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/1536932759-12905-2-git-send-email-brijesh.singh@amd.com
2018-08-26Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: - Correct the L1TF fallout on 32bit and the off by one in the 'too much RAM for protection' calculation. - Add a helpful kernel message for the 'too much RAM' case - Unbreak the VDSO in case that the compiler desides to use indirect jumps/calls and emits retpolines which cannot be resolved because the kernel uses its own thunks, which does not work for the VDSO. Make it use the builtin thunks. - Re-export start_thread() which was unexported when the 32/64bit implementation was unified. start_thread() is required by modular binfmt handlers. - Trivial cleanups * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation/l1tf: Suggest what to do on systems with too much RAM x86/speculation/l1tf: Fix off-by-one error when warning that system has too much RAM x86/kvm/vmx: Remove duplicate l1d flush definitions x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit x86/process: Re-export start_thread() x86/mce: Add notifier_block forward declaration x86/vdso: Fix vDSO build if a retpoline is emitted
2018-08-24x86/speculation/l1tf: Fix off-by-one error when warning that system has too ↵Vlastimil Babka1-1/+1
much RAM Two users have reported [1] that they have an "extremely unlikely" system with more than MAX_PA/2 memory and L1TF mitigation is not effective. In fact it's a CPU with 36bits phys limit (64GB) and 32GB memory, but due to holes in the e820 map, the main region is almost 500MB over the 32GB limit: [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000081effffff] usable Suggestions to use 'mem=32G' to enable the L1TF mitigation while losing the 500MB revealed, that there's an off-by-one error in the check in l1tf_select_mitigation(). l1tf_pfn_limit() returns the last usable pfn (inclusive) and the range check in the mitigation path does not take this into account. Instead of amending the range check, make l1tf_pfn_limit() return the first PFN which is over the limit which is less error prone. Adjust the other users accordingly. [1] https://bugzilla.suse.com/show_bug.cgi?id=1105536 Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf") Reported-by: George Anchev <studio@anchev.net> Reported-by: Christopher Snowhill <kode54@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180823134418.17008-1-vbabka@suse.cz
2018-08-20x86/xen: enable early use of set_fixmap in 32-bit Xen PV guestJuergen Gross1-5/+12
Commit 7b25b9cb0dad83 ("x86/xen/time: Initialize pv xen time in init_hypervisor_platform()") moved the mapping of the shared info area before pagetable_init(). This breaks booting as 32-bit PV guest as the use of set_fixmap isn't possible at this time on 32-bit. This can be worked around by populating the needed PMD on 32-bit kernel earlier. In order not to reimplement populate_extra_pte() using extend_brk() for allocating new page tables extend alloc_low_pages() to do that in case the early page table pool is not yet available. Fixes: 7b25b9cb0dad83 ("x86/xen/time: Initialize pv xen time in init_hypervisor_platform()") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2018-08-20x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bitVlastimil Babka1-2/+2
On 32bit PAE kernels on 64bit hardware with enough physical bits, l1tf_pfn_limit() will overflow unsigned long. This in turn affects max_swapfile_size() and can lead to swapon returning -EINVAL. This has been observed in a 32bit guest with 42 bits physical address size, where max_swapfile_size() overflows exactly to 1 << 32, thus zero, and produces the following warning to dmesg: [ 6.396845] Truncating oversized swap area, only using 0k out of 2047996k Fix this by using unsigned long long instead. Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf") Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Reported-by: Dominique Leuenberger <dimstar@suse.de> Reported-by: Adrian Schroeter <adrian@suse.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180820095835.5298-1-vbabka@suse.cz
2018-08-14x86/init: fix build with CONFIG_SWAP=nVlastimil Babka1-0/+2
The introduction of generic_max_swapfile_size and arch-specific versions has broken linking on x86 with CONFIG_SWAP=n due to undefined reference to 'generic_max_swapfile_size'. Fix it by compiling the x86-specific max_swapfile_size() only with CONFIG_SWAP=y. Reported-by: Tomas Pruzina <pruzinat@gmail.com> Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-14Merge branch 'l1tf-final' of ↵Linus Torvalds1-0/+23
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Merge L1 Terminal Fault fixes from Thomas Gleixner: "L1TF, aka L1 Terminal Fault, is yet another speculative hardware engineering trainwreck. It's a hardware vulnerability which allows unprivileged speculative access to data which is available in the Level 1 Data Cache when the page table entry controlling the virtual address, which is used for the access, has the Present bit cleared or other reserved bits set. If an instruction accesses a virtual address for which the relevant page table entry (PTE) has the Present bit cleared or other reserved bits set, then speculative execution ignores the invalid PTE and loads the referenced data if it is present in the Level 1 Data Cache, as if the page referenced by the address bits in the PTE was still present and accessible. While this is a purely speculative mechanism and the instruction will raise a page fault when it is retired eventually, the pure act of loading the data and making it available to other speculative instructions opens up the opportunity for side channel attacks to unprivileged malicious code, similar to the Meltdown attack. While Meltdown breaks the user space to kernel space protection, L1TF allows to attack any physical memory address in the system and the attack works across all protection domains. It allows an attack of SGX and also works from inside virtual machines because the speculation bypasses the extended page table (EPT) protection mechanism. The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646 The mitigations provided by this pull request include: - Host side protection by inverting the upper address bits of a non present page table entry so the entry points to uncacheable memory. - Hypervisor protection by flushing L1 Data Cache on VMENTER. - SMT (HyperThreading) control knobs, which allow to 'turn off' SMT by offlining the sibling CPU threads. The knobs are available on the kernel command line and at runtime via sysfs - Control knobs for the hypervisor mitigation, related to L1D flush and SMT control. The knobs are available on the kernel command line and at runtime via sysfs - Extensive documentation about L1TF including various degrees of mitigations. Thanks to all people who have contributed to this in various ways - patches, review, testing, backporting - and the fruitful, sometimes heated, but at the end constructive discussions. There is work in progress to provide other forms of mitigations, which might be less horrible performance wise for a particular kind of workloads, but this is not yet ready for consumption due to their complexity and limitations" * 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits) x86/microcode: Allow late microcode loading with SMT disabled tools headers: Synchronise x86 cpufeatures.h for L1TF additions x86/mm/kmmio: Make the tracer robust against L1TF x86/mm/pat: Make set_memory_np() L1TF safe x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert x86/speculation/l1tf: Invert all not present mappings cpu/hotplug: Fix SMT supported evaluation KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry x86/speculation: Simplify sysfs report of VMX L1TF vulnerability Documentation/l1tf: Remove Yonah processors from not vulnerable list x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr() x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d x86: Don't include linux/irq.h from asm/hardirq.h x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d x86/irq: Demote irq_cpustat_t::__softirq_pending to u16 x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush() x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond' x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush() cpu/hotplug: detect SMT disabled by BIOS ...
2018-08-06x86/mm/init: Remove freed kernel image areas from alias mappingDave Hansen1-2/+24
The kernel image is mapped into two places in the virtual address space (addresses without KASLR, of course): 1. The kernel direct map (0xffff880000000000) 2. The "high kernel map" (0xffffffff81000000) We actually execute out of #2. If we get the address of a kernel symbol, it points to #2, but almost all physical-to-virtual translations point to Parts of the "high kernel map" alias are mapped in the userspace page tables with the Global bit for performance reasons. The parts that we map to userspace do not (er, should not) have secrets. When PTI is enabled then the global bit is usually not set in the high mapping and just used to compensate for poor performance on systems which lack PCID. This is fine, except that some areas in the kernel image that are adjacent to the non-secret-containing areas are unused holes. We free these holes back into the normal page allocator and reuse them as normal kernel memory. The memory will, of course, get *used* via the normal map, but the alias mapping is kept. This otherwise unused alias mapping of the holes will, by default keep the Global bit, be mapped out to userspace, and be vulnerable to Meltdown. Remove the alias mapping of these pages entirely. This is likely to fracture the 2M page mapping the kernel image near these areas, but this should affect a minority of the area. The pageattr code changes *all* aliases mapping the physical pages that it operates on (by default). We only want to modify a single alias, so we need to tweak its behavior. This unmapping behavior is currently dependent on PTI being in place. Going forward, we should at least consider doing this for all configurations. Having an extra read-write alias for memory is not exactly ideal for debugging things like random memory corruption and this does undercut features like DEBUG_PAGEALLOC or future work like eXclusive Page Frame Ownership (XPFO). Before this patch: current_kernel:---[ High Kernel Mapping ]--- current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW NX pte current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE NX pmd current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd current_user:---[ High Kernel Mapping ]--- current_user-0xffffffff80000000-0xffffffff81000000 16M pmd current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd After this patch: current_kernel:---[ High Kernel Mapping ]--- current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K pte current_kernel-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd current_kernel-0xffffffff82400000-0xffffffff82488000 544K ro NX pte current_kernel-0xffffffff82488000-0xffffffff82600000 1504K pte current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd current_kernel-0xffffffff82c00000-0xffffffff82c0d000 52K RW NX pte current_kernel-0xffffffff82c0d000-0xffffffff82dc0000 1740K pte current_user:---[ High Kernel Mapping ]--- current_user-0xffffffff80000000-0xffffffff81000000 16M pmd current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_user-0xffffffff81e11000-0xffffffff82000000 1980K pte current_user-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd current_user-0xffffffff82400000-0xffffffff82488000 544K ro NX pte current_user-0xffffffff82488000-0xffffffff82600000 1504K pte current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd [ tglx: Do not unmap on 32bit as there is only one mapping ] Fixes: 0f561fce4d69 ("x86/pti: Enable global pages for shared areas") Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Kees Cook <keescook@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Joerg Roedel <jroedel@suse.de> Link: https://lkml.kernel.org/r/20180802225831.5F6A2BFC@viggo.jf.intel.com
2018-08-05x86/mm/init: Add helper for freeing kernel image pagesDave Hansen1-3/+12
When chunks of the kernel image are freed, free_init_pages() is used directly. Consolidate the three sites that do this. Also update the string to give an incrementally better description of that memory versus what was there before. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: keescook@google.com Cc: aarcange@redhat.com Cc: jgross@suse.com Cc: jpoimboe@redhat.com Cc: gregkh@linuxfoundation.org Cc: peterz@infradead.org Cc: hughd@google.com Cc: torvalds@linux-foundation.org Cc: bp@alien8.de Cc: luto@kernel.org Cc: ak@linux.intel.com Cc: Kees Cook <keescook@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20180802225829.FE0E32EA@viggo.jf.intel.com
2018-06-27x86/speculation/l1tf: Protect PAE swap entries against L1TFVlastimil Babka1-1/+1
The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for offset. The lower word is zeroed, thus systems with less than 4GB memory are safe. With 4GB to 128GB the swap type selects the memory locations vulnerable to L1TF; with even more memory, also the swap offfset influences the address. This might be a problem with 32bit PAE guests running on large 64bit hosts. By continuing to keep the whole swap entry in either high or low 32bit word of PTE we would limit the swap size too much. Thus this patch uses the whole PAE PTE with the same layout as the 64bit version does. The macros just become a bit tricky since they assume the arch-dependent swp_entry_t to be 32bit. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com>
2018-06-21x86/speculation/l1tf: Extend 64bit swap file size limitVlastimil Babka1-1/+9
The previous patch has limited swap file size so that large offsets cannot clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation. It assumed that offsets are encoded starting with bit 12, same as pfn. But on x86_64, offsets are encoded starting with bit 9. Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA and 256TB with 46bit MAX_PA. Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-06-20x86/speculation/l1tf: Limit swap file size to MAX_PA/2Andi Kleen1-0/+15
For the L1TF workaround its necessary to limit the swap file size to below MAX_PA/2, so that the higher bits of the swap offset inverted never point to valid memory. Add a mechanism for the architecture to override the swap file size check in swapfile.c and add a x86 specific max swapfile check function that enforces that limit. The check is only enabled if the CPU is vulnerable to L1TF. In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system with 46bit PA it is 32TB. The limit is only per individual swap file, so it's always possible to exceed these limits with multiple swap files or partitions. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
2018-06-15mm: fix devmem_is_allowed() for sub-page System RAM intersectionsDan Williams1-1/+3
Hussam reports: I was poking around and for no real reason, I did cat /dev/mem and strings /dev/mem. Then I saw the following warning in dmesg. I saved it and rebooted immediately. memremap attempted on mixed range 0x000000000009c000 size: 0x1000 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 11810 at kernel/memremap.c:98 memremap+0x104/0x170 [..] Call Trace: xlate_dev_mem_ptr+0x25/0x40 read_mem+0x89/0x1a0 __vfs_read+0x36/0x170 The memremap() implementation checks for attempts to remap System RAM with MEMREMAP_WB and instead redirects those mapping attempts to the linear map. However, that only works if the physical address range being remapped is page aligned. In low memory we have situations like the following: 00000000-00000fff : Reserved 00001000-0009fbff : System RAM 0009fc00-0009ffff : Reserved ...where System RAM intersects Reserved ranges on a sub-page page granularity. Given that devmem_is_allowed() special cases any attempt to map System RAM in the first 1MB of memory, replace page_is_ram() with the more precise region_intersects() to trap attempts to map disallowed ranges. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199999 Link: http://lkml.kernel.org/r/152856436164.18127.2847888121707136898.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 92281dee825f ("arch: introduce memremap()") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Hussam Al-Tayeb <me@hussam.eu.org> Tested-by: Hussam Al-Tayeb <me@hussam.eu.org> Cc: Christoph Hellwig <hch@lst.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>