summaryrefslogtreecommitdiff
path: root/arch/powerpc/mm
AgeCommit message (Collapse)AuthorFilesLines
2023-04-29Merge tag 'powerpc-6.4-1' of ↵Linus Torvalds1-14/+7
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add support for building the kernel using PC-relative addressing on Power10. - Allow HV KVM guests on Power10 to use prefixed instructions. - Unify support for the P2020 CPU (85xx) into a single machine description. - Always build the 64-bit kernel with 128-bit long double. - Drop support for several obsolete 2000's era development boards as identified by Paul Gortmaker. - A series fixing VFIO on Power since some generic changes. - Various other small features and fixes. Thanks to Alexey Kardashevskiy, Andrew Donnellan, Benjamin Gray, Bo Liu, Christophe Leroy, Dan Carpenter, David Binderman, Ira Weiny, Joel Stanley, Kajol Jain, Kautuk Consul, Liang He, Luis Chamberlain, Masahiro Yamada, Michael Neuling, Nathan Chancellor, Nathan Lynch, Nicholas Miehlbradt, Nicholas Piggin, Nick Desaulniers, Nysal Jan K.A, Pali Rohár, Paul Gortmaker, Paul Mackerras, Petr Vaněk, Randy Dunlap, Rob Herring, Sachin Sant, Sean Christopherson, Segher Boessenkool, and Timothy Pearson. * tag 'powerpc-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (156 commits) powerpc/64s: Disable pcrel code model on Clang powerpc: Fix merge conflict between pcrel and copy_thread changes powerpc/configs/powernv: Add IGB=y powerpc/configs/64s: Drop JFS Filesystem powerpc/configs/64s: Use EXT4 to mount EXT2 filesystems powerpc/configs: Make pseries_defconfig an alias for ppc64le_guest powerpc/configs: Make pseries_le an alias for ppc64le_guest powerpc/configs: Incorporate generic kvm_guest.config into guest configs powerpc/configs: Add IBMVETH=y and IBMVNIC=y to guest configs powerpc/configs/64s: Enable Device Mapper options powerpc/configs/64s: Enable PSTORE powerpc/configs/64s: Enable VLAN support powerpc/configs/64s: Enable BLK_DEV_NVME powerpc/configs/64s: Drop REISERFS powerpc/configs/64s: Use SHA512 for module signatures powerpc/configs/64s: Enable IO_STRICT_DEVMEM powerpc/configs/64s: Enable SCHEDSTATS powerpc/configs/64s: Enable DEBUG_VM & other options powerpc/configs/64s: Enable EMULATED_STATS powerpc/configs/64s: Enable KUNIT and most tests ...
2023-04-28Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds4-4/+41
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-06powerc/mm: try VMA lock-based page fault handling firstLaurent Dufour1-0/+37
Attempt VMA lock-based page fault handling first, and fall back to the existing mmap_lock-based handling if that fails. Copied from "x86/mm: try VMA lock-based page fault handling first" [ldufour@linux.ibm.com: powerpc/mm: fix mmap_lock bad unlock] Link: https://lkml.kernel.org/r/20230306154244.17560-1-ldufour@linux.ibm.com Link: https://lore.kernel.org/linux-mm/842502FB-F99C-417C-9648-A37D0ECDC9CE@linux.ibm.com Link: https://lkml.kernel.org/r/20230227173632.3292573-32-surenb@google.com Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Sachin Sant <sachinp@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-06mm, treewide: redefine MAX_ORDER sanelyKirill A. Shutemov2-2/+2
MAX_ORDER currently defined as number of orders page allocator supports: user can ask buddy allocator for page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and lead to number of bugs all over the kernel. Change the definition of MAX_ORDER to be inclusive: the range of orders user can ask from buddy allocator is 0..MAX_ORDER now. [kirill@shutemov.name: fix min() warning] Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box [akpm@linux-foundation.org: fix another min_t warning] [kirill@shutemov.name: fixups per Zi Yan] Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name [akpm@linux-foundation.org: fix underlining in docs] Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/ Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-04powerpc: Use of_address_to_resource()Rob Herring1-14/+7
Replace open coded reading of "reg" or of_get_address()/ of_translate_address() calls with a single call to of_address_to_resource(). Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230329220337.141295-1-robh@kernel.org
2023-04-04powerpc/papr_scm: Update the NUMA distance table for the target nodeAneesh Kumar K.V1-0/+1
Platform device helper routines won't update the NUMA distance table while creating a platform device, even if the device is present on a NUMA node that doesn't have memory or CPU. This is especially true for pmem devices. If the target node of the pmem device is not online, we find the nearest online node to the device and associate the pmem device with that online node. To find the nearest online node, we should have the numa distance table updated correctly. Update the distance information during the device probe. For a papr scm device on NUMA node 3 distance_lookup_table value for distance_ref_points_depth = 2 before and after fix is below: Before fix: node 3 distance depth 0 - 0 node 3 distance depth 1 - 0 node 4 distance depth 0 - 4 node 4 distance depth 1 - 2 node 5 distance depth 0 - 5 node 5 distance depth 1 - 1 After fix node 3 distance depth 0 - 3 node 3 distance depth 1 - 1 node 4 distance depth 0 - 4 node 4 distance depth 1 - 2 node 5 distance depth 0 - 5 node 5 distance depth 1 - 1 Without the fix, the nearest numa node to the pmem device (NUMA node 3) will be picked as 4. After the fix, we get the correct numa node which is 5. Fixes: da1115fdbd6e ("powerpc/nvdimm: Pick nearby online node if the device node is not online") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230404041433.1781804-1-aneesh.kumar@linux.ibm.com
2023-03-29lazy tlb: introduce lazy tlb mm refcount helper functionsNicholas Piggin1-2/+2
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This makes the lazy tlb mm references more obvious, and allows the refcounting scheme to be modified in later changes. There is no functional change with this patch. Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-15powerpc/mm: Fix false detection of read faultsRussell Currey1-3/+8
To support detection of read faults with Radix execute-only memory, the vma_is_accessible() check in access_error() (which checks for PROT_NONE) was replaced with a check to see if VM_READ was missing, and if so, returns true to assert the fault was caused by a bad read. This is incorrect, as it ignores that both VM_WRITE and VM_EXEC imply read on powerpc, as defined in protection_map[]. This causes mappings containing VM_WRITE or VM_EXEC without VM_READ to misreport the cause of page faults, since the MMU is still allowing reads. Correct this by restoring the original vma_is_accessible() check for PROT_NONE mappings, and adding a separate check for Radix PROT_EXEC-only mappings. Fixes: 395cac7752b9 ("powerpc/mm: Support execute-only memory on the Radix MMU") Reported-by: Michal Suchánek <msuchanek@suse.de> Link: https://lore.kernel.org/r/20230308152702.GR19419@kitsune.suse.cz Tested-by: Benjamin Gray <bgray@linux.ibm.com> Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230310050834.63105-1-ruscur@russell.cc
2023-02-25Merge tag 'powerpc-6.3-1' of ↵Linus Torvalds5-40/+44
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Support for configuring secure boot with user-defined keys on PowerVM LPARs - Simplify the replay of soft-masked IRQs by making it non-recursive - Add support for KCSAN on 64-bit Book3S - Improvements to the API & code which interacts with RTAS (pseries firmware) - Change 32-bit powermac to assign PCI bus numbers per domain by default - Some improvements to the 32-bit BPF JIT - Various other small features and fixes Thanks to Anders Roxell, Andrew Donnellan, Andrew Jeffery, Benjamin Gray, Christophe Leroy, Frederic Barrat, Ganesh Goudar, Geoff Levand, Greg Kroah-Hartman, Jan-Benedict Glaw, Josh Poimboeuf, Kajol Jain, Laurent Dufour, Mahesh Salgaonkar, Mathieu Desnoyers, Mimi Zohar, Murphy Zhou, Nathan Chancellor, Nathan Lynch, Nayna Jain, Nicholas Piggin, Pali Rohár, Petr Mladek, Rohan McLure, Russell Currey, Sachin Sant, Sathvika Vasireddy, Sourabh Jain, Stefan Berger, Stephen Rothwell, and Sudhakar Kuppusamy. * tag 'powerpc-6.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (114 commits) powerpc/pseries: Avoid hcall in plpks_is_available() on non-pseries powerpc: dts: turris1x.dts: Set lower priority for CPLD syscon-reboot powerpc/e500: Add missing prototype for 'relocate_init' powerpc/64: Fix unannotated intra-function call warning powerpc/epapr: Don't use wrteei on non booke powerpc: Pass correct CPU reference to assembler powerpc/mm: Rearrange if-else block to avoid clang warning powerpc/nohash: Fix build with llvm-as powerpc/nohash: Fix build error with binutils >= 2.38 powerpc/pseries: Fix endianness issue when parsing PLPKS secvar flags macintosh: windfarm: Use unsigned type for 1-bit bitfields powerpc/kexec_file: print error string on usable memory property update failure powerpc/machdep: warn when machine_is() used too early powerpc/64: Replace -mcpu=e500mc64 by -mcpu=e5500 powerpc/eeh: Set channel state after notifying the drivers selftests/powerpc: Fix incorrect kernel headers search path powerpc/rtas: arch-wide function token lookup conversions powerpc/rtas: introduce rtas_function_token() API powerpc/pseries/lpar: convert to papr_sysparm API powerpc/pseries/hv-24x7: convert to papr_sysparm API ...
2023-02-24Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-20powerpc/e500: Add missing prototype for 'relocate_init'Christophe Leroy1-0/+1
Kernel test robot reports: arch/powerpc/mm/nohash/e500.c:314:21: warning: no previous prototype for 'relocate_init' [-Wmissing-prototypes] 314 | notrace void __init relocate_init(u64 dt_ptr, phys_addr_t start) | ^~~~~~~~~~~~~ Add it in mm/mmu_decl.h, close to associated is_second_reloc variable declaration. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/oe-kbuild-all/202302181136.wgyCKUcs-lkp@intel.com/ Link: https://lore.kernel.org/r/ac9107acf24135e1a07e8f84d2090572d43e3fe4.1676712510.git.christophe.leroy@csgroup.eu
2023-02-16powerpc/mm: Rearrange if-else block to avoid clang warningAnders Roxell1-7/+4
Clang warns: arch/powerpc/mm/book3s64/radix_tlb.c:1191:23: error: variable 'hstart' is uninitialized when used here __tlbiel_va_range(hstart, hend, pid, ^~~~~~ arch/powerpc/mm/book3s64/radix_tlb.c:1191:31: error: variable 'hend' is uninitialized when used here __tlbiel_va_range(hstart, hend, pid, ^~~~ Rework the 'if (IS_ENABLE(CONFIG_TRANSPARENT_HUGEPAGE))' so hstart/hend is always initialized to silence the warnings. That will also simplify the 'else' path. Clang is getting confused with these warnings, but the warnings is a false-positive. Suggested-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220810114318.3220630-1-anders.roxell@linaro.org
2023-02-16powerpc/nohash: Fix build with llvm-asMichael Ellerman1-1/+1
When using the LLVM integrated assembler (llvm-as), the book3e build fails with: arch/powerpc/mm/nohash/tlb_low_64e.S:354:2: error: invalid instruction tlbilxva 0,%r15 ^ tlbilxva is an extended mnemonic for tlbilx, but llvm-as also doesn't support tlbilx, despite it being an e500mc instruction. Fix it by using the existing PPC_TLBILX_VA macro. The resulting binary is identical when building with binutils. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230216112915.1681631-1-mpe@ellerman.id.au
2023-02-16powerpc/nohash: Fix build error with binutils >= 2.38Michael Ellerman1-1/+4
With bintils >= 2.38 the ppc64_book3e_allmodconfig build fails: {standard input}: Assembler messages: {standard input}:196: Error: unrecognized opcode: `lbarx' {standard input}:196: Error: unrecognized opcode: `stbcx.' make[5]: *** [scripts/Makefile.build:252: arch/powerpc/mm/nohash/e500_hugetlbpage.o] Error 1 That happens because the default CPU for that config is e5500, set via CONFIG_TARGET_CPU, and so the assembler is building for e5500, which doesn't support those instructions. Fix it by using machine directives to tell the assembler to assemble the relevant code for e6500, which does support lbarx/stbcx. That is safe because the code already has the CPU_FTR_SMT check, which ensures the lbarx sequence doesn't run on e5500, which doesn't support SMT. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230213112322.998003-1-mpe@ellerman.id.au
2023-02-12Merge branch 'fixes' into nextMichael Ellerman2-1/+25
Merge our fixes branch to bring in some changes that conflict with upcoming next content.
2023-02-10powerpc/64s: Fix stress_hpt memblock alloc alignmentNicholas Piggin1-1/+2
The stress_hpt memblock allocation did not pass in an alignment, which causes a stack dump in early boot (that I missed, oops). Fixes: 6b34a099faa1 ("powerpc/64s/hash: add stress_hpt kernel boot option to increase hash faults") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221216115930.2667772-2-npiggin@gmail.com
2023-02-10mm: replace vma->vm_flags direct modifications with modifier callsSuren Baghdasaryan1-1/+1
Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-08powerpc/64s/radix: Remove TLB_FLUSH_ALL test from range flushesNicholas Piggin1-11/+10
This looks like it came across from x86, but x86 uses TLB_FLUSH_ALL as a parameter to internal functions. Powerpc never sets it anywhere. Remove the associated logic and leave a warning for now. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230203111718.1149852-4-npiggin@gmail.com
2023-02-08powerpc/64s/radix: mm->context.id should always be validNicholas Piggin1-18/+21
The MMU_NO_CONTEXT checks are an unnecessary complication. Make these warn to prepare to remove them in future. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230203111718.1149852-3-npiggin@gmail.com
2023-02-08powerpc/64s/radix: Remove need_flush_all test from radix__tlb_flushNicholas Piggin1-1/+1
need_flush_all is only set by arch code to instruct generic tlb_flush to flush all. It is never set by powerpc, so it can be removed. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230203111718.1149852-2-npiggin@gmail.com
2023-01-31powerpc/64s/radix: Fix RWX mapping with relocated kernelMichael Ellerman1-0/+13
If a relocatable kernel is loaded at a non-zero address and told not to relocate to zero (kdump or RELOCATABLE_TEST), the mapping of the interrupt code at zero is left with RWX permissions. That is a security weakness, and leads to a warning at boot if CONFIG_DEBUG_WX is enabled: powerpc/mm: Found insecure W+X mapping at address 00000000056435bc/0xc000000000000000 WARNING: CPU: 1 PID: 1 at arch/powerpc/mm/ptdump/ptdump.c:193 note_page+0x484/0x4c0 CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc1-00001-g8ae8e98aea82-dirty #175 Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,git-dd0dca hv:linux,kvm pSeries NIP: c0000000004a1c34 LR: c0000000004a1c30 CTR: 0000000000000000 REGS: c000000003503770 TRAP: 0700 Not tainted (6.2.0-rc1-00001-g8ae8e98aea82-dirty) MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 24000220 XER: 00000000 CFAR: c000000000545a58 IRQMASK: 0 ... NIP note_page+0x484/0x4c0 LR note_page+0x480/0x4c0 Call Trace: note_page+0x480/0x4c0 (unreliable) ptdump_pmd_entry+0xc8/0x100 walk_pgd_range+0x618/0xab0 walk_page_range_novma+0x74/0xc0 ptdump_walk_pgd+0x98/0x170 ptdump_check_wx+0x94/0x100 mark_rodata_ro+0x30/0x70 kernel_init+0x78/0x1a0 ret_from_kernel_thread+0x5c/0x64 The fix has two parts. Firstly the pages from zero up to the end of interrupts need to be marked read-only, so that they are left with R-X permissions. Secondly the mapping logic needs to be taught to ensure there is a page boundary at the end of the interrupt region, so that the permission change only applies to the interrupt text, and not the region following it. Fixes: c55d7b5e6426 ("powerpc: Remove STRICT_KERNEL_RWX incompatibility with RELOCATABLE") Reported-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230110124753.1325426-2-mpe@ellerman.id.au
2023-01-31powerpc/64s/radix: Fix crash with unaligned relocated kernelMichael Ellerman1-0/+11
If a relocatable kernel is loaded at an address that is not 2MB aligned and told not to relocate to zero, the kernel can crash due to mark_rodata_ro() incorrectly changing some read-write data to read-only. Scenarios where the misalignment can occur are when the kernel is loaded by kdump or using the RELOCATABLE_TEST config option. Example crash with the kernel loaded at 5MB: Run /sbin/init as init process BUG: Unable to handle kernel data access on write at 0xc000000000452000 Faulting instruction address: 0xc0000000005b6730 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries CPU: 1 PID: 1 Comm: init Not tainted 6.2.0-rc1-00011-g349188be4841 #166 Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,git-5b4c5a hv:linux,kvm pSeries NIP: c0000000005b6730 LR: c000000000ae9ab8 CTR: 0000000000000380 REGS: c000000004503250 TRAP: 0300 Not tainted (6.2.0-rc1-00011-g349188be4841) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44288480 XER: 00000000 CFAR: c0000000005b66ec DAR: c000000000452000 DSISR: 0a000000 IRQMASK: 0 ... NIP memset+0x68/0x104 LR zero_user_segments.constprop.0+0xa8/0xf0 Call Trace: ext4_mpage_readpages+0x7f8/0x830 ext4_readahead+0x48/0x60 read_pages+0xb8/0x380 page_cache_ra_unbounded+0x19c/0x250 filemap_fault+0x58c/0xae0 __do_fault+0x60/0x100 __handle_mm_fault+0x1230/0x1a40 handle_mm_fault+0x120/0x300 ___do_page_fault+0x20c/0xa80 do_page_fault+0x30/0xc0 data_access_common_virt+0x210/0x220 This happens because mark_rodata_ro() tries to change permissions on the range _stext..__end_rodata, but _stext sits in the middle of the 2MB page from 4MB to 6MB: radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec) radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages radix-mmu: Mapped 0x0000000000400000-0x0000000002400000 with 2.00 MiB pages (exec) The logic that changes the permissions assumes the linear mapping was split correctly at boot, so it marks the entire 2MB page read-only. That leads to the write fault above. To fix it, the boot time mapping logic needs to consider that if the kernel is running at a non-zero address then _stext is a boundary where it must split the mapping. That leads to the mapping being split correctly, allowing the rodata permission change to take happen correctly, with no spillover: radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec) radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages radix-mmu: Mapped 0x0000000000400000-0x0000000000500000 with 64.0 KiB pages radix-mmu: Mapped 0x0000000000500000-0x0000000000600000 with 64.0 KiB pages (exec) radix-mmu: Mapped 0x0000000000600000-0x0000000002400000 with 2.00 MiB pages (exec) If the kernel is loaded at a 2MB aligned address, the mapping continues to use 2MB pages as before: radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec) radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages radix-mmu: Mapped 0x0000000000400000-0x0000000002c00000 with 2.00 MiB pages (exec) radix-mmu: Mapped 0x0000000002c00000-0x0000000100000000 with 2.00 MiB pages Fixes: c55d7b5e6426 ("powerpc: Remove STRICT_KERNEL_RWX incompatibility with RELOCATABLE") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230110124753.1325426-1-mpe@ellerman.id.au
2023-01-12powerpc/64s/hash: Make stress_hpt_timer_fn() staticYang Yingliang1-1/+1
stress_hpt_timer_fn() is only used in hash_utils.c, make it static. Fixes: 6b34a099faa1 ("powerpc/64s/hash: add stress_hpt kernel boot option to increase hash faults") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221228093603.3166599-1-yangyingliang@huawei.com
2022-12-19Merge tag 'powerpc-6.2-1' of ↵Linus Torvalds8-5/+165
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add powerpc qspinlock implementation optimised for large system scalability and paravirt. See the merge message for more details - Enable objtool to be built on powerpc to generate mcount locations - Use a temporary mm for code patching with the Radix MMU, so the writable mapping is restricted to the patching CPU - Add an option to build the 64-bit big-endian kernel with the ELFv2 ABI - Sanitise user registers on interrupt entry on 64-bit Book3S - Many other small features and fixes Thanks to Aboorva Devarajan, Angel Iglesias, Benjamin Gray, Bjorn Helgaas, Bo Liu, Chen Lifu, Christoph Hellwig, Christophe JAILLET, Christophe Leroy, Christopher M. Riedl, Colin Ian King, Deming Wang, Disha Goel, Dmitry Torokhov, Finn Thain, Geert Uytterhoeven, Gustavo A. R. Silva, Haowen Bai, Joel Stanley, Jordan Niethe, Julia Lawall, Kajol Jain, Laurent Dufour, Li zeming, Miaoqian Lin, Michael Jeanson, Nathan Lynch, Naveen N. Rao, Nayna Jain, Nicholas Miehlbradt, Nicholas Piggin, Pali Rohár, Randy Dunlap, Rohan McLure, Russell Currey, Sathvika Vasireddy, Shaomin Deng, Stephen Kitt, Stephen Rothwell, Thomas Weißschuh, Tiezhu Yang, Uwe Kleine-König, Xie Shaowen, Xiu Jianfeng, XueBing Chen, Yang Yingliang, Zhang Jiaming, ruanjinjie, Jessica Yu, and Wolfram Sang. * tag 'powerpc-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (181 commits) powerpc/code-patching: Fix oops with DEBUG_VM enabled powerpc/qspinlock: Fix 32-bit build powerpc/prom: Fix 32-bit build powerpc/rtas: mandate RTAS syscall filtering powerpc/rtas: define pr_fmt and convert printk call sites powerpc/rtas: clean up includes powerpc/rtas: clean up rtas_error_log_max initialization powerpc/pseries/eeh: use correct API for error log size powerpc/rtas: avoid scheduling in rtas_os_term() powerpc/rtas: avoid device tree lookups in rtas_os_term() powerpc/rtasd: use correct OF API for event scan rate powerpc/rtas: document rtas_call() powerpc/pseries: unregister VPA when hot unplugging a CPU powerpc/pseries: reset the RCU watchdogs after a LPM powerpc: Take in account addition CPU node when building kexec FDT powerpc: export the CPU node count powerpc/cpuidle: Set CPUIDLE_FLAG_POLLING for snooze state powerpc/dts/fsl: Fix pca954x i2c-mux node names cxl: Remove unnecessary cxl_pci_window_alignment() selftests/powerpc: Fix resource leaks ...
2022-12-02powerpc/code-patching: Remove protection against patching init addresses ↵Christophe Leroy1-1/+0
after init Once init section is freed, attempting to patch init code ends up in the weed. Commit 51c3c62b58b3 ("powerpc: Avoid code patching freed init sections") protected patch_instruction() against that, but it is the responsibility of the caller to ensure that the patched memory is valid. All callers have now been verified and fixed so the check can be removed. This improves ftrace activation by about 2% on 8xx. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/504310828f473d424e2ed229eff57bf075f52796.1669969781.git.christophe.leroy@csgroup.eu
2022-12-02powerpc/64s/hash: add stress_hpt kernel boot option to increase hash faultsNicholas Piggin4-1/+155
This option increases the number of hash misses by limiting the number of kernel HPT entries, by keeping a per-CPU record of the last kernel HPTEs installed, and removing that from the hash table on the next hash insertion. A timer round-robins CPUs removing remaining kernel HPTEs and clearing the TLB (in the case of bare metal) to increase and slightly randomise kernel fault activity. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Add comment about NR_CPUS usage, fixup whitespace] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221024030150.852517-1-npiggin@gmail.com
2022-11-30powerpc/tlb: Add local flush for page given mm_struct and psizeBenjamin Gray1-0/+8
Adds a local TLB flush operation that works given an mm_struct, VA to flush, and page size representation. Most implementations mirror the surrounding code. The book3s/32/tlbflush.h implementation is left as a BUILD_BUG because it is more complicated and not required for anything as yet. This removes the need to create a vm_area_struct, which the temporary patching mm work does not need. Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221109045112.187069-8-bgray@linux.ibm.com
2022-11-30powerpc/book3e: remove #include <generated/utsrelease.h>Thomas Weißschuh1-1/+0
Commit 7ad4bd887d27 ("powerpc/book3e: get rid of #include <generated/compile.h>") removed the usage of the define UTS_RELEASE but forgot to drop the include. utsrelease.h is potentially generated on each build. By removing the unused include we can get rid of some spurious recompilations. Fixes: 7ad4bd887d27 ("powerpc/book3e: get rid of #include <generated/compile.h>") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [mpe: Fix typo in change log and add more explanation] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126051002.123199-2-linux@weissschuh.net
2022-11-30Merge branch 'fixes' into nextMichael Ellerman3-17/+70
Merge our fixes branch to bring in some changes that are prerequisites for work in next.
2022-11-24powerpc: Remove find_current_mm_pte()Christophe Leroy1-2/+2
Last usage of find_current_mm_pte() was removed by commit 15759cb054ef ("powerpc/perf/callchain: Use __get_user_pages_fast in read_user_stack_slow") Remove it. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/ec79f462a3bfa8365b7df505e574d5d85246bc68.1646818177.git.christophe.leroy@csgroup.eu
2022-11-09hugetlb: simplify hugetlb handling in follow_page_maskMike Kravetz1-37/+0
During discussions of this series [1], it was suggested that hugetlb handling code in follow_page_mask could be simplified. At the beginning of follow_page_mask, there currently is a call to follow_huge_addr which 'may' handle hugetlb pages. ia64 is the only architecture which provides a follow_huge_addr routine that does not return error. Instead, at each level of the page table a check is made for a hugetlb entry. If a hugetlb entry is found, a call to a routine associated with that entry is made. Currently, there are two checks for hugetlb entries at each page table level. The first check is of the form: if (p?d_huge()) page = follow_huge_p?d(); the second check is of the form: if (is_hugepd()) page = follow_huge_pd(). We can replace these checks, as well as the special handling routines such as follow_huge_p?d() and follow_huge_pd() with a single routine to handle hugetlb vmas. A new routine hugetlb_follow_page_mask is called for hugetlb vmas at the beginning of follow_page_mask. hugetlb_follow_page_mask will use the existing routine huge_pte_offset to walk page tables looking for hugetlb entries. huge_pte_offset can be overwritten by architectures, and already handles special cases such as hugepd entries. [1] https://lore.kernel.org/linux-mm/cover.1661240170.git.baolin.wang@linux.alibaba.com/ [mike.kravetz@oracle.com: remove vma (pmd sharing) per Peter] Link: https://lkml.kernel.org/r/20221028181108.119432-1-mike.kravetz@oracle.com [mike.kravetz@oracle.com: remove left over hugetlb_vma_unlock_read()] Link: https://lkml.kernel.org/r/20221030225825.40872-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20220919021348.22151-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-18powerpc/64s: Fix hash__change_memory_range preemption warningNicholas Piggin1-3/+5
stop_machine_cpuslocked takes a mutex so it must be called in a preemptible context, so it can't simply be fixed by disabling preemption. This is not a bug, because CPU hotplug is locked, so this processor will call in to the stop machine function. So raw_smp_processor_id() could be used. This leaves a small chance that this thread will be migrated to another CPU, so the master work would be done by a CPU from a different context. Better for test coverage to make that a common case by just having the first CPU to call in become the master. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221013151647.1857994-2-npiggin@gmail.com
2022-10-18powerpc/64s: make linear_map_hash_lock a raw spinlockNicholas Piggin1-6/+6
This lock is taken while the raw kfence_freelist_lock is held, so it must also be a raw spinlock, as reported by lockdep when raw lock nesting checking is enabled. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221013230710.1987253-3-npiggin@gmail.com
2022-10-18powerpc/64s: make HPTE lock and native_tlbie_lock irq-safeNicholas Piggin1-2/+25
With kfence enabled, there are several cases where HPTE and TLBIE locks are called from softirq context, for example: WARNING: inconsistent lock state 6.0.0-11845-g0cbbc95b12ac #1 Tainted: G N -------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. swapper/0/1 [HC0[0]:SC0[0]:HE1:SE1] takes: c000000002734de8 (native_tlbie_lock){+.?.}-{2:2}, at: .native_hpte_updateboltedpp+0x1a4/0x600 {IN-SOFTIRQ-W} state was registered at: .lock_acquire+0x20c/0x520 ._raw_spin_lock+0x4c/0x70 .native_hpte_invalidate+0x62c/0x840 .hash__kernel_map_pages+0x450/0x640 .kfence_protect+0x58/0xc0 .kfence_guarded_free+0x374/0x5a0 .__slab_free+0x3d0/0x630 .put_cred_rcu+0xcc/0x120 .rcu_core+0x3c4/0x14e0 .__do_softirq+0x1dc/0x7dc .do_softirq_own_stack+0x40/0x60 Fix this by consistently disabling irqs while taking either of these locks. Don't just disable bh because several of the more common cases already disable irqs, so this just makes the locks always irq-safe. Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221013230710.1987253-2-npiggin@gmail.com
2022-10-18powerpc/64s: Add lockdep for HPTE lockNicholas Piggin1-7/+35
Add lockdep annotation for the HPTE bit-spinlock. Modern systems don't take the tlbie lock, so this shows up some of the same lockdep warnings that were being reported by the ppc970. And they're not taken in exactly the same places so this is nice to have in its own right. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221013230710.1987253-1-npiggin@gmail.com
2022-10-11Merge tag 'mm-stable-2022-10-08' of ↵Linus Torvalds2-16/+8
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
2022-10-10Merge tag 'powerpc-6.1-1' of ↵Linus Torvalds20-160/+128
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Remove our now never-true definitions for pgd_huge() and p4d_leaf(). - Add pte_needs_flush() and huge_pmd_needs_flush() for 64-bit. - Add support for syscall wrappers. - Add support for KFENCE on 64-bit. - Update 64-bit HV KVM to use the new guest state entry/exit accounting API. - Support execute-only memory when using the Radix MMU (P9 or later). - Implement CONFIG_PARAVIRT_TIME_ACCOUNTING for pseries guests. - Updates to our linker script to move more data into read-only sections. - Allow the VDSO to be randomised on 32-bit. - Many other small features and fixes. Thanks to Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira Rajeev, Christophe Leroy, David Hildenbrand, Disha Goel, Fabiano Rosas, Gaosheng Cui, Gustavo A. R. Silva, Haren Myneni, Hari Bathini, Jilin Yuan, Joel Stanley, Kajol Jain, Kees Cook, Krzysztof Kozlowski, Laurent Dufour, Liang He, Li Huafei, Lukas Bulwahn, Madhavan Srinivasan, Nathan Chancellor, Nathan Lynch, Nicholas Miehlbradt, Nicholas Piggin, Pali Rohár, Rohan McLure, Russell Currey, Sachin Sant, Segher Boessenkool, Shrikanth Hegde, Tyrel Datwyler, Wolfram Sang, ye xingchen, and Zheng Yongjun. * tag 'powerpc-6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits) KVM: PPC: Book3S HV: Fix stack frame regs marker powerpc: Don't add __powerpc_ prefix to syscall entry points powerpc/64s/interrupt: Fix stack frame regs marker powerpc/64: Fix msr_check_and_set/clear MSR[EE] race powerpc/64s/interrupt: Change must-hard-mask interrupt check from BUG to WARN powerpc/pseries: Add firmware details to the hardware description powerpc/powernv: Add opal details to the hardware description powerpc: Add device-tree model to the hardware description powerpc/64: Add logical PVR to the hardware description powerpc: Add PVR & CPU name to hardware description powerpc: Add hardware description string powerpc/configs: Enable PPC_UV in powernv_defconfig powerpc/configs: Update config files for removed/renamed symbols powerpc/mm: Fix UBSAN warning reported on hugetlb powerpc/mm: Always update max/min_low_pfn in mem_topology_setup() powerpc/mm/book3s/hash: Rename flush_tlb_pmd_range powerpc: Drops STABS_DEBUG from linker scripts powerpc/64s: Remove lost/old comment powerpc/64s: Remove old STAB comment powerpc: remove orphan systbl_chk.sh ...
2022-09-30powerpc/mm: Fix UBSAN warning reported on hugetlbAneesh Kumar K.V1-3/+3
Powerpc architecture supports 16GB hugetlb pages with hash translation. For 4K page size, this is implemented as a hugepage directory entry at PGD level and for 64K it is implemented as a huge page pte at PUD level With 16GB hugetlb size, offset within a page is greater than 32 bits. Hence switch to use unsigned long type when using hugepd_shift. In order to keep things simpler, we make sure we always use unsigned long type when using hugepd_shift() even though all the hugetlb page size won't require that. The hugetlb_free_p*d_range changes are all related to nohash usage where we can have multiple pgd entries pointing to the same hugepd entries. Hence on book3s64 where we can have > 4GB hugetlb page size we will always find more < next even if we compute the value of more correctly. Hence there is no functional change in this patch except that it fixes the below warning. UBSAN: shift-out-of-bounds in arch/powerpc/mm/hugetlbpage.c:499:21 shift exponent 34 is too large for 32-bit type 'int' CPU: 39 PID: 1673 Comm: a.out Not tainted 6.0.0-rc2-00327-gee88a56e8517-dirty #1 Call Trace: dump_stack_lvl+0x98/0xe0 (unreliable) ubsan_epilogue+0x18/0x70 __ubsan_handle_shift_out_of_bounds+0x1bc/0x390 hugetlb_free_pgd_range+0x5d8/0x600 free_pgtables+0x114/0x290 exit_mmap+0x150/0x550 mmput+0xcc/0x210 do_exit+0x420/0xdd0 do_group_exit+0x4c/0xd0 sys_exit_group+0x24/0x30 system_call_exception+0x250/0x600 system_call_common+0xec/0x250 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [mpe: Drop generic change to be sent separately, change 1ULL to 1UL] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220908072440.258301-1-aneesh.kumar@linux.ibm.com
2022-09-30powerpc/mm: Always update max/min_low_pfn in mem_topology_setup()Aneesh Kumar K.V1-3/+3
For both CONFIG_NUMA enabled/disabled use mem_topology_setup() to update max/min_low_pfn. This also adds min_low_pfn update to CONFIG_NUMA which was initialized to zero before. (mpe: Though MEMORY_START is == 0 for PPC64=y which is all possible NUMA=y systems) Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220704063851.295482-1-aneesh.kumar@linux.ibm.com
2022-09-30powerpc/mm/book3s/hash: Rename flush_tlb_pmd_rangeAneesh Kumar K.V2-2/+2
This function does the hash page table update. Hence rename it to indicate this better to avoid confusion with flush_pmd_tlb_range() Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [mpe: Drop unnecessary extern] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220907081941.209501-1-aneesh.kumar@linux.ibm.com
2022-09-28powerpc: Ignore DSI error caused by the copy/paste instructionHaren Myneni1-1/+16
The data storage interrupt (DSI) error will be generated when the paste operation is issued on the suspended Nest Accelerator (NX) window due to NX state changes. The hypervisor expects the partition to ignore this error during page fault handling. To differentiate DSI caused by an actual HW configuration or by the NX window, a new “ibm,pi-features” type value is defined. Byte 0, bit 3 of pi-attribute-specifier-type is now defined to indicate this DSI error. If this error is not ignored, the user space can get SIGBUS when the NX request is issued. This patch adds changes to read ibm,pi-features property and ignore DSI error during page fault handling if MMU_FTR_NX_DSI is defined. Signed-off-by: Haren Myneni <haren@linux.ibm.com> [mpe: Mention PAPR version in comment] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/b9cd844b85eb8f70459109ce1b14e44c4cc85fa7.camel@linux.ibm.com
2022-09-28powerpc/64e: provide an addressing macro for use with TOC in alternate registerNicholas Piggin1-1/+1
The interrupt entry code carefully saves a minimal number of registers, so in some places the TOC is required, it is loaded into a different register, so provide a macro that can supply an alternate TOC register. This continues to use got addressing because TOC-relative results in "got/toc optimization is not supported" messages by the linker. Having r2 be one of the saved registers and using that for TOC addressing may be the best way to avoid that and switch this to TOC addressing. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220926034057.2360083-6-npiggin@gmail.com
2022-09-28powerpc/64: provide a helper macro to load r2 with the kernel TOCNicholas Piggin1-1/+1
A later change stops the kernel using r2 and loads it with a poison value. Provide a PACATOC loading abstraction which can hide this detail. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220926034057.2360083-5-npiggin@gmail.com
2022-09-28powerpc/64s: Enable KFENCE on book3s64Nicholas Miehlbradt2-7/+9
KFENCE support was added for ppc32 in commit 90cbac0e995d ("powerpc: Enable KFENCE for PPC32"). Enable KFENCE on ppc64 architecture with hash and radix MMUs. It uses the same mechanism as debug pagealloc to protect/unprotect pages. All KFENCE kunit tests pass on both MMUs. KFENCE memory is initially allocated using memblock but is later marked as SLAB allocated. This necessitates the change to __pud_free to ensure that the KFENCE pages are freed appropriately. Based on previous work by Christophe Leroy and Jordan Niethe. Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com> Reviewed-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220926075726.2846-4-nicholas@linux.ibm.com
2022-09-28powerpc/64s: Allow double call of kernel_[un]map_linear_page()Christophe Leroy1-1/+7
If the page is already mapped resp. already unmapped, bail out. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220926075726.2846-3-nicholas@linux.ibm.com
2022-09-28powerpc/64s: Remove unneeded #ifdef CONFIG_DEBUG_PAGEALLOC in hash_utilsChristophe Leroy1-7/+2
debug_pagealloc_enabled() is always defined and constant folds to 'false' when CONFIG_DEBUG_PAGEALLOC is not enabled. Remove the #ifdefs, the code and associated static variables will be optimised out by the compiler when CONFIG_DEBUG_PAGEALLOC is not defined. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220926075726.2846-2-nicholas@linux.ibm.com
2022-09-28powerpc/64s: Add DEBUG_PAGEALLOC for radixNicholas Miehlbradt1-4/+14
There is support for DEBUG_PAGEALLOC on hash but not on radix. Add support on radix. Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220926075726.2846-1-nicholas@linux.ibm.com
2022-09-27powerpc: remove mmap linked list walksMatthew Wilcox (Oracle)2-16/+8
Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-34-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flushYang Shi1-9/+0
The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will move to use RCU instead of disabling local interrupts in fast-GUP. Using an IPI is the old-styled way of serializing against fast-GUP although it still works as expected now. And fast-GUP now fixed the potential race with THP collapse by checking whether PMD is changed or not. So IPI broadcast in radix pmd collapse flush is not necessary anymore. But it is still needed for hash TLB. Link: https://lkml.kernel.org/r/20220907180144.555485-2-shy828301@gmail.com Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26powerpc: Remove impossible mmu_psize_defs[] on nohashChristophe Leroy1-49/+15
Today there is: if e500 or 8xx if e500 mmu_psize_defs[] = else if 8xx mmu_psize_defs[] = else mmu_psize_defs[] = endif endif The else leg is dead definition. Drop that else leg and rewrite as: if e500 mmu_psize_defs[] = endif if 8xx mmu_psize_defs[] = endif Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/030a843449f348c0b709ca5349640624f36a016f.1663606876.git.christophe.leroy@csgroup.eu