summaryrefslogtreecommitdiff
path: root/arch/powerpc
AgeCommit message (Collapse)AuthorFilesLines
2024-01-09Merge tag 'perf-core-2024-01-08' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: - Add branch stack counters ABI extension to better capture the growing amount of information the PMU exposes via branch stack sampling. There's matching tooling support. - Fix race when creating the nr_addr_filters sysfs file - Add Intel Sierra Forest and Grand Ridge intel/cstate PMU support - Add Intel Granite Rapids, Sierra Forest and Grand Ridge uncore PMU support - Misc cleanups & fixes * tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Factor out topology_gidnid_map() perf/x86/intel/uncore: Fix NULL pointer dereference issue in upi_fill_topology() perf/x86/amd: Reject branch stack for IBS events perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge perf/x86/intel/uncore: Support IIO free-running counters on GNR perf/x86/intel/uncore: Support Granite Rapids perf/x86/uncore: Use u64 to replace unsigned for the uncore offsets array perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR perf: Fix the nr_addr_filters fix perf/x86/intel/cstate: Add Grand Ridge support perf/x86/intel/cstate: Add Sierra Forest support x86/smp: Export symbol cpu_clustergroup_mask() perf/x86/intel/cstate: Cleanup duplicate attr_groups perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file perf/x86/intel: Support branch counters logging perf/x86/intel: Reorganize attrs and is_visible perf: Add branch_sample_call_stack perf/x86: Add PERF_X86_EVENT_NEEDS_BRANCH_STACK flag perf: Add branch stack counters
2024-01-09Merge tag 'powerpc-6.8-1' of ↵Linus Torvalds81-552/+1507
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add initial support to recognise the HeXin C2000 processor. - Add papr-vpd and papr-sysparm character device drivers for VPD & sysparm retrieval, so userspace tools can be adapted to avoid doing raw firmware calls from userspace. - Sched domains optimisations for shared processor partitions on P9/P10. - A series of optimisations for KVM running as a nested HV under PowerVM. - Other small features and fixes. Thanks to Aditya Gupta, Aneesh Kumar K.V, Arnd Bergmann, Christophe Leroy, Colin Ian King, Dario Binacchi, David Heidelberg, Geoff Levand, Gustavo A. R. Silva, Haoran Liu, Jordan Niethe, Kajol Jain, Kevin Hao, Kunwu Chan, Li kunyu, Li zeming, Masahiro Yamada, Michal Suchánek, Nathan Lynch, Naveen N Rao, Nicholas Piggin, Randy Dunlap, Sathvika Vasireddy, Srikar Dronamraju, Stephen Rothwell, Vaibhav Jain, and Zhao Ke. * tag 'powerpc-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (96 commits) powerpc/ps3_defconfig: Disable PPC64_BIG_ENDIAN_ELF_ABI_V2 powerpc/86xx: Drop unused CONFIG_MPC8610 powerpc/powernv: Add error handling to opal_prd_range_is_valid selftests/powerpc: Fix spelling mistake "EACCESS" -> "EACCES" powerpc/hvcall: Reorder Nestedv2 hcall opcodes powerpc/ps3: Add missing set_freezable() for ps3_probe_thread() powerpc/mpc83xx: Use wait_event_freezable() for freezable kthread powerpc/mpc83xx: Add the missing set_freezable() for agent_thread_fn() powerpc/fsl: Fix fsl,tmu-calibration to match the schema powerpc/smp: Dynamically build Powerpc topology powerpc/smp: Avoid asym packing within thread_group of a core powerpc/smp: Add __ro_after_init attribute powerpc/smp: Disable MC domain for shared processor powerpc/smp: Enable Asym packing for cores on shared processor powerpc/sched: Cleanup vcpu_is_preempted() powerpc: add cpu_spec.cpu_features to vmcoreinfo powerpc/imc-pmu: Add a null pointer check in update_events_in_group() powerpc/powernv: Add a null pointer check in opal_powercap_init() powerpc/powernv: Add a null pointer check in opal_event_init() powerpc/powernv: Add a null pointer check to scom_debug_init_one() ...
2024-01-08Merge tag 'vfs-6.8.mount' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains the work to retrieve detailed information about mounts via two new system calls. This is hopefully the beginning of the end of the saga that started with fsinfo() years ago. The LWN articles in [1] and [2] can serve as a summary so we can avoid rehashing everything here. At LSFMM in May 2022 we got into a room and agreed on what we want to do about fsinfo(). Basically, split it into pieces. This is the first part of that agreement. Specifically, it is concerned with retrieving information about mounts. So this only concerns the mount information retrieval, not the mount table change notification, or the extended filesystem specific mount option work. That is separate work. Currently mounts have a 32bit id. Mount ids are already in heavy use by libmount and other low-level userspace but they can't be relied upon because they're recycled very quickly. We agreed that mounts should carry a unique 64bit id by which they can be referenced directly. This is now implemented as part of this work. The new 64bit mount id is exposed in statx() through the new STATX_MNT_ID_UNIQUE flag. If the flag isn't raised the old mount id is returned. If it is raised and the kernel supports the new 64bit mount id the flag is raised in the result mask and the new 64bit mount id is returned. New and old mount ids do not overlap so they cannot be conflated. Two new system calls are introduced that operate on the 64bit mount id: statmount() and listmount(). A summary of the api and usage can be found on LWN as well (cf. [3]) but of course, I'll provide a summary here as well. Both system calls rely on struct mnt_id_req. Which is the request struct used to pass the 64bit mount id identifying the mount to operate on. It is extensible to allow for the addition of new parameters and for future use in other apis that make use of mount ids. statmount() mimicks the semantics of statx() and exposes a set flags that userspace may raise in mnt_id_req to request specific information to be retrieved. A statmount() call returns a struct statmount filled in with information about the requested mount. Supported requests are indicated by raising the request flag passed in struct mnt_id_req in the @mask argument in struct statmount. Currently we do support: - STATMOUNT_SB_BASIC: Basic filesystem info - STATMOUNT_MNT_BASIC Mount information (mount id, parent mount id, mount attributes etc) - STATMOUNT_PROPAGATE_FROM Propagation from what mount in current namespace - STATMOUNT_MNT_ROOT Path of the root of the mount (e.g., mount --bind /bla /mnt returns /bla) - STATMOUNT_MNT_POINT Path of the mount point (e.g., mount --bind /bla /mnt returns /mnt) - STATMOUNT_FS_TYPE Name of the filesystem type as the magic number isn't enough due to submounts The string options STATMOUNT_MNT_{ROOT,POINT} and STATMOUNT_FS_TYPE are appended to the end of the struct. Userspace can use the offsets in @fs_type, @mnt_root, and @mnt_point to reference those strings easily. The struct statmount reserves quite a bit of space currently for future extensibility. This isn't really a problem and if this bothers us we can just send a follow-up pull request during this cycle. listmount() is given a 64bit mount id via mnt_id_req just as statmount(). It takes a buffer and a size to return an array of the 64bit ids of the child mounts of the requested mount. Userspace can thus choose to either retrieve child mounts for a mount in batches or iterate through the child mounts. For most use-cases it will be sufficient to just leave space for a few child mounts. But for big mount tables having an iterator is really helpful. Iterating through a mount table works by setting @param in mnt_id_req to the mount id of the last child mount retrieved in the previous listmount() call" Link: https://lwn.net/Articles/934469 [1] Link: https://lwn.net/Articles/829212 [2] Link: https://lwn.net/Articles/950569 [3] * tag 'vfs-6.8.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: add selftest for statmount/listmount fs: keep struct mnt_id_req extensible wire up syscalls for statmount/listmount add listmount(2) syscall statmount: simplify string option retrieval statmount: simplify numeric option retrieval add statmount(2) syscall namespace: extract show_path() helper mounts: keep list of mounts in an rbtree add unique mount ID
2024-01-06Merge tag 'mm-hotfixes-stable-2024-01-05-11-35' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc mm fixes from Andrew Morton: "12 hotfixes. Two are cc:stable and the remainder either address post-6.7 issues or aren't considered necessary for earlier kernel versions" * tag 'mm-hotfixes-stable-2024-01-05-11-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info() mailmap: add entries for Mathieu Othacehe MAINTAINERS: change vmware.com addresses to broadcom.com arch/mm/fault: fix major fault accounting when retrying under per-VMA lock mm/mglru: skip special VMAs in lru_gen_look_around() MAINTAINERS: hand over hwpoison maintainership to Miaohe Lin MAINTAINERS: remove hugetlb maintainer Mike Kravetz mm: fix unmap_mapping_range high bits shift bug mm: memcg: fix split queue list crash when large folio migration mm: fix arithmetic for max_prop_frac when setting max_ratio mm: fix arithmetic for bdi min_ratio mm: align larger anonymous mappings on THP boundaries
2023-12-29arch/mm/fault: fix major fault accounting when retrying under per-VMA lockSuren Baghdasaryan1-0/+2
A test [1] in Android test suite started failing after [2] was merged. It turns out that after handling a major fault under per-VMA lock, the process major fault counter does not register that fault as major. Before [2] read faults would be done under mmap_lock, in which case FAULT_FLAG_TRIED flag is set before retrying. That in turn causes mm_account_fault() to account the fault as major once retry completes. With per-VMA locks we often retry because a fault can't be handled without locking the whole mm using mmap_lock. Therefore such retries do not set FAULT_FLAG_TRIED flag. This logic does not work after [2] because we can now handle read major faults under per-VMA lock and upon retry the fact there was a major fault gets lost. Fix this by setting FAULT_FLAG_TRIED after retrying under per-VMA lock if VM_FAULT_MAJOR was returned. Ideally we would use an additional VM_FAULT bit to indicate the reason for the retry (could not handle under per-VMA lock vs other reason) but this simpler solution seems to work, so keeping it simple. [1] https://cs.android.com/android/platform/superproject/+/master:test/vts-testcase/kernel/api/drop_caches_prop/drop_caches_test.cpp [2] https://lore.kernel.org/all/20231006195318.4087158-6-willy@infradead.org/ Link: https://lkml.kernel.org/r/20231226214610.109282-1-surenb@google.com Fixes: 12214eba1992 ("mm: handle read faults under the VMA lock") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29Merge branch 'topic/ppc-kvm' into nextMichael Ellerman8-39/+107
Merge our topic branch containing KVM related patches.
2023-12-29powerpc/ps3_defconfig: Disable PPC64_BIG_ENDIAN_ELF_ABI_V2Geoff Levand1-0/+1
Commit 8c5fa3b5c4df ("powerpc/64: Make ELFv2 the default for big-endian builds"), merged in Linux-6.5-rc1 changes the calling ABI in a way that is incompatible with the current code for the PS3's LV1 hypervisor calls. This change just adds the line '# CONFIG_PPC64_BIG_ENDIAN_ELF_ABI_V2 is not set' to the ps3_defconfig file so that the PPC64_ELF_ABI_V1 is used. Fixes run time errors like these: BUG: Kernel NULL pointer dereference at 0x00000000 Faulting instruction address: 0xc000000000047cf0 Oops: Kernel access of bad area, sig: 11 [#1] Call Trace: [c0000000023039e0] [c00000000100ebfc] ps3_create_spu+0xc4/0x2b0 (unreliable) [c000000002303ab0] [c00000000100d4c4] create_spu+0xcc/0x3c4 [c000000002303b40] [c00000000100eae4] ps3_enumerate_spus+0xa4/0xf8 Fixes: 8c5fa3b5c4df ("powerpc/64: Make ELFv2 the default for big-endian builds") Cc: stable@vger.kernel.org # v6.5+ Signed-off-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/df906ac1-5f17-44b9-b0bb-7cd292a0df65@infradead.org
2023-12-29powerpc/86xx: Drop unused CONFIG_MPC8610Michael Ellerman1-7/+0
The MPC8610 symbol used to be default y if MPC8610_HPCD, but since MPC8610_HPCD was removed MPC8610 is now never used. Remove it. Fixes: 248667f8bbde ("powerpc: drop HPCD/MPC8610 evaluation platform support") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231123032902.2760818-1-mpe@ellerman.id.au
2023-12-28Merge tag 'mm-hotfixes-stable-2023-12-27-15-00' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "11 hotfixes. 7 are cc:stable and the other 4 address post-6.6 issues or are not considered backporting material" * tag 'mm-hotfixes-stable-2023-12-27-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: add an old address for Naoya Horiguchi mm/memory-failure: cast index to loff_t before shifting it mm/memory-failure: check the mapcount of the precise page mm/memory-failure: pass the folio and the page to collect_procs() selftests: secretmem: floor the memory size to the multiple of page_size mm: migrate high-order folios in swap cache correctly maple_tree: do not preallocate nodes for slot stores mm/filemap: avoid buffered read/write race to read inconsistent data kunit: kasan_test: disable fortify string checker on kmalloc_oob_memset kexec: select CRYPTO from KEXEC_FILE instead of depending on it kexec: fix KEXEC_FILE dependencies
2023-12-21powerpc/powernv: Add error handling to opal_prd_range_is_validHaoran Liu1-0/+2
In the opal_prd_range_is_valid function within opal-prd.c, error handling was missing for the of_get_address call. This patch adds necessary error checking, ensuring that the function gracefully handles scenarios where of_get_address fails. Signed-off-by: Haoran Liu <liuhaoran14@163.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231127144108.29782-1-liuhaoran14@163.com
2023-12-21powerpc/hvcall: Reorder Nestedv2 hcall opcodesVaibhav Jain1-10/+10
Reorder the newly introduced hcall opcodes for Nestedv2 to follow the increasing-opcode-number convention followed in 'hvcall.h'. Also updates the value for MAX_HCALL_OPCODE which is used in various places in arch code for range checking. Notably in the KVM enabled-hcall logic, and in hcall tracing. Fixes: 19d31c5f1157 ("KVM: PPC: Add support for nestedv2 guests") Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231219092309.118151-1-vaibhav@linux.ibm.com
2023-12-21powerpc/ps3: Add missing set_freezable() for ps3_probe_thread()Kevin Hao1-0/+1
The kernel thread function ps3_probe_thread() invokes the try_to_freeze() in its loop. But all the kernel threads are non-freezable by default. So if we want to make a kernel thread to be freezable, we have to invoke set_freezable() explicitly. Signed-off-by: Kevin Hao <haokexin@gmail.com> Acked-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231221044510.1802429-4-haokexin@gmail.com
2023-12-21powerpc/mpc83xx: Use wait_event_freezable() for freezable kthreadKevin Hao1-2/+1
A freezable kernel thread can enter frozen state during freezing by either calling try_to_freeze() or using wait_event_freezable() and its variants. So for the following snippet of code in a kernel thread loop: wait_event_interruptible(); try_to_freeze(); We can change it to a simple wait_event_freezable() and then eliminate a function call. Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231221044510.1802429-3-haokexin@gmail.com
2023-12-21powerpc/mpc83xx: Add the missing set_freezable() for agent_thread_fn()Kevin Hao1-0/+2
The kernel thread function agent_thread_fn() invokes the try_to_freeze() in its loop. But all the kernel threads are non-freezable by default. So if we want to make a kernel thread to be freezable, we have to invoke set_freezable() explicitly. Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231221044510.1802429-2-haokexin@gmail.com
2023-12-21kexec: fix KEXEC_FILE dependenciesArnd Bergmann1-2/+2
The cleanup for the CONFIG_KEXEC Kconfig logic accidentally changed the 'depends on CRYPTO=y' dependency to a plain 'depends on CRYPTO', which causes a link failure when all the crypto support is in a loadable module and kexec_file support is built-in: x86_64-linux-ld: vmlinux.o: in function `__x64_sys_kexec_file_load': (.text+0x32e30a): undefined reference to `crypto_alloc_shash' x86_64-linux-ld: (.text+0x32e58e): undefined reference to `crypto_shash_update' x86_64-linux-ld: (.text+0x32e6ee): undefined reference to `crypto_shash_final' Both s390 and x86 have this problem, while ppc64 and riscv have the correct dependency already. On riscv, the dependency is only used for the purgatory, not for the kexec_file code itself, which may be a bit surprising as it means that with CONFIG_CRYPTO=m, it is possible to enable KEXEC_FILE but then the purgatory code is silently left out. Move this into the common Kconfig.kexec file in a way that is correct everywhere, using the dependency on CRYPTO_SHA256=y only when the purgatory code is available. This requires reversing the dependency between ARCH_SUPPORTS_KEXEC_PURGATORY and KEXEC_FILE, but the effect remains the same, other than making riscv behave like the other ones. On s390, there is an additional dependency on CRYPTO_SHA256_S390, which should technically not be required but gives better performance. Remove this dependency here, noting that it was not present in the initial Kconfig code but was brought in without an explanation in commit 71406883fd357 ("s390/kexec_file: Add kexec_file_load system call"). [arnd@arndb.de: fix riscv build] Link: https://lkml.kernel.org/r/67ddd260-d424-4229-a815-e3fcfb864a77@app.fastmail.com Link: https://lkml.kernel.org/r/20231023110308.1202042-1-arnd@kernel.org Fixes: 6af5138083005 ("x86/kexec: refactor for kernel/Kconfig.kexec") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com> Tested-by: Eric DeVolder <eric_devolder@yahoo.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Conor Dooley <conor@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-19powerpc/fsl: Fix fsl,tmu-calibration to match the schemaDavid Heidelberg2-74/+76
fsl,tmu-calibration is defined as a u32 matrix in Documentation/devicetree/bindings/thermal/qoriq-thermal.yaml. Use matching property syntax. No functional changes. Signed-off-by: David Heidelberg <david@ixit.cz> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212184515.82886-2-david@ixit.cz
2023-12-17Merge tag 'powerpc-6.7-5' of ↵Linus Torvalds2-7/+46
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix a bug where heavy VAS (accelerator) usage could race with partition migration and prevent the migration from completing. - Update MAINTAINERS to add Aneesh & Naveen. Thanks to Haren Myneni. * tag 'powerpc-6.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: MAINTAINERS: powerpc: Add Aneesh & Naveen powerpc/pseries/vas: Migration suspend waits for no in-progress open windows
2023-12-16cred: get rid of CONFIG_DEBUG_CREDENTIALSJens Axboe1-1/+0
This code is rarely (never?) enabled by distros, and it hasn't caught anything in decades. Let's kill off this legacy debug code. Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-15Merge branch 'smp-topo' into nextMichael Ellerman1-54/+70
Merge a branch containing SMP topology updates from Srikar, purely so we can include the cover letter which has a lot of good detail here: PowerVM systems configured in shared processors mode have some unique challenges. Some device-tree properties will be missing on a shared processor. Hence some sched domains may not make sense for shared processor systems. Most shared processor systems are over-provisioned. Underlying PowerVM Hypervisor would schedule at a Big Core (SMT8) granularity. The most recent power processors support two almost independent cores. In a lightly loaded condition, it helps the overall system performance if we pack to lesser number of Big Cores. Since each thread-group is independent, running threads on both the thread-groups of a SMT8 core, should have a minimal adverse impact in non over provisioned scenarios. These changes in this patchset will not affect in the over provisioned scenario. If there are more threads than SMT domains, then asym_packing will not kick-in. System Configuration type=Shared mode=Uncapped smt=8 lcpu=96 mem=1066409344 kB cpus=96 ent=64.00 So *64 Entitled cores/ 96 Virtual processor* Scenario lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 768 On-line CPU(s) list: 0-767 Model name: POWER10 (architected), altivec supported Model: 2.0 (pvr 0080 0200) Thread(s) per core: 8 Core(s) per socket: 16 Socket(s): 6 Hypervisor vendor: pHyp Virtualization type: para L1d cache: 6 MiB (192 instances) L1i cache: 9 MiB (192 instances) NUMA node(s): 6 NUMA node0 CPU(s): 0-7,32-39,80-87,128-135,176-183,224-231,272-279,320-327,368-375,416-423,464-471,512-519,560-567,608-615,656-663,704-711,752-759 NUMA node1 CPU(s): 8-15,40-47,88-95,136-143,184-191,232-239,280-287,328-335,376-383,424-431,472-479,520-527,568-575,616-623,664-671,712-719,760-767 NUMA node4 CPU(s): 64-71,112-119,160-167,208-215,256-263,304-311,352-359,400-407,448-455,496-503,544-551,592-599,640-647,688-695,736-743 NUMA node5 CPU(s): 16-23,48-55,96-103,144-151,192-199,240-247,288-295,336-343,384-391,432-439,480-487,528-535,576-583,624-631,672-679,720-727 NUMA node6 CPU(s): 72-79,120-127,168-175,216-223,264-271,312-319,360-367,408-415,456-463,504-511,552-559,600-607,648-655,696-703,744-751 NUMA node7 CPU(s): 24-31,56-63,104-111,152-159,200-207,248-255,296-303,344-351,392-399,440-447,488-495,536-543,584-591,632-639,680-687,728-735 ebizzy -t 32 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 3840178 4059268 3978042 3973936.6 84264.456 +patch 5 3768393 3927901 3874994 3854046 71532.926 -3.01692 >From lparstat (when the workload stabilized) Kernel %user %sys %wait %idle physc %entc lbusy app vcsw phint 6.6.0-rc3 4.16 0.00 0.00 95.84 26.06 40.72 4.16 69.88 276906989 578 +patch 4.16 0.00 0.00 95.83 17.70 27.66 4.17 78.26 70436663 119 ebizzy -t 128 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 5520692 5981856 5717709 5727053.2 176093.2 +patch 5 5305888 6259610 5854590 5843311 375917.03 2.02998 >From lparstat (when the workload stabilized) Kernel %user %sys %wait %idle physc %entc lbusy app vcsw phint 6.6.0-rc3 16.66 0.00 0.00 83.33 45.49 71.08 16.67 50.50 288778533 581 +patch 16.65 0.00 0.00 83.35 30.15 47.11 16.65 65.76 85196150 133 ebizzy -t 512 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 19563921 20049955 19701510 19728733 198295.18 +patch 5 19455992 20176445 19718427 19832017 304094.05 0.523521 >From lparstat (when the workload stabilized) %Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint 66.6.0-rc3 6.44 0.01 0.00 33.55 94.14 147.09 66.45 1.33 313345175 621 6+patch 6.44 0.01 0.00 33.55 94.15 147.11 66.45 1.33 109193889 309 System Configuration type=Shared mode=Uncapped smt=8 lcpu=40 mem=1067539392 kB cpus=96 ent=40.00 So *40 Entitled cores/ 40 Virtual processor* Scenario lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 320 On-line CPU(s) list: 0-319 Model name: POWER10 (architected), altivec supported Model: 2.0 (pvr 0080 0200) Thread(s) per core: 8 Core(s) per socket: 10 Socket(s): 4 Hypervisor vendor: pHyp Virtualization type: para L1d cache: 2.5 MiB (80 instances) L1i cache: 3.8 MiB (80 instances) NUMA node(s): 4 NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295 NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 NUMA node4 CPU(s): 16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279,304-311 NUMA node5 CPU(s): 24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287,312-319 ebizzy -t 32 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 3535518 3864532 3745967 3704233.2 130216.76 +patch 5 3608385 3708026 3649379 3651596.6 37862.163 -1.42099 %Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint 6.6.0-rc3 10.00 0.01 0.00 89.99 22.98 57.45 10.01 41.01 1135139 262 +patch 10.00 0.00 0.00 90.00 16.95 42.37 10.00 47.05 925561 19 ebizzy -t 64 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 4434984 4957281 4548786 4591298.2 211770.2 +patch 5 4461115 4835167 4544716 4607795.8 151474.85 0.359323 %Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint 6.6.0-rc3 20.01 0.00 0.00 79.99 38.22 95.55 20.01 25.77 1287553 265 +patch 19.99 0.00 0.00 80.01 25.55 63.88 19.99 38.44 1077341 20 ebizzy -t 256 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 8850648 8982659 8951911 8936869.2 52278.031 +patch 5 8751038 9060510 8981409 8942268.4 117070.6 0.0604149 %Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint 6.6.0-rc3 80.02 0.01 0.01 19.96 40.00 100.00 80.03 24.00 1597665 276 +patch 80.02 0.01 0.01 19.96 40.00 100.00 80.03 23.99 1383921 63 Observation: We are able to see Improvement in ebizzy throughput even with lesser core utilization (almost half the core utilization) in low utilization scenarios while still retaining throughput in mid and higher utilization scenarios. Note: The numbers are with Uncapped + no-noise case. In the Capped and/or noise case, due to contention on the Cores, the numbers are expected to further improve. Note: The numbers included (sched/fair: Enable group_asym_packing in find_idlest_group) https://lore.kernel.org/all/20231018155036.2314342-1-srikar@linux.vnet.ibm.com/
2023-12-15powerpc/smp: Dynamically build Powerpc topologySrikar Dronamraju1-50/+28
Currently there are four Powerpc specific sched topologies. These are all statically defined. However not all these topologies are used by all Powerpc systems. To avoid unnecessary degenerations by the scheduler, masks and flags are compared. However if the sched topologies are build dynamically then the code is simpler and there are greater chances of avoiding degenerations. Note: Even X86 builds its sched topologies dynamically and proposed changes are very similar to the way X86 is building its topologies. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231214180720.310852-6-srikar@linux.vnet.ibm.com
2023-12-15powerpc/smp: Avoid asym packing within thread_group of a coreSrikar Dronamraju1-0/+13
PowerVM Hypervisor will schedule at a core granularity. However each core can have more than one thread_groups. For better utilization in case of a shared processor, its preferable for the scheduler to pack to the lowest core. However there is no benefit of moving a thread between two thread groups of the same core. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231214180720.310852-5-srikar@linux.vnet.ibm.com
2023-12-15powerpc/smp: Add __ro_after_init attributeSrikar Dronamraju1-5/+5
There are some variables that are only updated at boot time. So add __ro_after_init attribute to such variables Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231214180720.310852-4-srikar@linux.vnet.ibm.com
2023-12-15powerpc/smp: Disable MC domain for shared processorSrikar Dronamraju1-0/+4
Like L2-cache info, coregroup information which is used to determine MC sched domains is only present on dedicated LPARs. i.e PowerVM doesn't export coregroup information for shared processor LPARs. Hence disable creating MC domains on shared LPAR Systems. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231214180720.310852-3-srikar@linux.vnet.ibm.com
2023-12-15powerpc/smp: Enable Asym packing for cores on shared processorSrikar Dronamraju1-2/+23
If there are shared processor LPARs, underlying Hypervisor can have more virtual cores to handle than actual physical cores. Starting with Power 9, a big core (aka SMT8 core) has 2 nearly independent thread groups. On a shared processors LPARs, it helps to pack threads to lesser number of cores so that the overall system performance and utilization improves. PowerVM schedules at a big core level. Hence packing to fewer cores helps. Since each thread-group is independent, running threads on both the thread-groups of a SMT8 core, should have a minimal adverse impact in non over provisioned scenarios. These changes in this patchset will not affect in the over provisioned scenario. If there are more threads than SMT domains, then asym_packing will not kick-in For example: Lets says there are two 8-core Shared LPARs that are actually sharing a 8 Core shared physical pool, each running 8 threads each. Then Consolidating 8 threads to 4 cores on each LPAR would help them to perform better. This is because each of the LPAR will get 100% time to run applications and there will no switching required by the Hypervisor. To achieve this, enable SD_ASYM_PACKING flag at CACHE, MC and DIE level when the system is running in shared processor mode and has big cores. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231214180720.310852-2-srikar@linux.vnet.ibm.com
2023-12-15powerpc/sched: Cleanup vcpu_is_preempted()Aneesh Kumar K.V1-8/+25
No functional change in this patch. A helper is added to find if vcpu is dispatched by hypervisor. Use that instead of opencoding. Also clarify some of the comments. Signed-off-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231114071219.198222-1-aneesh.kumar@linux.ibm.com
2023-12-14wire up syscalls for statmount/listmountMiklos Szeredi1-0/+2
Wire up all archs. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20231025140205.3586473-7-mszeredi@redhat.com Reviewed-by: Ian Kent <raven@themaw.net> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-13powerpc: add cpu_spec.cpu_features to vmcoreinfoAditya Gupta1-0/+1
CPU features can be determined in makedumpfile, using 'cur_cpu_spec.cpu_features'. This provides more data to makedumpfile about the crashed system, and can help in filtering the vmcore accordingly. Signed-off-by: Aditya Gupta <adityag@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230920105706.853626-2-adityag@linux.ibm.com
2023-12-13powerpc/imc-pmu: Add a null pointer check in update_events_in_group()Kunwu Chan1-0/+6
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Fixes: 885dcd709ba9 ("powerpc/perf: Add nest IMC PMU support") Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231126093719.1440305-1-chentao@kylinos.cn
2023-12-13powerpc/powernv: Add a null pointer check in opal_powercap_init()Kunwu Chan1-0/+6
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Fixes: b9ef7b4b867f ("powerpc: Convert to using %pOFn instead of device_node.name") Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231126095739.1501990-1-chentao@kylinos.cn
2023-12-13powerpc/powernv: Add a null pointer check in opal_event_init()Kunwu Chan1-0/+2
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Fixes: 2717a33d6074 ("powerpc/opal-irqchip: Use interrupt names if present") Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231127030755.1546750-1-chentao@kylinos.cn
2023-12-13powerpc/powernv: Add a null pointer check to scom_debug_init_one()Kunwu Chan1-0/+5
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Add a null pointer check, and release 'ent' to avoid memory leaks. Fixes: bfd2f0d49aef ("powerpc/powernv: Get rid of old scom_controller abstraction") Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231208085937.107210-1-chentao@kylinos.cn
2023-12-13powerpc/mm: Fix null-pointer dereference in pgtable_cache_addKunwu Chan1-2/+3
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Ensure the allocation was successful by checking the pointer validity. Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231204023223.2447523-1-chentao@kylinos.cn
2023-12-13powerpc/pseries/vas: Migration suspend waits for no in-progress open windowsHaren Myneni2-7/+46
The hypervisor returns migration failure if all VAS windows are not closed. During pre-migration stage, vas_migration_handler() sets migration_in_progress flag and closes all windows from the list. The allocate VAS window routine checks the migration flag, setup the window and then add it to the list. So there is possibility of the migration handler missing the window that is still in the process of setup. t1: Allocate and open VAS t2: Migration event window lock vas_pseries_mutex If migration_in_progress set unlock vas_pseries_mutex return open window HCALL unlock vas_pseries_mutex Modify window HCALL lock vas_pseries_mutex setup window migration_in_progress=true Closes all windows from the list // May miss windows that are // not in the list unlock vas_pseries_mutex lock vas_pseries_mutex return if nr_closed_windows == 0 // No DLPAR CPU or migration add window to the list // Window will be added to the // list after the setup is completed unlock vas_pseries_mutex return unlock vas_pseries_mutex Close VAS window // due to DLPAR CPU or migration return -EBUSY This patch resolves the issue with the following steps: - Set the migration_in_progress flag without holding mutex. - Introduce nr_open_wins_progress counter in VAS capabilities struct - This counter tracks the number of open windows are still in progress - The allocate setup window thread closes windows if the migration is set and decrements nr_open_window_progress counter - The migration handler waits for no in-progress open windows. The code flow with the fix is as follows: t1: Allocate and open VAS t2: Migration event window lock vas_pseries_mutex If migration_in_progress set unlock vas_pseries_mutex return open window HCALL nr_open_wins_progress++ // Window opened, but not // added to the list yet unlock vas_pseries_mutex Modify window HCALL migration_in_progress=true setup window lock vas_pseries_mutex Closes all windows from the list While nr_open_wins_progress { unlock vas_pseries_mutex lock vas_pseries_mutex sleep if nr_closed_windows == 0 // Wait if any open window in or migration is not started // progress. The open window // No DLPAR CPU or migration // thread closes the window without add window to the list // adding to the list and return if nr_open_wins_progress-- // the migration is in progress. unlock vas_pseries_mutex return Close VAS window nr_open_wins_progress-- unlock vas_pseries_mutex return -EBUSY lock vas_pseries_mutex } unlock vas_pseries_mutex return Fixes: 37e6764895ef ("powerpc/pseries/vas: Add VAS migration handler") Signed-off-by: Haren Myneni <haren@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231125235104.3405008-1-haren@linux.ibm.com
2023-12-13powerpc/Kconfig: Select FUNCTION_ALIGNMENT_4BSathvika Vasireddy2-3/+1
Commit d49a0626216b95 ("arch: Introduce CONFIG_FUNCTION_ALIGNMENT") introduced a generic function-alignment infrastructure. Move to using FUNCTION_ALIGNMENT_4B on powerpc, to use the same alignment as that of the existing _GLOBAL macro. Signed-off-by: Sathvika Vasireddy <sv@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/21892186ec44abe24df0daf64f577dac0e78783f.1702045299.git.naveen@kernel.org
2023-12-13powerpc/ftrace: Remove nops after the call to ftrace_stubNaveen N Rao1-2/+0
ftrace_stub is within the same CU, so there is no need for a subsequent nop instruction. Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/8ee5ec520e37d5523654bb2cd65a17512fb774e2.1702045299.git.naveen@kernel.org
2023-12-13powerpc/ftrace: Fix indentation in ftrace.hNaveen N Rao1-1/+1
Replace seven spaces with a tab character to fix an indentation issue reported by the kernel test robot. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311221731.alUwTDIm-lkp@intel.com/ Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/9f058227bd9243f0842786ef7228d87ab10d29f6.1702045299.git.naveen@kernel.org
2023-12-13powerpc/pseries/papr-sysparm: Expose character device to user spaceNathan Lynch3-8/+225
Until now the papr_sysparm APIs have been kernel-internal. But user space needs access to PAPR system parameters too. The only method available to user space today to get or set system parameters is using sys_rtas() and /dev/mem to pass RTAS-addressable buffers between user space and firmware. This is incompatible with lockdown and should be deprecated. So provide an alternative ABI to user space in the form of a /dev/papr-sysparm character device with just two ioctl commands (get and set). The data payloads involved are small enough to fit in the ioctl argument buffer, making the code relatively simple. Exposing the system parameters through sysfs has been considered but it would be too awkward: * The kernel currently does not have to contain an exhaustive list of defined system parameters. This is a convenient property to maintain because we don't have to update the kernel whenever a new parameter is added to PAPR. Exporting a named attribute in sysfs for each parameter would negate this. * Some system parameters are text-based and some are not. * Retrieval of at least one system parameter requires input data, which a simple read-oriented interface can't support. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-11-e9eafd0c8c6c@linux.ibm.com
2023-12-13powerpc/pseries/papr-sysparm: Validate buffer object lengthsNathan Lynch1-0/+47
The ability to get and set system parameters will be exposed to user space, so let's get a little more strict about malformed papr_sysparm_buf objects. * Create accessors for the length field of struct papr_sysparm_buf. The length is always stored in MSB order and this is better than spreading the necessary conversions all over. * Reject attempts to submit invalid buffers to RTAS. * Warn if RTAS returns a buffer with an invalid length, clamping the returned length to a safe value that won't overrun the buffer. These are meant as precautionary measures to mitigate both firmware and kernel bugs in this area, should they arise, but I am not aware of any. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-10-e9eafd0c8c6c@linux.ibm.com
2023-12-13powerpc/pseries: Add papr-vpd character driver for VPD retrievalNathan Lynch4-0/+573
PowerVM LPARs may retrieve Vital Product Data (VPD) for system components using the ibm,get-vpd RTAS function. We can expose this to user space with a /dev/papr-vpd character device, where the programming model is: struct papr_location_code plc = { .str = "", }; /* obtain all VPD */ int devfd = open("/dev/papr-vpd", O_RDONLY); int vpdfd = ioctl(devfd, PAPR_VPD_CREATE_HANDLE, &plc); size_t size = lseek(vpdfd, 0, SEEK_END); char *buf = malloc(size); pread(devfd, buf, size, 0); When a file descriptor is obtained from ioctl(PAPR_VPD_CREATE_HANDLE), the file contains the result of a complete ibm,get-vpd sequence. The file contents are immutable from the POV of user space. To get a new view of the VPD, the client must create a new handle. This design choice insulates user space from most of the complexities that ibm,get-vpd brings: * ibm,get-vpd must be called more than once to obtain complete results. * Only one ibm,get-vpd call sequence should be in progress at a time; interleaved sequences will disrupt each other. Callers must have a protocol for serializing their use of the function. * A call sequence in progress may receive a "VPD changed, try again" status, requiring the client to abandon the sequence and start over. The memory required for the VPD buffers seems acceptable, around 20KB for all VPD on one of my systems. And the value of the /rtas/ibm,vpd-size DT property (the estimated maximum size of VPD) is consistently 300KB across various systems I've checked. I've implemented support for this new ABI in the rtas_get_vpd() function in librtas, which the vpdupdate command currently uses to populate its VPD database. I've verified that an unmodified vpdupdate binary generates an identical database when using a librtas.so that prefers the new ABI. Along with the papr-vpd.h header exposed to user space, this introduces a common papr-miscdev.h uapi header to share a base ioctl ID with similar drivers to come. Tested-by: Michal Suchánek <msuchanek@suse.de> Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-9-e9eafd0c8c6c@linux.ibm.com
2023-12-13powerpc/rtas: Warn if per-function lock isn't heldNathan Lynch1-12/+9
If the function descriptor has a populated lock member, then callers are required to hold it across calls. Now that the firmware activation sequence is appropriately guarded, we can warn when the requirement isn't satisfied. __do_enter_rtas_trace() gets reorganized a bit as a result of performing the function descriptor lookup unconditionally now. Reviewed-by: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-8-e9eafd0c8c6c@linux.ibm.com
2023-12-13powerpc/rtas: Serialize firmware activation sequencesNathan Lynch1-0/+4
Use rtas_ibm_activate_firmware_lock to prevent interleaving call sequences of the ibm,activate-firmware RTAS function, which typically requires multiple calls to complete the update. While the spec does not specifically prohibit interleaved sequences, there's almost certainly no advantage to allowing them. Reviewed-by: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-7-e9eafd0c8c6c@linux.ibm.com
2023-12-13powerpc/rtas: Facilitate high-level call sequencesNathan Lynch2-0/+86
On RTAS platforms there is a general restriction that the OS must not enter RTAS on more than one CPU at a time. This low-level serialization requirement is satisfied by holding a spin lock (rtas_lock) across most RTAS function invocations. However, some pseries RTAS functions require multiple successive calls to complete a logical operation. Beginning a new call sequence for such a function may disrupt any other sequences of that function already in progress. Safe and reliable use of these functions effectively requires higher-level serialization beyond what is already done at the level of RTAS entry and exit. Where a sequence-based RTAS function is invoked only through sys_rtas(), with no in-kernel users, there is no issue as far as the kernel is concerned. User space is responsible for appropriately serializing its call sequences. (Whether user space code actually takes measures to prevent sequence interleaving is another matter.) Examples of such functions currently include ibm,platform-dump and ibm,get-vpd. But where a sequence-based RTAS function has both user space and in-kernel uesrs, there is a hazard. Even if the in-kernel call sites of such a function serialize their sequences correctly, a user of sys_rtas() can invoke the same function at any time, potentially disrupting a sequence in progress. So in order to prevent disruption of kernel-based RTAS call sequences, they must serialize not only with themselves but also with sys_rtas() users, somehow. Preferably without adding more function-specific hacks to sys_rtas(). This is a prerequisite for adding an in-kernel call sequence of ibm,get-vpd, which is in a change to follow. Note that it has never been feasible for the kernel to prevent sys_rtas()-based sequences from being disrupted because control returns to user space on every call. sys_rtas()-based users of these functions have always been, and continue to be, responsible for coordinating their call sequences with other users, even those which may invoke the RTAS functions through less direct means than sys_rtas(). This is an unavoidable consequence of exposing sequence-based RTAS functions through sys_rtas(). * Add an optional mutex member to struct rtas_function. * Statically define a mutex for each RTAS function with known call sequence serialization requirements, and assign its address to the .lock member of the corresponding function table entry, along with justifying commentary. * In sys_rtas(), if the table entry for the RTAS function being called has a populated lock member, acquire it before taking rtas_lock and entering RTAS. * Kernel-based RTAS call sequences are expected to access the appropriate mutex explicitly by name. For example, a user of the ibm,activate-firmware RTAS function would do: int token = rtas_function_token(RTAS_FN_IBM_ACTIVATE_FIRMWARE); int fwrc; mutex_lock(&rtas_ibm_activate_firmware_lock); do { fwrc = rtas_call(token, 0, 1, NULL); } while (rtas_busy_delay(fwrc)); mutex_unlock(&rtas_ibm_activate_firmware_lock); There should be no perceivable change introduced here except that concurrent callers of the same RTAS function via sys_rtas() may block on a mutex instead of spinning on rtas_lock. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-6-e9eafd0c8c6c@linux.ibm.com
2023-12-13powerpc/rtas: Move token validation from block_rtas_call() to sys_rtas()Nathan Lynch1-16/+16
The rtas system call handler sys_rtas() delegates certain input validation steps to a helper function: block_rtas_call(). One of these steps ensures that the user-supplied token value maps to a known RTAS function. This is done by performing a "reverse" token-to-function lookup via rtas_token_to_function_untrusted() to obtain an rtas_function object. In changes to come, sys_rtas() itself will need the function descriptor for the token. To prepare: * Move the lookup and validation up into sys_rtas() and pass the resulting rtas_function pointer to block_rtas_call(), which is otherwise unconcerned with the token value. * Change block_rtas_call() to report the RTAS function name instead of the token value on validation failures, since it can now rely on having a valid function descriptor. One behavior change is that sys_rtas() now silently errors out when passed a bad token, before calling block_rtas_call(). So we will no longer log "RTAS call blocked - exploit attempt?" on invalid tokens. This is consistent with how sys_rtas() currently handles other "metadata" (nargs and nret), while block_rtas_call() is primarily concerned with validating the arguments to be passed to specific RTAS functions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-5-e9eafd0c8c6c@linux.ibm.com
2023-12-13powerpc/rtas: Add function return status constantsNathan Lynch1-6/+19
Not all of the generic RTAS function statuses specified in PAPR have symbolic constants and descriptions in rtas.h. Fix this, providing a little more background, slightly updating the existing wording, and improving the formatting. Reviewed-by: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-4-e9eafd0c8c6c@linux.ibm.com
2023-12-13powerpc/rtas: Fall back to linear search on failed token->function lookupNathan Lynch1-4/+14
Enabling any of the powerpc:rtas_* tracepoints at boot is likely to result in an oops on RTAS platforms. For example, booting a QEMU pseries model with 'trace_event=powerpc:rtas_input' in the command line leads to: BUG: Kernel NULL pointer dereference on read at 0x00000008 Oops: Kernel access of bad area, sig: 7 [#1] NIP [c00000000004231c] do_enter_rtas+0x1bc/0x460 LR [c00000000004231c] do_enter_rtas+0x1bc/0x460 Call Trace: do_enter_rtas+0x1bc/0x460 (unreliable) rtas_call+0x22c/0x4a0 rtas_get_boot_time+0x80/0x14c read_persistent_clock64+0x124/0x150 read_persistent_wall_and_boot_offset+0x28/0x58 timekeeping_init+0x70/0x348 start_kernel+0xa0c/0xc1c start_here_common+0x1c/0x20 (This is preceded by a warning for the failed lookup in rtas_token_to_function().) This happens when __do_enter_rtas_trace() attempts a token to function descriptor lookup before the xarray containing the mappings has been set up. Fall back to linear scan of the table if rtas_token_to_function_xarray is empty. Fixes: 24098f580e2b ("powerpc/rtas: add tracepoints around RTAS entry") Reviewed-by: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-3-e9eafd0c8c6c@linux.ibm.com
2023-12-13powerpc/rtas: Add for_each_rtas_function() iteratorNathan Lynch1-2/+7
Add a convenience macro for iterating over every element of the internal function table and convert the one site that can use it. An additional user of the macro is anticipated in changes to follow. Reviewed-by: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-2-e9eafd0c8c6c@linux.ibm.com
2023-12-13powerpc/rtas: Avoid warning on invalid token argument to sys_rtas()Nathan Lynch1-2/+17
rtas_token_to_function() WARNs when passed an invalid token; it's meant to catch bugs in kernel-based users of RTAS functions. However, user space controls the token value passed to rtas_token_to_function() by block_rtas_call(), so user space with sufficient privilege to use sys_rtas() can trigger the warnings at will: unexpected failed lookup for token 2048 WARNING: CPU: 20 PID: 2247 at arch/powerpc/kernel/rtas.c:556 rtas_token_to_function+0xfc/0x110 ... NIP rtas_token_to_function+0xfc/0x110 LR rtas_token_to_function+0xf8/0x110 Call Trace: rtas_token_to_function+0xf8/0x110 (unreliable) sys_rtas+0x188/0x880 system_call_exception+0x268/0x530 system_call_common+0x160/0x2c4 It's desirable to continue warning on bogus tokens in rtas_token_to_function(). Currently it is used to look up RTAS function descriptors when tracing, where we know there has to have been a successful descriptor lookup by different means already, and it would be a serious inconsistency for the reverse lookup to fail. So instead of weakening rtas_token_to_function()'s contract by removing the warnings, introduce rtas_token_to_function_untrusted(), which has no opinion on failed lookups. Convert block_rtas_call() and rtas_token_to_function() to use it. Fixes: 8252b88294d2 ("powerpc/rtas: improve function information lookups") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-1-e9eafd0c8c6c@linux.ibm.com
2023-12-13powerpc/hv-gpci: Add return value check in ↵Kajol Jain1-0/+3
affinity_domain_via_partition_show function To access hv-gpci kernel interface files data, the "Enable Performance Information Collection" option has to be set in hmc. Incase that option is not set and user try to read the interface files, it should give error message as operation not permitted. Result of accessing added interface files with disabled performance collection option: [command]# cat processor_bus_topology cat: processor_bus_topology: Operation not permitted [command]# cat processor_config cat: processor_config: Operation not permitted [command]# cat affinity_domain_via_domain cat: affinity_domain_via_domain: Operation not permitted [command]# cat affinity_domain_via_virtual_processor cat: affinity_domain_via_virtual_processor: Operation not permitted [command]# cat affinity_domain_via_partition Based on above result there is no error message when reading affinity_domain_via_partition file because of missing check for failed hcall. Fix this issue by adding a check in the start of affinity_domain_via_partition_show function, to return error incase hcall fails, with error type other then H_PARAMETER. Fixes: a15e0d6a6929 ("powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show affinity domain via partition information") Reported-by: Disha Goel <disgoel@linux.vnet.ibm.com> Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231116122033.160964-1-kjain@linux.ibm.com
2023-12-07powerpc/Makefile: Auto detect cross compilerMichael Ellerman1-0/+11
If no cross compiler is specified, try to auto detect one. Look for various combinations, matching: powerpc(64(le)?)?(-unknown)?-linux(-gnu)?- There are more possibilities, but the above is known to find a compiler on Fedora and Ubuntu (which use linux-gnu-), and also detects the kernel.org cross compilers (which use linux-). This allows cross compiling with simply: # Ubuntu $ sudo apt install gcc-powerpc-linux-gnu # Fedora $ sudo dnf install gcc-powerpc64-linux-gnu $ make ARCH=powerpc defconfig $ make ARCH=powerpc -j 4 Inspired by arch/parisc/Makefile. Acked-by: Segher Boessenkool <segher@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231206115548.1466874-4-mpe@ellerman.id.au
2023-12-07powerpc/Makefile: Default to ppc64le_defconfig when cross buildingMichael Ellerman1-2/+2
If the kernel is being cross compiled, there is no information from uname on which defconfig is most appropriate, so the Makefile defaults to ppc64. However these days almost all distros that support powerpc are little endian, so it's more likely that defaulting to ppc64le_defconfig will produce something useful for a user. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231206115548.1466874-3-mpe@ellerman.id.au