summaryrefslogtreecommitdiff
path: root/arch/s390
AgeCommit message (Collapse)AuthorFilesLines
2023-06-09Merge tag 's390-6.4-3' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Alexander Gordeev: - Avoid linker error for randomly generated config file that has CONFIG_BRANCH_PROFILE_NONE enabled and make it similar to riscv, x86 and also to commit 4bf3ec384edf ("s390: disable branch profiling for vdso"). - Currently, if the device is offline and all the channel paths are either configured or varied offline, the associated subchannel gets unregistered. Don't unregister the subchannel, instead unregister offline device. * tag 's390-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/purgatory: disable branch profiling s390/cio: unregister device when the only path is gone
2023-06-01s390/purgatory: disable branch profilingAlexander Gordeev1-0/+1
Avoid linker error for randomly generated config file that has CONFIG_BRANCH_PROFILE_NONE enabled and make it similar to riscv, x86 and also to commit 4bf3ec384edf ("s390: disable branch profiling for vdso"). Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-05-19Merge tag 's390-6.4-2' of ↵Linus Torvalds9-35/+26
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Alexander Gordeev: - Add check whether the required facilities are installed before using the s390-specific ChaCha20 implementation - Key blobs for s390 protected key interface IOCTLs commands PKEY_VERIFYKEY2 and PKEY_VERIFYKEY3 may contain clear key material. Zeroize copies of these keys in kernel memory after creating protected keys - Set CONFIG_INIT_STACK_NONE=y in defconfigs to avoid extra overhead of initializing all stack variables by default - Make sure that when a new channel-path is enabled all subchannels are evaluated: with and without any devices connected on it - When SMT thread CPUs are added to CPU topology masks the nr_cpu_ids limit is not checked and could be exceeded. Respect the nr_cpu_ids limit and avoid a warning when CONFIG_DEBUG_PER_CPU_MAPS is set - The pointer to IPL Parameter Information Block is stored in the absolute lowcore as a virtual address. Save it as the physical address for later use by dump tools - Fix a Queued Direct I/O (QDIO) problem on z/VM guests using QIOASSIST with dedicated (pass through) QDIO-based devices such as FCP, real OSA or HiperSockets - s390's struct statfs and struct statfs64 contain padding, which field-by-field copying does not set. Initialize the respective structures with zeros before filling them and copying to userspace - Grow s390 compat_statfs64, statfs and statfs64 structures f_spare array member to cover padding and simplify things - Remove obsolete SCHED_BOOK and SCHED_DRAWER configs - Remove unneeded S390_CCW_IOMMU and S390_AP_IOM configs * tag 's390-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/iommu: get rid of S390_CCW_IOMMU and S390_AP_IOMMU s390/Kconfig: remove obsolete configs SCHED_{BOOK,DRAWER} s390/uapi: cover statfs padding by growing f_spare statfs: enforce statfs[64] structure initialization s390/qdio: fix do_sqbs() inline assembly constraint s390/ipl: fix IPIB virtual vs physical address confusion s390/topology: honour nr_cpu_ids when adding CPUs s390/cio: include subchannels without devices also for evaluation s390/defconfigs: set CONFIG_INIT_STACK_NONE=y s390/pkey: zeroize key blobs s390/crypto: use vector instructions only if available for ChaCha20
2023-05-18rethook, fprobe: do not trace rethook related functionsZe Gao1-0/+1
These functions are already marked as NOKPROBE to prevent recursion and we have the same reason to blacklist them if rethook is used with fprobe, since they are beyond the recursion-free region ftrace can guard. Link: https://lore.kernel.org/all/20230517034510.15639-5-zegao@tencent.com/ Fixes: f3a112c0c40d ("x86,rethook,kprobes: Replace kretprobe with rethook on x86") Signed-off-by: Ze Gao <zegao@tencent.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-05-17s390/iommu: get rid of S390_CCW_IOMMU and S390_AP_IOMMUJason Gunthorpe3-6/+1
These don't do anything anymore, the only user of the symbol was VFIO_CCW/AP which already "depends on VFIO" and VFIO itself selects IOMMU_API. When this was added VFIO was wrongly doing "depends on IOMMU_API" which required some contortions like this to ensure IOMMU_API was turned on. Reviewed-by: Eric Farman <farman@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/0-v2-eb322ce2e547+188f-rm_iommu_ccw_jgg@nvidia.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-05-17s390/Kconfig: remove obsolete configs SCHED_{BOOK,DRAWER}Lukas Bulwahn1-8/+0
Commit f1045056c726 ("topology/sysfs: rework book and drawer topology ifdefery") activates the book and drawer topology, previously activated by CONFIG_SCHED_{BOOK,DRAWER}, dependent on the existence of certain macro definitions. Hence, since then, CONFIG_SCHED_{BOOK,DRAWER} have no effect and any further purpose. Remove the obsolete configs SCHED_{BOOK,DRAWER}. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/r/20230508040916.16733-1-lukas.bulwahn@gmail.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-05-17s390/uapi: cover statfs padding by growing f_spareIlya Leoshkevich2-3/+3
pahole says: struct compat_statfs64 { ... u32 f_spare[4]; /* 68 16 */ /* size: 88, cachelines: 1, members: 12 */ /* padding: 4 */ struct statfs { ... unsigned int f_spare[4]; /* 68 16 */ /* size: 88, cachelines: 1, members: 12 */ /* padding: 4 */ struct statfs64 { ... unsigned int f_spare[4]; /* 68 16 */ /* size: 88, cachelines: 1, members: 12 */ /* padding: 4 */ One has to keep the existence of padding in mind when working with these structs. Grow f_spare arrays to 5 in order to simplify things. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20230504144021.808932-3-iii@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-05-15s390/ipl: fix IPIB virtual vs physical address confusionAlexander Gordeev1-2/+1
The pointer to IPL Parameter Information Block is stored in the absolute lowcore for later use by dump tools. That pointer is a virtual address, though it should be physical instead. Note, this does not fix a real issue, since virtual and physical addresses are currently the same. Suggested-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-05-15s390/topology: honour nr_cpu_ids when adding CPUsAlexander Gordeev1-15/+17
When SMT thread CPUs are added to CPU masks the nr_cpu_ids limit is not checked and could be exceeded. This leads to a warning for example if CONFIG_DEBUG_PER_CPU_MAPS is set and the command line parameter nr_cpus is set to 1. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-05-15s390/defconfigs: set CONFIG_INIT_STACK_NONE=yHeiko Carstens3-0/+3
Set CONFIG_INIT_STACK_NONE=y in defconfigs to avoid the extra overhead of initializing all stack variables by default. Users who want to have that must change the configuration on their own. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-05-15s390/crypto: use vector instructions only if available for ChaCha20Heiko Carstens1-1/+1
Commit 349d03ffd5f6 ("crypto: s390 - add crypto library interface for ChaCha20") added a library interface to the s390 specific ChaCha20 implementation. However no check was added to verify if the required facilities are installed before branching into the assembler code. If compiled into the kernel, this will lead to the following crash, if vector instructions are not available: data exception: 0007 ilc:3 [#1] SMP Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc7+ #11 Hardware name: IBM 3931 A01 704 (KVM/Linux) Krnl PSW : 0704e00180000000 000000001857277a (chacha20_vx+0x32/0x818) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 0000037f0000000a ffffffffffffff60 000000008184b000 0000000019f5c8e6 0000000000000109 0000037fffb13c58 0000037fffb13c78 0000000019bb1780 0000037fffb13c58 0000000019f5c8e6 000000008184b000 0000000000000109 00000000802d8000 0000000000000109 0000000018571ebc 0000037fffb13718 Krnl Code: 000000001857276a: c07000b1f80b larl %r7,0000000019bb1780 0000000018572770: a708000a lhi %r0,10 #0000000018572774: e78950000c36 vlm %v24,%v25,0(%r5),0 >000000001857277a: e7a060000806 vl %v26,0(%r6),0 0000000018572780: e7bf70004c36 vlm %v27,%v31,0(%r7),4 0000000018572786: e70b00000456 vlr %v0,%v27 000000001857278c: e71800000456 vlr %v1,%v24 0000000018572792: e74b00000456 vlr %v4,%v27 Call Trace: [<000000001857277a>] chacha20_vx+0x32/0x818 Last Breaking-Event-Address: [<0000000018571eb6>] chacha20_crypt_s390.constprop.0+0x6e/0xd8 ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b Fix this by adding a missing MACHINE_HAS_VX check. Fixes: 349d03ffd5f6 ("crypto: s390 - add crypto library interface for ChaCha20") Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Cc: <stable@vger.kernel.org> # 5.19+ Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> [agordeev@linux.ibm.com: remove duplicates in commit message] Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-05-06s390: remove the unneeded select GCC12_NO_ARRAY_BOUNDSLukas Bulwahn1-1/+0
Commit 0da6e5fd6c37 ("gcc: disable '-Warray-bounds' for gcc-13 too") makes config GCC11_NO_ARRAY_BOUNDS to be for disabling -Warray-bounds in any gcc version 11 and upwards, and with that, removes the GCC12_NO_ARRAY_BOUNDS config as it is now covered by the semantics of GCC11_NO_ARRAY_BOUNDS. As GCC11_NO_ARRAY_BOUNDS is yes by default, there is no need for the s390 architecture to explicitly select GCC11_NO_ARRAY_BOUNDS. Hence, the select GCC12_NO_ARRAY_BOUNDS in arch/s390/Kconfig can simply be dropped. Remove the unneeded "select GCC12_NO_ARRAY_BOUNDS". Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-05Merge tag 'locking-core-2023-05-05' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Introduce local{,64}_try_cmpxchg() - a slightly more optimal primitive, which will be used in perf events ring-buffer code - Simplify/modify rwsems on PREEMPT_RT, to address writer starvation - Misc cleanups/fixes * tag 'locking-core-2023-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/atomic: Correct (cmp)xchg() instrumentation locking/x86: Define arch_try_cmpxchg_local() locking/arch: Wire up local_try_cmpxchg() locking/generic: Wire up local{,64}_try_cmpxchg() locking/atomic: Add generic try_cmpxchg{,64}_local() support locking/rwbase: Mitigate indefinite writer starvation locking/arch: Rename all internal __xchg() names to __arch_xchg()
2023-05-05Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds3-21/+23
Pull more kvm updates from Paolo Bonzini: "This includes the 6.4 changes for RISC-V, and a few bugfix patches for other architectures. For x86, this closes a longstanding performance issue in the newer and (usually) more scalable page table management code. RISC-V: - ONE_REG interface to enable/disable SBI extensions - Zbb extension for Guest/VM - AIA CSR virtualization x86: - Fix a long-standing TDP MMU flaw, where unloading roots on a vCPU can result in the root being freed even though the root is completely valid and can be reused as-is (with a TLB flush). s390: - A couple of bugfixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: s390: fix race in gmap_make_secure() KVM: s390: pv: fix asynchronous teardown for small VMs KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated RISC-V: KVM: Virtualize per-HART AIA CSRs RISC-V: KVM: Use bitmap for irqs_pending and irqs_pending_mask RISC-V: KVM: Add ONE_REG interface for AIA CSRs RISC-V: KVM: Implement subtype for CSR ONE_REG interface RISC-V: KVM: Initial skeletal support for AIA RISC-V: KVM: Drop the _MASK suffix from hgatp.VMID mask defines RISC-V: Detect AIA CSRs from ISA string RISC-V: Add AIA related CSR defines RISC-V: KVM: Allow Zbb extension for Guest/VM RISC-V: KVM: Add ONE_REG interface to enable/disable SBI extensions RISC-V: KVM: Alphabetize selects KVM: RISC-V: Retry fault if vma_lookup() results become invalid
2023-05-05Merge tag 'kvm-s390-next-6.4-2' of ↵Paolo Bonzini3-21/+23
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD For 6.4
2023-05-04Merge tag 'mm-stable-2023-05-03-16-22' of ↵Linus Torvalds1-19/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - Some DAMON cleanups from Kefeng Wang - Some KSM work from David Hildenbrand, to make the PR_SET_MEMORY_MERGE ioctl's behavior more similar to KSM's behavior. [ Andrew called these "final", but I suspect we'll have a series fixing up the fact that the last commit in the dmapools series in the previous pull seems to have unintentionally just reverted all the other commits in the same series.. - Linus ] * tag 'mm-stable-2023-05-03-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: hwpoison: coredump: support recovery from dump_user_range() mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page() mm/ksm: move disabling KSM from s390/gmap code to KSM code selftests/ksm: ksm_functional_tests: add prctl unmerge test mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0 mm/damon/paddr: fix missing folio_sz update in damon_pa_young() mm/damon/paddr: minor refactor of damon_pa_mark_accessed_or_deactivate() mm/damon/paddr: minor refactor of damon_pa_pageout()
2023-05-04KVM: s390: fix race in gmap_make_secure()Claudio Imbrenda1-21/+11
Fix a potential race in gmap_make_secure() and remove the last user of follow_page() without FOLL_GET. The old code is locking something it doesn't have a reference to, and as explained by Jason and David in this discussion: https://lore.kernel.org/linux-mm/Y9J4P%2FRNvY1Ztn0Q@nvidia.com/ it can lead to all kind of bad things, including the page getting unmapped (MADV_DONTNEED), freed, reallocated as a larger folio and the unlock_page() would target the wrong bit. There is also another race with the FOLL_WRITE, which could race between the follow_page() and the get_locked_pte(). The main point is to remove the last use of follow_page() without FOLL_GET or FOLL_PIN, removing the races can be considered a nice bonus. Link: https://lore.kernel.org/linux-mm/Y9J4P%2FRNvY1Ztn0Q@nvidia.com/ Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Fixes: 214d9bbcd3a6 ("s390/mm: provide memory management functions for protected KVM guests") Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-Id: <20230428092753.27913-2-imbrenda@linux.ibm.com>
2023-05-04KVM: s390: pv: fix asynchronous teardown for small VMsClaudio Imbrenda2-0/+12
On machines without the Destroy Secure Configuration Fast UVC, the topmost level of page tables is set aside and freed asynchronously as last step of the asynchronous teardown. Each gmap has a host_to_guest radix tree mapping host (userspace) addresses (with 1M granularity) to gmap segment table entries (pmds). If a guest is smaller than 2GB, the topmost level of page tables is the segment table (i.e. there are only 2 levels). Replacing it means that the pointers in the host_to_guest mapping would become stale and cause all kinds of nasty issues. This patch fixes the issue by disallowing asynchronous teardown for guests with only 2 levels of page tables. Userspace should (and already does) try using the normal destroy if the asynchronous one fails. Update s390_replace_asce so it refuses to replace segment type ASCEs. This is still needed in case the normal destroy VM fails. Fixes: fb491d5500a7 ("KVM: s390: pv: asynchronous destroy for reboot") Reviewed-by: Marc Hartmayer <mhartmay@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-Id: <20230421085036.52511-2-imbrenda@linux.ibm.com>
2023-05-03mm/ksm: move disabling KSM from s390/gmap code to KSM codeDavid Hildenbrand1-19/+1
Let's factor out actual disabling of KSM. The existing "mm->def_flags &= ~VM_MERGEABLE;" was essentially a NOP and can be dropped, because def_flags should never include VM_MERGEABLE. Note that we don't currently prevent re-enabling KSM. This should now be faster in case KSM was never enabled, because we only conditionally iterate all VMAs. Further, it certainly looks cleaner. Link: https://lkml.kernel.org/r/20230422210156.33630-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Acked-by: Stefan Roesch <shr@devkernel.io> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds4-28/+35
Pull kvm updates from Paolo Bonzini: "s390: - More phys_to_virt conversions - Improvement of AP management for VSIE (nested virtualization) ARM64: - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements. x86: - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled, and by giving the guest control of CR0.WP when EPT is enabled on VMX (VMX-only because SVM doesn't support per-bit controls) - Add CR0/CR4 helpers to query single bits, and clean up related code where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return as a bool - Move AMD_PSFD to cpufeatures.h and purge KVM's definition - Avoid unnecessary writes+flushes when the guest is only adding new PTEs - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations when emulating invalidations - Clean up the range-based flushing APIs - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry - Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() - Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features - Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES - Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest - AMD SVM: - Add support for virtual NMIs - Fixes for edge cases related to virtual interrupts - Intel AMX: - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is not being reported due to userspace not opting in via prctl() - Fix a bug in emulation of ENCLS in compatibility mode - Allow emulation of NOP and PAUSE for L2 - AMX selftests improvements - Misc cleanups MIPS: - Constify MIPS's internal callbacks (a leftover from the hardware enabling rework that landed in 6.3) Generic: - Drop unnecessary casts from "void *" throughout kvm_main.c - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct size by 8 bytes on 64-bit kernels by utilizing a padding hole Documentation: - Fix goof introduced by the conversion to rST" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits) KVM: s390: pci: fix virtual-physical confusion on module unload/load KVM: s390: vsie: clarifications on setting the APCB KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init() KVM: selftests: Test the PMU event "Instructions retired" KVM: selftests: Copy full counter values from guest in PMU event filter test KVM: selftests: Use error codes to signal errors in PMU event filter test KVM: selftests: Print detailed info in PMU event filter asserts KVM: selftests: Add helpers for PMC asserts in PMU event filter test KVM: selftests: Add a common helper for the PMU event filter guest code KVM: selftests: Fix spelling mistake "perrmited" -> "permitted" KVM: arm64: vhe: Drop extra isb() on guest exit KVM: arm64: vhe: Synchronise with page table walker on MMU update KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc() KVM: arm64: nvhe: Synchronise with page table walker on TLBI KVM: arm64: Handle 32bit CNTPCTSS traps KVM: arm64: nvhe: Synchronise with page table walker on vcpu run KVM: arm64: vgic: Don't acquire its_lock before config_lock KVM: selftests: Add test to verify KVM's supported XCR0 ...
2023-04-30Merge tag 's390-6.4-1' of ↵Linus Torvalds75-1887/+1971
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Vasily Gorbik: - Add support for stackleak feature. Also allow specifying architecture-specific stackleak poison function to enable faster implementation. On s390, the mvc-based implementation helps decrease typical overhead from a factor of 3 to just 25% - Convert all assembler files to use SYM* style macros, deprecating the ENTRY() macro and other annotations. Select ARCH_USE_SYM_ANNOTATIONS - Improve KASLR to also randomize module and special amode31 code base load addresses - Rework decompressor memory tracking to support memory holes and improve error handling - Add support for protected virtualization AP binding - Add support for set_direct_map() calls - Implement set_memory_rox() and noexec module_alloc() - Remove obsolete overriding of mem*() functions for KASAN - Rework kexec/kdump to avoid using nodat_stack to call purgatory - Convert the rest of the s390 code to use flexible-array member instead of a zero-length array - Clean up uaccess inline asm - Enable ARCH_HAS_MEMBARRIER_SYNC_CORE - Convert to using CONFIG_FUNCTION_ALIGNMENT and enable DEBUG_FORCE_FUNCTION_ALIGN_64B - Resolve last_break in userspace fault reports - Simplify one-level sysctl registration - Clean up branch prediction handling - Rework CPU counter facility to retrieve available counter sets just once - Other various small fixes and improvements all over the code * tag 's390-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (118 commits) s390/stackleak: provide fast __stackleak_poison() implementation stackleak: allow to specify arch specific stackleak poison function s390: select ARCH_USE_SYM_ANNOTATIONS s390/mm: use VM_FLUSH_RESET_PERMS in module_alloc() s390: wire up memfd_secret system call s390/mm: enable ARCH_HAS_SET_DIRECT_MAP s390/mm: use BIT macro to generate SET_MEMORY bit masks s390/relocate_kernel: adjust indentation s390/relocate_kernel: use SYM* macros instead of ENTRY(), etc. s390/entry: use SYM* macros instead of ENTRY(), etc. s390/purgatory: use SYM* macros instead of ENTRY(), etc. s390/kprobes: use SYM* macros instead of ENTRY(), etc. s390/reipl: use SYM* macros instead of ENTRY(), etc. s390/head64: use SYM* macros instead of ENTRY(), etc. s390/earlypgm: use SYM* macros instead of ENTRY(), etc. s390/mcount: use SYM* macros instead of ENTRY(), etc. s390/crc32le: use SYM* macros instead of ENTRY(), etc. s390/crc32be: use SYM* macros instead of ENTRY(), etc. s390/crypto,chacha: use SYM* macros instead of ENTRY(), etc. s390/amode31: use SYM* macros instead of ENTRY(), etc. ...
2023-04-29locking/arch: Rename all internal __xchg() names to __arch_xchg()Andrzej Hajda1-4/+4
Decrease the probability of this internal facility to be used by driver code. Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Palmer Dabbelt <palmer@rivosinc.com> [riscv] Link: https://lore.kernel.org/r/20230118154450.73842-1-andrzej.hajda@intel.com Cc: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-29Merge tag 'smp-core-2023-04-27' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull SMP cross-CPU function-call updates from Ingo Molnar: - Remove diagnostics and adjust config for CSD lock diagnostics - Add a generic IPI-sending tracepoint, as currently there's no easy way to instrument IPI origins: it's arch dependent and for some major architectures it's not even consistently available. * tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: trace,smp: Trace all smp_function_call*() invocations trace: Add trace_ipi_send_cpu() sched, smp: Trace smp callback causing an IPI smp: reword smp call IPI comment treewide: Trace IPIs sent via smp_send_reschedule() irq_work: Trace self-IPIs sent via arch_irq_work_raise() smp: Trace IPIs sent via arch_send_call_function_ipi_mask() sched, smp: Trace IPIs sent via send_call_function_single_ipi() trace: Add trace_ipi_send_cpumask() kernel/smp: Make csdlock_debug= resettable locking/csd_lock: Remove per-CPU data indirection from CSD lock debugging locking/csd_lock: Remove added data from CSD lock debugging locking/csd_lock: Add Kconfig option for csd_debug default
2023-04-29Merge tag 'objtool-core-2023-04-27' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did this inconsistently follow this new, common convention, and fix all the fallout that objtool can now detect statically - Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code - Generate ORC data for __pfx code - Add more __noreturn annotations to various kernel startup/shutdown and panic functions - Misc improvements & fixes * tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/hyperv: Mark hv_ghcb_terminate() as noreturn scsi: message: fusion: Mark mpt_halt_firmware() __noreturn x86/cpu: Mark {hlt,resume}_play_dead() __noreturn btrfs: Mark btrfs_assertfail() __noreturn objtool: Include weak functions in global_noreturns check cpu: Mark nmi_panic_self_stop() __noreturn cpu: Mark panic_smp_self_stop() __noreturn arm64/cpu: Mark cpu_park_loop() and friends __noreturn x86/head: Mark *_start_kernel() __noreturn init: Mark start_kernel() __noreturn init: Mark [arch_call_]rest_init() __noreturn objtool: Generate ORC data for __pfx code x86/linkage: Fix padding for typed functions objtool: Separate prefix code from stack validation code objtool: Remove superfluous dead_end_function() check objtool: Add symbol iteration helpers objtool: Add WARN_INSN() scripts/objdump-func: Support multiple functions context_tracking: Fix KCSAN noinstr violation objtool: Add stackleak instrumentation to uaccess safe list ...
2023-04-28Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds6-8/+42
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-28Merge tag 'modules-6.4-rc1' of ↵Linus Torvalds1-12/+14
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull module updates from Luis Chamberlain: "The summary of the changes for this pull requests is: - Song Liu's new struct module_memory replacement - Nick Alcock's MODULE_LICENSE() removal for non-modules - My cleanups and enhancements to reduce the areas where we vmalloc module memory for duplicates, and the respective debug code which proves the remaining vmalloc pressure comes from userspace. Most of the changes have been in linux-next for quite some time except the minor fixes I made to check if a module was already loaded prior to allocating the final module memory with vmalloc and the respective debug code it introduces to help clarify the issue. Although the functional change is small it is rather safe as it can only *help* reduce vmalloc space for duplicates and is confirmed to fix a bootup issue with over 400 CPUs with KASAN enabled. I don't expect stable kernels to pick up that fix as the cleanups would have also had to have been picked up. Folks on larger CPU systems with modules will want to just upgrade if vmalloc space has been an issue on bootup. Given the size of this request, here's some more elaborate details: The functional change change in this pull request is the very first patch from Song Liu which replaces the 'struct module_layout' with a new 'struct module_memory'. The old data structure tried to put together all types of supported module memory types in one data structure, the new one abstracts the differences in memory types in a module to allow each one to provide their own set of details. This paves the way in the future so we can deal with them in a cleaner way. If you look at changes they also provide a nice cleanup of how we handle these different memory areas in a module. This change has been in linux-next since before the merge window opened for v6.3 so to provide more than a full kernel cycle of testing. It's a good thing as quite a bit of fixes have been found for it. Jason Baron then made dynamic debug a first class citizen module user by using module notifier callbacks to allocate / remove module specific dynamic debug information. Nick Alcock has done quite a bit of work cross-tree to remove module license tags from things which cannot possibly be module at my request so to: a) help him with his longer term tooling goals which require a deterministic evaluation if a piece a symbol code could ever be part of a module or not. But quite recently it is has been made clear that tooling is not the only one that would benefit. Disambiguating symbols also helps efforts such as live patching, kprobes and BPF, but for other reasons and R&D on this area is active with no clear solution in sight. b) help us inch closer to the now generally accepted long term goal of automating all the MODULE_LICENSE() tags from SPDX license tags In so far as a) is concerned, although module license tags are a no-op for non-modules, tools which would want create a mapping of possible modules can only rely on the module license tag after the commit 8b41fc4454e ("kbuild: create modules.builtin without Makefile.modbuiltin or tristate.conf"). Nick has been working on this *for years* and AFAICT I was the only one to suggest two alternatives to this approach for tooling. The complexity in one of my suggested approaches lies in that we'd need a possible-obj-m and a could-be-module which would check if the object being built is part of any kconfig build which could ever lead to it being part of a module, and if so define a new define -DPOSSIBLE_MODULE [0]. A more obvious yet theoretical approach I've suggested would be to have a tristate in kconfig imply the same new -DPOSSIBLE_MODULE as well but that means getting kconfig symbol names mapping to modules always, and I don't think that's the case today. I am not aware of Nick or anyone exploring either of these options. Quite recently Josh Poimboeuf has pointed out that live patching, kprobes and BPF would benefit from resolving some part of the disambiguation as well but for other reasons. The function granularity KASLR (fgkaslr) patches were mentioned but Joe Lawrence has clarified this effort has been dropped with no clear solution in sight [1]. In the meantime removing module license tags from code which could never be modules is welcomed for both objectives mentioned above. Some developers have also welcomed these changes as it has helped clarify when a module was never possible and they forgot to clean this up, and so you'll see quite a bit of Nick's patches in other pull requests for this merge window. I just picked up the stragglers after rc3. LWN has good coverage on the motivation behind this work [2] and the typical cross-tree issues he ran into along the way. The only concrete blocker issue he ran into was that we should not remove the MODULE_LICENSE() tags from files which have no SPDX tags yet, even if they can never be modules. Nick ended up giving up on his efforts due to having to do this vetting and backlash he ran into from folks who really did *not understand* the core of the issue nor were providing any alternative / guidance. I've gone through his changes and dropped the patches which dropped the module license tags where an SPDX license tag was missing, it only consisted of 11 drivers. To see if a pull request deals with a file which lacks SPDX tags you can just use: ./scripts/spdxcheck.py -f \ $(git diff --name-only commid-id | xargs echo) You'll see a core module file in this pull request for the above, but that's not related to his changes. WE just need to add the SPDX license tag for the kernel/module/kmod.c file in the future but it demonstrates the effectiveness of the script. Most of Nick's changes were spread out through different trees, and I just picked up the slack after rc3 for the last kernel was out. Those changes have been in linux-next for over two weeks. The cleanups, debug code I added and final fix I added for modules were motivated by David Hildenbrand's report of boot failing on a systems with over 400 CPUs when KASAN was enabled due to running out of virtual memory space. Although the functional change only consists of 3 lines in the patch "module: avoid allocation if module is already present and ready", proving that this was the best we can do on the modules side took quite a bit of effort and new debug code. The initial cleanups I did on the modules side of things has been in linux-next since around rc3 of the last kernel, the actual final fix for and debug code however have only been in linux-next for about a week or so but I think it is worth getting that code in for this merge window as it does help fix / prove / evaluate the issues reported with larger number of CPUs. Userspace is not yet fixed as it is taking a bit of time for folks to understand the crux of the issue and find a proper resolution. Worst come to worst, I have a kludge-of-concept [3] of how to make kernel_read*() calls for modules unique / converge them, but I'm currently inclined to just see if userspace can fix this instead" Link: https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/ [0] Link: https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com [1] Link: https://lwn.net/Articles/927569/ [2] Link: https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org [3] * tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (121 commits) module: add debugging auto-load duplicate module support module: stats: fix invalid_mod_bytes typo module: remove use of uninitialized variable len module: fix building stats for 32-bit targets module: stats: include uapi/linux/module.h module: avoid allocation if module is already present and ready module: add debug stats to help identify memory pressure module: extract patient module check into helper modules/kmod: replace implementation with a semaphore Change DEFINE_SEMAPHORE() to take a number argument module: fix kmemleak annotations for non init ELF sections module: Ignore L0 and rename is_arm_mapping_symbol() module: Move is_arm_mapping_symbol() to module_symbol.h module: Sync code of is_arm_mapping_symbol() scripts/gdb: use mem instead of core_layout to get the module address interconnect: remove module-related code interconnect: remove MODULE_LICENSE in non-modules zswap: remove MODULE_LICENSE in non-modules zpool: remove MODULE_LICENSE in non-modules x86/mm/dump_pagetables: remove MODULE_LICENSE in non-modules ...
2023-04-27Merge tag 'driver-core-6.4-rc1' of ↵Linus Torvalds2-4/+19
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the large set of driver core changes for 6.4-rc1. Once again, a busy development cycle, with lots of changes happening in the driver core in the quest to be able to move "struct bus" and "struct class" into read-only memory, a task now complete with these changes. This will make the future rust interactions with the driver core more "provably correct" as well as providing more obvious lifetime rules for all busses and classes in the kernel. The changes required for this did touch many individual classes and busses as many callbacks were changed to take const * parameters instead. All of these changes have been submitted to the various subsystem maintainers, giving them plenty of time to review, and most of them actually did so. Other than those changes, included in here are a small set of other things: - kobject logging improvements - cacheinfo improvements and updates - obligatory fw_devlink updates and fixes - documentation updates - device property cleanups and const * changes - firwmare loader dependency fixes. All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits) device property: make device_property functions take const device * driver core: update comments in device_rename() driver core: Don't require dynamic_debug for initcall_debug probe timing firmware_loader: rework crypto dependencies firmware_loader: Strip off \n from customized path zram: fix up permission for the hot_add sysfs file cacheinfo: Add use_arch[|_cache]_info field/function arch_topology: Remove early cacheinfo error message if -ENOENT cacheinfo: Check cache properties are present in DT cacheinfo: Check sib_leaf in cache_leaves_are_shared() cacheinfo: Allow early level detection when DT/ACPI info is missing/broken cacheinfo: Add arm64 early level initializer implementation cacheinfo: Add arch specific early level initializer tty: make tty_class a static const structure driver core: class: remove struct class_interface * from callbacks driver core: class: mark the struct class in struct class_interface constant driver core: class: make class_register() take a const * driver core: class: mark class_release() as taking a const * driver core: remove incorrect comment for device_create* MIPS: vpe-cmp: remove module owner pointer from struct class usage. ...
2023-04-27Merge tag 'net-next-6.4' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the default value allows for better BIG TCP performances - Reduce compound page head access for zero-copy data transfers - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible - Threaded NAPI improvements, adding defer skb free support and unneeded softirq avoidance - Address dst_entry reference count scalability issues, via false sharing avoidance and optimize refcount tracking - Add lockless accesses annotation to sk_err[_soft] - Optimize again the skb struct layout - Extends the skb drop reasons to make it usable by multiple subsystems - Better const qualifier awareness for socket casts BPF: - Add skb and XDP typed dynptrs which allow BPF programs for more ergonomic and less brittle iteration through data and variable-sized accesses - Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward - Add more precise memory usage reporting for all BPF map types - Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params - Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton - Bigger batch of BPF verifier improvements to prepare for upcoming BPF open-coded iterators allowing for less restrictive looping capabilities - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF programs to NULL-check before passing such pointers into kfunc - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in local storage maps - Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps - Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree - Add BPF verifier support for ST instructions in convert_ctx_access() which will help new -mcpu=v4 clang flag to start emitting them - Add ARM32 USDT support to libbpf - Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations Protocols: - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value indicates the provenance of the IP address - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition - Add the handshake upcall mechanism, allowing the user-space to implement generic TLS handshake on kernel's behalf - Bridge: support per-{Port, VLAN} neighbor suppression, increasing resilience to nodes failures - SCTP: add support for Fair Capacity and Weighted Fair Queueing schedulers - MPTCP: delay first subflow allocation up to its first usage. This will allow for later better LSM interaction - xfrm: Remove inner/outer modes from input/output path. These are not needed anymore - WiFi: - reduced neighbor report (RNR) handling for AP mode - HW timestamping support - support for randomized auth/deauth TA for PASN privacy - per-link debugfs for multi-link - TC offload support for mac80211 drivers - mac80211 mesh fast-xmit and fast-rx support - enable Wi-Fi 7 (EHT) mesh support Netfilter: - Add nf_tables 'brouting' support, to force a packet to be routed instead of being bridged - Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header. This is needed for BIT TCP support - The iptables 32bit compat interface isn't compiled in by default anymore - Move ip(6)tables builtin icmp matches to the udptcp one. This has the advantage that icmp/icmpv6 match doesn't load the iptables/ip6tables modules anymore when iptables-nft is used - Extended netlink error report for netdevice in flowtables and netdev/chains. Allow for incrementally add/delete devices to netdev basechain. Allow to create netdev chain without device Driver API: - Remove redundant Device Control Error Reporting Enable, as PCI core has already error reporting enabled at enumeration time - Move Multicast DB netlink handlers to core, allowing devices other then bridge to use them - Allow the page_pool to directly recycle the pages from safely localized NAPI - Implement lockless TX queue stop/wake combo macros, allowing for further code de-duplication and sanitization - Add YNL support for user headers and struct attrs - Add partial YNL specification for devlink - Add partial YNL specification for ethtool - Add tc-mqprio and tc-taprio support for preemptible traffic classes - Add tx push buf len param to ethtool, specifies the maximum number of bytes of a transmitted packet a driver can push directly to the underlying device - Add basic LED support for switch/phy - Add NAPI documentation, stop relaying on external links - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory work to make the hardware timestamping layer selectable by user space - Add transceiver support and improve the error messages for CAN-FD controllers New hardware / drivers: - Ethernet: - AMD/Pensando core device support - MediaTek MT7981 SoC - MediaTek MT7988 SoC - Broadcom BCM53134 embedded switch - Texas Instruments CPSW9G ethernet switch - Qualcomm EMAC3 DWMAC ethernet - StarFive JH7110 SoC - NXP CBTX ethernet PHY - WiFi: - Apple M1 Pro/Max devices - RealTek rtl8710bu/rtl8188gu - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset - Bluetooth: - Realtek RTL8821CS, RTL8851B, RTL8852BS - Mediatek MT7663, MT7922 - NXP w8997 - Actions Semi ATS2851 - QTI WCN6855 - Marvell 88W8997 - Can: - STMicroelectronics bxcan stm32f429 Drivers: - Ethernet NICs: - Intel (1G, icg): - add tracking and reporting of QBV config errors - add support for configuring max SDU for each Tx queue - Intel (100G, ice): - refactor mailbox overflow detection to support Scalable IOV - GNSS interface optimization - Intel (i40e): - support XDP multi-buffer - nVidia/Mellanox: - add the support for linux bridge multicast offload - enable TC offload for egress and engress MACVLAN over bond - add support for VxLAN GBP encap/decap flows offload - extend packet offload to fully support libreswan - support tunnel mode in mlx5 IPsec packet offload - extend XDP multi-buffer support - support MACsec VLAN offload - add support for dynamic msix vectors allocation - drop RX page_cache and fully use page_pool - implement thermal zone to report NIC temperature - Netronome/Corigine: - add support for multi-zone conntrack offload - Solarflare/Xilinx: - support offloading TC VLAN push/pop actions to the MAE - support TC decap rules - support unicast PTP - Other NICs: - Broadcom (bnxt): enforce software based freq adjustments only on shared PHC NIC - RealTek (r8169): refactor to addess ASPM issues during NAPI poll - Micrel (lan8841): add support for PTP_PF_PEROUT - Cadence (macb): enable PTP unicast - Engleder (tsnep): add XDP socket zero-copy support - virtio-net: implement exact header length guest feature - veth: add page_pool support for page recycling - vxlan: add MDB data path support - gve: add XDP support for GQI-QPL format - geneve: accept every ethertype - macvlan: allow some packets to bypass broadcast queue - mana: add support for jumbo frame - Ethernet high-speed switches: - Microchip (sparx5): Add support for TC flower templates - Ethernet embedded switches: - Broadcom (b54): - configure 6318 and 63268 RGMII ports - Marvell (mv88e6xxx): - faster C45 bus scan - Microchip: - lan966x: - add support for IS1 VCAP - better TX/RX from/to CPU performances - ksz9477: add ETS Qdisc support - ksz8: enhance static MAC table operations and error handling - sama7g5: add PTP capability - NXP (ocelot): - add support for external ports - add support for preemptible traffic classes - Texas Instruments: - add CPSWxG SGMII support for J7200 and J721E - Intel WiFi (iwlwifi): - preparation for Wi-Fi 7 EHT and multi-link support - EHT (Wi-Fi 7) sniffer support - hardware timestamping support for some devices/firwmares - TX beacon protection on newer hardware - Qualcomm 802.11ax WiFi (ath11k): - MU-MIMO parameters support - ack signal support for management packets - RealTek WiFi (rtw88): - SDIO bus support - better support for some SDIO devices (e.g. MAC address from efuse) - RealTek WiFi (rtw89): - HW scan support for 8852b - better support for 6 GHz scanning - support for various newer firmware APIs - framework firmware backwards compatibility - MediaTek WiFi (mt76): - P2P support - mesh A-MSDU support - EHT (Wi-Fi 7) support - coredump support" * tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits) net: phy: hide the PHYLIB_LEDS knob net: phy: marvell-88x2222: remove unnecessary (void*) conversions tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. net: amd: Fix link leak when verifying config failed net: phy: marvell: Fix inconsistent indenting in led_blink_set lan966x: Don't use xdp_frame when action is XDP_TX tsnep: Add XDP socket zero-copy TX support tsnep: Add XDP socket zero-copy RX support tsnep: Move skb receive action to separate function tsnep: Add functions for queue enable/disable tsnep: Rework TX/RX queue initialization tsnep: Replace modulo operation with mask net: phy: dp83867: Add led_brightness_set support net: phy: Fix reading LED reg property drivers: nfc: nfcsim: remove return value check of `dev_dir` net: phy: dp83867: Remove unnecessary (void*) conversions net: ethtool: coalesce: try to make user settings stick twice net: mana: Check if netdev/napi_alloc_frag returns single page net: mana: Rename mana_refill_rxoob and remove some empty lines net: veth: add page_pool stats ...
2023-04-26Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linuxLinus Torvalds1-0/+2
Pull block updates from Jens Axboe: - drbd patches, bringing us closer to unifying the out-of-tree version and the in tree one (Andreas, Christoph) - support for auto-quiesce for the s390 dasd driver (Stefan) - MD pull request via Song: - md/bitmap: Optimal last page size (Jon Derrick) - Various raid10 fixes (Yu Kuai, Li Nan) - md: add error_handlers for raid0 and linear (Mariusz Tkaczyk) - NVMe pull request via Christoph: - Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - Validate nvmet module parameters (Chaitanya Kulkarni) - Fence TCP socket on receive error (Chris Leech) - Fix async event trace event (Keith Busch) - Minor cleanups (Chaitanya Kulkarni, zhenwei pi) - Fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - Fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - Fix irq locking in nvme-fcloop (Ming Lei) - Remove queue mapping helper for rdma devices (Sagi Grimberg) - use structured request attribute checks for nbd (Jakub) - fix blk-crypto race conditions between keyslot management (Eric) - add sed-opal support for reading read locking range attributes (Ondrej) - make fault injection configurable for null_blk (Akinobu) - clean up the request insertion API (Christoph) - clean up the queue running API (Christoph) - blkg config helper cleanups (Tejun) - lazy init support for blk-iolatency (Tejun) - various fixes and tweaks to ublk (Ming) - remove hybrid polling. It hasn't really been useful since we got async polled IO support, and these days we don't support sync polled IO at all (Keith) - misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming, Chaitanya, me) * tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits) nbd: fix incomplete validation of ioctl arg ublk: don't return 0 in case of any failure sed-opal: geometry feature reporting command null_blk: Always check queue mode setting from configfs block: ublk: switch to ioctl command encoding blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush block, bfq: Fix division by zero error on zero wsum fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m block: store bdev->bd_disk->fops->submit_bio state in bdev block: re-arrange the struct block_device fields for better layout md/raid5: remove unused working_disks variable md/raid10: don't call bio_start_io_acct twice for bio which experienced read error md/raid10: fix memleak of md thread md/raid10: fix memleak for 'conf->bio_split' md/raid10: fix leak of 'r10bio->remaining' for recovery md/raid10: don't BUG_ON() in raise_barrier() md: fix soft lockup in status_resync md: add error_handlers for raid0 and linear md: Use optimal I/O size for last bitmap page md: Fix types in sb writer ...
2023-04-26Merge tag 'kvmarm-6.4' of ↵Paolo Bonzini7-34/+32
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.4 - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements.
2023-04-26Merge tag 'kvm-s390-next-6.4-1' of ↵Paolo Bonzini3-24/+32
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD Minor cleanup: - phys_to_virt conversion - Improvement of VSIE AP management
2023-04-25Merge tag 'arm64-upstream' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "ACPI: - Improve error reporting when failing to manage SDEI on AGDI device removal Assembly routines: - Improve register constraints so that the compiler can make use of the zero register instead of moving an immediate #0 into a GPR - Allow the compiler to allocate the registers used for CAS instructions CPU features and system registers: - Cleanups to the way in which CPU features are identified from the ID register fields - Extend system register definition generation to handle Enum types when defining shared register fields - Generate definitions for new _EL2 registers and add new fields for ID_AA64PFR1_EL1 - Allow SVE to be disabled separately from SME on the kernel command-line Tracing: - Support for "direct calls" in ftrace, which enables BPF tracing for arm64 Kdump: - Don't bother unmapping the crashkernel from the linear mapping, which then allows us to use huge (block) mappings and reduce TLB pressure when a crashkernel is loaded. Memory management: - Try again to remove data cache invalidation from the coherent DMA allocation path - Simplify the fixmap code by mapping at page granularity - Allow the kfence pool to be allocated early, preventing the rest of the linear mapping from being forced to page granularity Perf and PMU: - Move CPU PMU code out to drivers/perf/ where it can be reused by the 32-bit ARM architecture when running on ARMv8 CPUs - Fix race between CPU PMU probing and pKVM host de-privilege - Add support for Apple M2 CPU PMU - Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event dynamically, depending on what the CPU actually supports - Minor fixes and cleanups to system PMU drivers Stack tracing: - Use the XPACLRI instruction to strip PAC from pointers, rather than rolling our own function in C - Remove redundant PAC removal for toolchains that handle this in their builtins - Make backtracing more resilient in the face of instrumentation Miscellaneous: - Fix single-step with KGDB - Remove harmless warning when 'nokaslr' is passed on the kernel command-line - Minor fixes and cleanups across the board" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits) KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege arm64: kexec: include reboot.h arm64: delete dead code in this_cpu_set_vectors() arm64/cpufeature: Use helper macro to specify ID register for capabilites drivers/perf: hisi: add NULL check for name drivers/perf: hisi: Remove redundant initialized of pmu->name arm64/cpufeature: Consistently use symbolic constants for min_field_value arm64/cpufeature: Pull out helper for CPUID register definitions arm64/sysreg: Convert HFGITR_EL2 to automatic generation ACPI: AGDI: Improve error reporting for problems during .remove() arm64: kernel: Fix kernel warning when nokaslr is passed to commandline perf/arm-cmn: Fix port detection for CMN-700 arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step arm64: move PAC masks to <asm/pointer_auth.h> arm64: use XPACLRI to strip PAC arm64: avoid redundant PAC stripping in __builtin_return_address() arm64/sme: Fix some comments of ARM SME arm64/signal: Alloc tpidr2 sigframe after checking system_supports_tpidr2() arm64/signal: Use system_supports_tpidr2() to check TPIDR2 arm64/idreg: Don't disable SME when disabling SVE ...
2023-04-25Merge tag 'timers-core-2023-04-24' of ↵Linus Torvalds2-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers and timekeeping updates from Thomas Gleixner: - Improve the VDSO build time checks to cover all dynamic relocations VDSO does not allow dynamic relocations, but the build time check is incomplete and fragile. It's based on architectures specifying the relocation types to search for and does not handle R_*_NONE relocation entries correctly. R_*_NONE relocations are injected by some GNU ld variants if they fail to determine the exact .rel[a]/dyn_size to cover trailing zeros. R_*_NONE relocations must be ignored by dynamic loaders, so they should be ignored in the build time check too. Remove the architecture specific relocation types to check for and validate strictly that no other relocations than R_*_NONE end up in the VSDO .so file. - Prefer signal delivery to the current thread for CLOCK_PROCESS_CPUTIME_ID based posix-timers Such timers prefer to deliver the signal to the main thread of a process even if the context in which the timer expires is the current task. This has the downside that it might wake up an idle thread. As there is no requirement or guarantee that the signal has to be delivered to the main thread, avoid this by preferring the current task if it is part of the thread group which shares sighand. This not only avoids waking idle threads, it also distributes the signal delivery in case of multiple timers firing in the context of different threads close to each other better. - Align the tick period properly (again) For a long time the tick was starting at CLOCK_MONOTONIC zero, which allowed users space applications to either align with the tick or to place a periodic computation so that it does not interfere with the tick. The alignement of the tick period was more by chance than by intention as the tick is set up before a high resolution clocksource is installed, i.e. timekeeping is still tick based and the tick period advances from there. The early enablement of sched_clock() broke this alignement as the time accumulated by sched_clock() is taken into account when timekeeping is initialized. So the base value now(CLOCK_MONOTONIC) is not longer a multiple of tick periods, which breaks applications which relied on that behaviour. Cure this by aligning the tick starting point to the next multiple of tick periods, i.e 1000ms/CONFIG_HZ. - A set of NOHZ fixes and enhancements: * Cure the concurrent writer race for idle and IO sleeptime statistics The statitic values which are exposed via /proc/stat are updated from the CPU local idle exit and remotely by cpufreq, but that happens without any form of serialization. As a consequence sleeptimes can be accounted twice or worse. Prevent this by restricting the accumulation writeback to the CPU local idle exit and let the remote access compute the accumulated value. * Protect idle/iowait sleep time with a sequence count Reading idle/iowait sleep time, e.g. from /proc/stat, can race with idle exit updates. As a consequence the readout may result in random and potentially going backwards values. Protect this by a sequence count, which fixes the idle time statistics issue, but cannot fix the iowait time problem because iowait time accounting races with remote wake ups decrementing the remote runqueues nr_iowait counter. The latter is impossible to fix, so the only way to deal with that is to document it properly and to remove the assertion in the selftest which triggers occasionally due to that. * Restructure struct tick_sched for better cache layout * Some small cleanups and a better cache layout for struct tick_sched - Implement the missing timer_wait_running() callback for POSIX CPU timers For unknown reason the introduction of the timer_wait_running() callback missed to fixup posix CPU timers, which went unnoticed for almost four years. While initially only targeted to prevent livelocks between a timer deletion and the timer expiry function on PREEMPT_RT enabled kernels, it turned out that fixing this for mainline is not as trivial as just implementing a stub similar to the hrtimer/timer callbacks. The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled systems there is a livelock issue independent of RT. CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU timers out from hard interrupt context to task work, which is handled before returning to user space or to a VM. The expiry mechanism moves the expired timers to a stack local list head with sighand lock held. Once sighand is dropped the task can be preempted and a task which wants to delete a timer will spin-wait until the expiry task is scheduled back in. In the worst case this will end up in a livelock when the preempting task and the expiry task are pinned on the same CPU. The timer wheel has a timer_wait_running() mechanism for RT, which uses a per CPU timer-base expiry lock which is held by the expiry code and the task waiting for the timer function to complete blocks on that lock. This does not work in the same way for posix CPU timers as there is no timer base and expiry for process wide timers can run on any task belonging to that process, but the concept of waiting on an expiry lock can be used too in a slightly different way. Add a per task mutex to struct posix_cputimers_work, let the expiry task hold it accross the expiry function and let the deleting task which waits for the expiry to complete block on the mutex. In the non-contended case this results in an extra mutex_lock()/unlock() pair on both sides. This avoids spin-waiting on a task which is scheduled out, prevents the livelock and cures the problem for RT and !RT systems * tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Implement the missing timer_wait_running callback selftests/proc: Assert clock_gettime(CLOCK_BOOTTIME) VS /proc/uptime monotonicity selftests/proc: Remove idle time monotonicity assertions MAINTAINERS: Remove stale email address timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick() timers/nohz: Add a comment about broken iowait counter update race timers/nohz: Protect idle/iowait sleep time under seqcount timers/nohz: Only ever update sleeptime from idle exit timers/nohz: Restructure and reshuffle struct tick_sched tick/common: Align tick period with the HZ tick. selftests/timers/posix_timers: Test delivery of signals across threads posix-timers: Prefer delivery of signals to the current thread vdso: Improve cmd_vdso_check to check all dynamic relocations
2023-04-24Merge tag 'rcu.6.4.april5.2023.3' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux Pull RCU updates from Joel Fernandes: - Updates and additions to MAINTAINERS files, with Boqun being added to the RCU entry and Zqiang being added as an RCU reviewer. I have also transitioned from reviewer to maintainer; however, Paul will be taking over sending RCU pull-requests for the next merge window. - Resolution of hotplug warning in nohz code, achieved by fixing cpu_is_hotpluggable() through interaction with the nohz subsystem. Tick dependency modifications by Zqiang, focusing on fixing usage of the TICK_DEP_BIT_RCU_EXP bitmask. - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n kernels, fixed by Zqiang. - Improvements to rcu-tasks stall reporting by Neeraj. - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for increased robustness, affecting several components like mac802154, drbd, vmw_vmci, tracing, and more. A report by Eric Dumazet showed that the API could be unknowingly used in an atomic context, so we'd rather make sure they know what they're asking for by being explicit: https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/ - Documentation updates, including corrections to spelling, clarifications in comments, and improvements to the srcu_size_state comments. - Better srcu_struct cache locality for readers, by adjusting the size of srcu_struct in support of SRCU usage by Christoph Hellwig. - Teach lockdep to detect deadlocks between srcu_read_lock() vs synchronize_srcu() contributed by Boqun. Previously lockdep could not detect such deadlocks, now it can. - Integration of rcutorture and rcu-related tools, targeted for v6.4 from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis module parameter, and more - Miscellaneous changes, various code cleanups and comment improvements * tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits) checkpatch: Error out if deprecated RCU API used mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep() rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep() ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep() net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep() net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep() lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep() tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep() misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep() drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep() rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan() rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early rcu: Remove never-set needwake assignment from rcu_report_qs_rdp() rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race rcu/trace: use strscpy() to instead of strncpy() tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem ...
2023-04-22Merge tag 'for-netdev' of ↵Jakub Kicinski1-0/+5
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-04-21 We've added 71 non-merge commits during the last 8 day(s) which contain a total of 116 files changed, 13397 insertions(+), 8896 deletions(-). The main changes are: 1) Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward, from Florian Westphal. 2) Fix race between btf_put and btf_idr walk which caused a deadlock, from Alexei Starovoitov. 3) Second big batch to migrate test_verifier unit tests into test_progs for ease of readability and debugging, from Eduard Zingerman. 4) Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree, from Dave Marchevsky. 5) Migrate bpf_for(), bpf_for_each() and bpf_repeat() macros from BPF selftests into libbpf-provided bpf_helpers.h header and improve kfunc handling, from Andrii Nakryiko. 6) Support 64-bit pointers to kfuncs needed for archs like s390x, from Ilya Leoshkevich. 7) Support BPF progs under getsockopt with a NULL optval, from Stanislav Fomichev. 8) Improve verifier u32 scalar equality checking in order to enable LLVM transformations which earlier had to be disabled specifically for BPF backend, from Yonghong Song. 9) Extend bpftool's struct_ops object loading to support links, from Kui-Feng Lee. 10) Add xsk selftest follow-up fixes for hugepage allocated umem, from Magnus Karlsson. 11) Support BPF redirects from tc BPF to ifb devices, from Daniel Borkmann. 12) Add BPF support for integer type when accessing variable length arrays, from Feng Zhou. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (71 commits) selftests/bpf: verifier/value_ptr_arith converted to inline assembly selftests/bpf: verifier/value_illegal_alu converted to inline assembly selftests/bpf: verifier/unpriv converted to inline assembly selftests/bpf: verifier/subreg converted to inline assembly selftests/bpf: verifier/spin_lock converted to inline assembly selftests/bpf: verifier/sock converted to inline assembly selftests/bpf: verifier/search_pruning converted to inline assembly selftests/bpf: verifier/runtime_jit converted to inline assembly selftests/bpf: verifier/regalloc converted to inline assembly selftests/bpf: verifier/ref_tracking converted to inline assembly selftests/bpf: verifier/map_ptr_mixing converted to inline assembly selftests/bpf: verifier/map_in_map converted to inline assembly selftests/bpf: verifier/lwt converted to inline assembly selftests/bpf: verifier/loops1 converted to inline assembly selftests/bpf: verifier/jeq_infer_not_null converted to inline assembly selftests/bpf: verifier/direct_packet_access converted to inline assembly selftests/bpf: verifier/d_path converted to inline assembly selftests/bpf: verifier/ctx converted to inline assembly selftests/bpf: verifier/btf_ctx_access converted to inline assembly selftests/bpf: verifier/bpf_get_stack converted to inline assembly ... ==================== Link: https://lore.kernel.org/r/20230421211035.9111-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-22mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()Linus Torvalds2-2/+2
Instead of having callers care about the mmap_min_addr logic for the lowest valid mapping address (and some of them getting it wrong), just move the logic into vm_unmapped_area() itself. One less thing for various architecture cases (and generic helpers) to worry about. We should really try to make much more of this be common code, but baby steps.. Without this, vm_unmapped_area() could return an address below mmap_min_addr (because some caller forgot about that). That then causes the mmap machinery to think it has found a workable address, but then later security_mmap_addr(addr) is unhappy about it and the mmap() returns with a nonsensical error (EPERM). The proper action is to either return ENOMEM (if the virtual address space is exhausted), or try to find another address (ie do a bottom-up search for free addresses after the top-down one failed). See commit 2afc745f3e30 ("mm: ensure get_unmapped_area() returns higher address than mmap_min_addr"), which fixed this for one call site (the generic arch_get_unmapped_area_topdown() fallback) but left other cases alone. Link: https://lkml.kernel.org/r/20230418214009.1142926-1-Liam.Howlett@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-22mm: add new api to enable ksm per processStefan Roesch1-0/+7
Patch series "mm: process/cgroup ksm support", v9. So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. Use case 1: The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available. In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these type of workloads. Use case 2: The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis. Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls. Security concerns: In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable. Performance: Experiments with using UKSM have shown a capacity increase of around 20%. Here are the metrics from an instagram workload (taken from a machine with 64GB main memory): full_scans: 445 general_profit: 20158298048 max_page_sharing: 256 merge_across_nodes: 1 pages_shared: 129547 pages_sharing: 5119146 pages_to_scan: 4000 pages_unshared: 1760924 pages_volatile: 10761341 run: 1 sleep_millisecs: 20 stable_node_chains: 167 stable_node_chains_prune_millisecs: 2000 stable_node_dups: 2751 use_zero_pages: 0 zero_pages_sharing: 0 After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM. Detailed changes: 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 3. Add general_profit metric The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm. 4. Add more metrics to ksm_stat This adds the process profit metric to /proc/<pid>/ksm_stat. 5. Add more tests to ksm_tests and ksm_functional_tests This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM. It also adds a two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes. This patch (of 3): So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 1) Introduce new MMF_VM_MERGE_ANY flag This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process. 2) Setting VM_MERGEABLE on VMA creation When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA. 3) support disabling of ksm for a process This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl. 4) add new prctl option to get and set ksm for a process This adds two new options to the prctl system call - enable ksm for all vmas of a process (if the vmas support it). - query if ksm has been enabled for a process. 3. Disabling MMF_VM_MERGE_ANY for storage keys in s390 In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled. Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: David Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-20KVM: s390: pci: fix virtual-physical confusion on module unload/loadNico Boehr1-1/+1
When the kvm module is unloaded, zpci_setup_aipb() perists some data in the zpci_aipb structure in s390 pci code. Note that this struct is also passed to firmware in the zpci_set_irq_ctrl() call and thus the GAIT must be a physical address. On module re-insertion, the GAIT is restored from this structure in zpci_reset_aipb(). But it is a physical address, hence this may cause issues when the kvm module is unloaded and loaded again. Fix virtual vs physical address confusion (which currently are the same) by adding the necessary physical-to-virtual-conversion in zpci_reset_aipb(). Signed-off-by: Nico Boehr <nrb@linux.ibm.com> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230222155503.43399-1-nrb@linux.ibm.com Message-Id: <20230222155503.43399-1-nrb@linux.ibm.com>
2023-04-20KVM: s390: vsie: clarifications on setting the APCBPierre Morel1-21/+29
The APCB is part of the CRYCB. The calculation of the APCB origin can be done by adding the APCB offset to the CRYCB origin. Current code makes confusing transformations, converting the CRYCB origin to a pointer to calculate the APCB origin. Let's make things simpler and keep the CRYCB origin to make these calculations. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230214122841.13066-2-pmorel@linux.ibm.com Message-Id: <20230214122841.13066-2-pmorel@linux.ibm.com>
2023-04-20KVM: s390: interrupt: fix virtual-physical confusion for next alert GISANico Boehr1-2/+2
We sometimes put a virtual address in next_alert, which should always be a physical address, since it is shared with hardware. This currently works, because virtual and physical addresses are the same. Add phys_to_virt() to resolve the virtual-physical confusion. Signed-off-by: Nico Boehr <nrb@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230223162236.51569-1-nrb@linux.ibm.com Message-Id: <20230223162236.51569-1-nrb@linux.ibm.com>
2023-04-20s390/stackleak: provide fast __stackleak_poison() implementationHeiko Carstens1-0/+35
Provide an s390 specific __stackleak_poison() implementation which is faster than the generic variant. For the original implementation with an enforced 4kb stackframe for the getpid() system call the system call overhead increases by a factor of 3 if the stackleak feature is enabled. Using the s390 mvc based variant this is reduced to an increase of 25% instead. This is within the expected area, since the mvc based implementation is more or less a memset64() variant which comes with similar results. See commit 0b77d6701cf8 ("s390: implement memset16, memset32 & memset64"). Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Link: https://lore.kernel.org/r/20230405130841.1350565-3-hca@linux.ibm.com Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-04-20s390: select ARCH_USE_SYM_ANNOTATIONSHeiko Carstens1-0/+1
All old style assembly annotations have been converted for s390. Select ARCH_USE_SYM_ANNOTATIONS to make sure the old macros like ENTRY() aren't available anymore. This prevents that new code which uses the old macros will be added again. This follows what has been done for x86 with commit 2ce0d7f9766f ("x86/asm: Provide a Kconfig symbol for disabling old assembly annotations") and for arm64 with commit 50479d58eaa3 ("arm64: Disable old style assembly annotations"). Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-04-20s390/mm: use VM_FLUSH_RESET_PERMS in module_alloc()Heiko Carstens2-8/+54
Make use of the set_direct_map() calls for module allocations. In particular: - All changes to read-only permissions in kernel VA mappings are also applied to the direct mapping. Note that execute permissions are intentionally not applied to the direct mapping in order to make sure that all allocated pages within the direct mapping stay non-executable - module_alloc() passes the VM_FLUSH_RESET_PERMS to __vmalloc_node_range() to make sure that all implicit permission changes made to the direct mapping are reset when the allocated vm area is freed again Side effects: the direct mapping will be fragmented depending on how many vm areas with VM_FLUSH_RESET_PERMS and/or explicit page permission changes are allocated and freed again. For example, just after boot of a system the direct mapping statistics look like: $cat /proc/meminfo ... DirectMap4k: 111628 kB DirectMap1M: 16665600 kB DirectMap2G: 0 kB Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-04-20s390: wire up memfd_secret system callHeiko Carstens1-1/+1
s390 supports ARCH_HAS_SET_DIRECT_MAP, therefore wire up the memfd_secret system call, which depends on it. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-04-20s390/mm: enable ARCH_HAS_SET_DIRECT_MAPHeiko Carstens3-0/+43
Implement the set_direct_map_*() API, which allows to invalidate and set default permissions to pages within the direct mapping. Note that kernel_page_present(), which is also supposed to be part of this API, is intentionally not implemented. The reason for this is that kernel_page_present() is only used (and currently only makes sense) for suspend/resume, which isn't supported on s390. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-04-20s390/mm: use BIT macro to generate SET_MEMORY bit masksHeiko Carstens1-5/+13
Use BIT macro to generate SET_MEMORY bit masks, which is easier to maintain if bits get added, or removed. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-04-19s390/relocate_kernel: adjust indentationHeiko Carstens1-42/+42
relocate_kernel.S seems to be the only assembler file which doesn't follow the standard way of indentation. Adjust this for the sake of consistency. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-04-19s390/relocate_kernel: use SYM* macros instead of ENTRY(), etc.Heiko Carstens1-9/+7
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-04-19s390/entry: use SYM* macros instead of ENTRY(), etc.Heiko Carstens1-35/+34
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-04-19s390/purgatory: use SYM* macros instead of ENTRY(), etc.Heiko Carstens2-51/+25
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>