summaryrefslogtreecommitdiff
path: root/include/uapi
AgeCommit message (Collapse)AuthorFilesLines
2021-07-14Merge tag 'net-5.14-rc2' of ↵Linus Torvalds4-5/+33
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski. "Including fixes from bpf and netfilter. Current release - regressions: - sock: fix parameter order in sock_setsockopt() Current release - new code bugs: - netfilter: nft_last: - fix incorrect arithmetic when restoring last used - honor NFTA_LAST_SET on restoration Previous releases - regressions: - udp: properly flush normal packet at GRO time - sfc: ensure correct number of XDP queues; don't allow enabling the feature if there isn't sufficient resources to Tx from any CPU - dsa: sja1105: fix address learning getting disabled on the CPU port - mptcp: addresses a rmem accounting issue that could keep packets in subflow receive buffers longer than necessary, delaying MPTCP-level ACKs - ip_tunnel: fix mtu calculation for ETHER tunnel devices - do not reuse skbs allocated from skbuff_fclone_cache in the napi skb cache, we'd try to return them to the wrong slab cache - tcp: consistently disable header prediction for mptcp Previous releases - always broken: - bpf: fix subprog poke descriptor tracking use-after-free - ipv6: - allocate enough headroom in ip6_finish_output2() in case iptables TEE is used - tcp: drop silly ICMPv6 packet too big messages to avoid expensive and pointless lookups (which may serve as a DDOS vector) - make sure fwmark is copied in SYNACK packets - fix 'disable_policy' for forwarded packets (align with IPv4) - netfilter: conntrack: - do not renew entry stuck in tcp SYN_SENT state - do not mark RST in the reply direction coming after SYN packet for an out-of-sync entry - mptcp: cleanly handle error conditions with MP_JOIN and syncookies - mptcp: fix double free when rejecting a join due to port mismatch - validate lwtstate->data before returning from skb_tunnel_info() - tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path - mt76: mt7921: continue to probe driver when fw already downloaded - bonding: fix multiple issues with offloading IPsec to (thru?) bond - stmmac: ptp: fix issues around Qbv support and setting time back - bcmgenet: always clear wake-up based on energy detection Misc: - sctp: move 198 addresses from unusable to private scope - ptp: support virtual clocks and timestamping - openvswitch: optimize operation for key comparison" * tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits) net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave() sfc: add logs explaining XDP_TX/REDIRECT is not available sfc: ensure correct number of XDP queues sfc: fix lack of XDP TX queues - error XDP TX failed (-22) net: fddi: fix UAF in fza_probe net: dsa: sja1105: fix address learning getting disabled on the CPU port net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload net: Use nlmsg_unicast() instead of netlink_unicast() octeontx2-pf: Fix uninitialized boolean variable pps ipv6: allocate enough headroom in ip6_finish_output2() net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific net: bridge: multicast: fix MRD advertisement router port marking race net: bridge: multicast: fix PIM hello router port marking race net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340 dsa: fix for_each_child.cocci warnings virtio_net: check virtqueue_add_sgs() return value mptcp: properly account bulk freed memory selftests: mptcp: fix case multiple subflows limited by server mptcp: avoid processing packet if a subflow reset mptcp: fix syncookie process if mptcp can not_accept new subflow ...
2021-07-09Merge tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+1
Pull more block updates from Jens Axboe: "A combination of changes that ended up depending on both the driver and core branch (and/or the IDE removal), and a few late arriving fixes. In detail: - Fix io ticks wrap-around issue (Chunguang) - nvme-tcp sock locking fix (Maurizio) - s390-dasd fixes (Kees, Christoph) - blk_execute_rq polling support (Keith) - blk-cgroup RCU iteration fix (Yu) - nbd backend ID addition (Prasanna) - Partition deletion fix (Yufen) - Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph) - Removal of now dead block request types due to IDE removal (Christoph) - Loop probing and control device cleanups (Christoph) - Device uevent fix (Christoph) - Misc cleanups/fixes (Tetsuo, Christoph)" * tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits) blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs block: fix the problem of io_ticks becoming smaller nvme-tcp: can't set sk_user_data without write_lock loop: remove unused variable in loop_set_status() block: remove the bdgrab in blk_drop_partitions block: grab a device refcount in disk_uevent s390/dasd: Avoid field over-reading memcpy() dasd: unexport dasd_set_target_state block: check disk exist before trying to add partition ubd: remove dead code in ubd_setup_common nvme: use return value from blk_execute_rq() block: return errors from blk_execute_rq() nvme: use blk_execute_rq() for passthrough commands block: support polling through blk_execute_rq block: remove REQ_OP_SCSI_{IN,OUT} block: mark blk_mq_init_queue_data static loop: rewrite loop_exit using idr_for_each_entry loop: split loop_lookup loop: don't allow deleting an unspecified loop device loop: move loop_ctl_mutex locking into loop_add ...
2021-07-09Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-0/+12
Pull virtio,vhost,vdpa updates from Michael Tsirkin: - Doorbell remapping for ifcvf, mlx5 - virtio_vdpa support for mlx5 - Validate device input in several drivers (for SEV and friends) - ZONE_MOVABLE aware handling in virtio-mem - Misc fixes, cleanups * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-mem: prioritize unplug from ZONE_MOVABLE in Big Block Mode virtio-mem: simplify high-level unplug handling in Big Block Mode virtio-mem: prioritize unplug from ZONE_MOVABLE in Sub Block Mode virtio-mem: simplify high-level unplug handling in Sub Block Mode virtio-mem: simplify high-level plug handling in Sub Block Mode virtio-mem: use page_zonenum() in virtio_mem_fake_offline() virtio-mem: don't read big block size in Sub Block Mode virtio/vdpa: clear the virtqueue state during probe vp_vdpa: allow set vq state to initial state after reset virtio-pci library: introduce vp_modern_get_driver_features() vdpa: support packed virtqueue for set/get_vq_state() virtio-ring: store DMA metadata in desc_extra for split virtqueue virtio: use err label in __vring_new_virtqueue() virtio_ring: introduce virtqueue_desc_add_split() virtio_ring: secure handling of mapping errors virtio-ring: factor out desc_extra allocation virtio_ring: rename vring_desc_extra_packed virtio-ring: maintain next in extra state for packed virtqueue vdpa/mlx5: Clear vq ready indication upon device reset vdpa/mlx5: Add support for doorbell bypassing ...
2021-07-09Merge tag 'for-linus-5.14-rc1' of ↵Linus Torvalds1-0/+64
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - Support for optimized routines based on the host CPU - Support for PCI via virtio - Various fixes * tag 'for-linus-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: remove unneeded semicolon in um_arch.c um: Remove the repeated declaration um: fix error return code in winch_tramp() um: fix error return code in slip_open() um: Fix stack pointer alignment um: implement flush_cache_vmap/flush_cache_vunmap um: add a UML specific futex implementation um: enable the use of optimized xor routines in UML um: Add support for host CPU flags and alignment um: allow not setting extra rpaths in the linux binary um: virtio/pci: enable suspend/resume um: add PCI over virtio emulation driver um: irqs: allow invoking time-travel handler multiple times um: time-travel/signals: fix ndelay() in interrupt um: expose time-travel mode to userspace side um: export signals_enabled directly um: remove unused smp_sigio_handler() declaration lib: add iomem emulation (logic_iomem) um: allow disabling NO_IOMEM
2021-07-09Merge branch 'akpm' (patches from Andrew)Linus Torvalds2-1/+7
Pull yet more updates from Andrew Morton: "54 patches. Subsystems affected by this patch series: lib, mm (slub, secretmem, cleanups, init, pagemap, and mremap), and debug" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (54 commits) powerpc/mm: enable HAVE_MOVE_PMD support powerpc/book3s64/mm: update flush_tlb_range to flush page walk cache mm/mremap: allow arch runtime override mm/mremap: hold the rmap lock in write mode when moving page table entries. mm/mremap: use pmd/pud_poplulate to update page table entries mm/mremap: don't enable optimized PUD move if page table levels is 2 mm/mremap: convert huge PUD move to separate helper selftest/mremap_test: avoid crash with static build selftest/mremap_test: update the test to handle pagesize other than 4K mm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t * mm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t * kdump: use vmlinux_build_id to simplify buildid: fix kernel-doc notation buildid: mark some arguments const scripts/decode_stacktrace.sh: indicate 'auto' can be used for base path scripts/decode_stacktrace.sh: silence stderr messages from addr2line/nm scripts/decode_stacktrace.sh: support debuginfod x86/dumpstack: use %pSb/%pBb for backtrace printing arm64: stacktrace: use %pSb for backtrace printing module: add printk formats to add module build ID to stacktraces ...
2021-07-08Merge tag 'pci-v5.14-changes' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull pci updates from Bjorn Helgaas: "Enumeration: - Fix dsm_label_utf16s_to_utf8s() buffer overrun (Krzysztof Wilczyński) - Rely on lengths from scnprintf(), dsm_label_utf16s_to_utf8s() (Krzysztof Wilczyński) - Use sysfs_emit() and sysfs_emit_at() in "show" functions (Krzysztof Wilczyński) - Fix 'resource_alignment' newline issues (Krzysztof Wilczyński) - Add 'devspec' newline (Krzysztof Wilczyński) - Dynamically map ECAM regions (Russell King) Resource management: - Coalesce host bridge contiguous apertures (Kai-Heng Feng) PCIe native device hotplug: - Ignore Link Down/Up caused by DPC (Lukas Wunner) Power management: - Leave Apple Thunderbolt controllers on for s2idle or standby (Konstantin Kharlamov) Virtualization: - Work around Huawei Intelligent NIC VF FLR erratum (Chiqijun) - Clarify error message for unbound IOV devices (Moritz Fischer) - Add pci_reset_bus_function() Secondary Bus Reset interface (Raphael Norwitz) Peer-to-peer DMA: - Simplify distance calculation (Christoph Hellwig) - Finish RCU conversion of pdev->p2pdma (Eric Dumazet) - Rename upstream_bridge_distance() and rework doc (Logan Gunthorpe) - Collect acs list in stack buffer to avoid sleeping (Logan Gunthorpe) - Use correct calc_map_type_and_dist() return type (Logan Gunthorpe) - Warn if host bridge not in whitelist (Logan Gunthorpe) - Refactor pci_p2pdma_map_type() (Logan Gunthorpe) - Avoid pci_get_slot(), which may sleep (Logan Gunthorpe) Altera PCIe controller driver: - Add Joyce Ooi as Altera PCIe maintainer (Joyce Ooi) Broadcom iProc PCIe controller driver: - Fix multi-MSI base vector number allocation (Sandor Bodo-Merle) - Support multi-MSI only on uniprocessor kernel (Sandor Bodo-Merle) Freescale i.MX6 PCIe controller driver: - Limit DBI register length for imx6qp PCIe (Richard Zhu) - Add "vph-supply" for PHY supply voltage (Richard Zhu) - Enable PHY internal regulator when supplied >3V (Richard Zhu) - Remove imx6_pcie_probe() redundant error message (Zhen Lei) Intel Gateway PCIe controller driver: - Fix INTx enable (Martin Blumenstingl) Marvell Aardvark PCIe controller driver: - Fix checking for PIO Non-posted Request (Pali Rohár) - Implement workaround for the readback value of VEND_ID (Pali Rohár) MediaTek PCIe controller driver: - Remove redundant error printing in mtk_pcie_subsys_powerup() (Zhen Lei) MediaTek PCIe Gen3 controller driver: - Add missing MODULE_DEVICE_TABLE (Zou Wei) Microchip PolarFlare PCIe controller driver: - Make struct event_descs static (Krzysztof Wilczyński) Microsoft Hyper-V host bridge driver: - Fix race condition when removing the device (Long Li) - Remove bus device removal unused refcount/functions (Long Li) Mobiveil PCIe controller driver: - Remove unused readl and writel functions (Krzysztof Wilczyński) NVIDIA Tegra PCIe controller driver: - Add missing MODULE_DEVICE_TABLE (Zou Wei) NVIDIA Tegra194 PCIe controller driver: - Fix tegra_pcie_ep_raise_msi_irq() ill-defined shift (Jon Hunter) - Fix host initialization during resume (Vidya Sagar) Rockchip PCIe controller driver: - Register IRQ handlers after device and data are ready (Javier Martinez Canillas)" * tag 'pci-v5.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (48 commits) PCI/P2PDMA: Finish RCU conversion of pdev->p2pdma PCI: xgene: Annotate __iomem pointer PCI: Fix kernel-doc formatting PCI: cpcihp: Declare cpci_debug in header file MAINTAINERS: Add Joyce Ooi as Altera PCIe maintainer PCI: rockchip: Register IRQ handlers after device and data are ready PCI: tegra194: Fix tegra_pcie_ep_raise_msi_irq() ill-defined shift PCI: aardvark: Implement workaround for the readback value of VEND_ID PCI: aardvark: Fix checking for PIO Non-posted Request PCI: tegra194: Fix host initialization during resume PCI: tegra: Add missing MODULE_DEVICE_TABLE PCI: imx6: Enable PHY internal regulator when supplied >3V dt-bindings: imx6q-pcie: Add "vph-supply" for PHY supply voltage PCI: imx6: Limit DBI register length for imx6qp PCIe PCI: imx6: Remove imx6_pcie_probe() redundant error message PCI: intel-gw: Fix INTx enable PCI: iproc: Support multi-MSI only on uniprocessor kernel PCI: iproc: Fix multi-MSI base vector number allocation PCI: mediatek-gen3: Add missing MODULE_DEVICE_TABLE PCI: Dynamically map ECAM regions ...
2021-07-08arch, mm: wire up memfd_secret system call where relevantMike Rapoport1-1/+6
Wire up memfd_secret system call on architectures that define ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86. Link: https://lkml.kernel.org/r/20210518072034.31572-7-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christopher Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <jejb@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Will Deacon <will@kernel.org> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08mm: introduce memfd_secret system call to create "secret" memory areasMike Rapoport1-0/+1
Introduce "memfd_secret" system call with the ability to create memory areas visible only in the context of the owning process and not mapped not only to other processes but in the kernel page tables as well. The secretmem feature is off by default and the user must explicitly enable it at the boot time. Once secretmem is enabled, the user will be able to create a file descriptor using the memfd_secret() system call. The memory areas created by mmap() calls from this file descriptor will be unmapped from the kernel direct map and they will be only mapped in the page table of the processes that have access to the file descriptor. Secretmem is designed to provide the following protections: * Enhanced protection (in conjunction with all the other in-kernel attack prevention systems) against ROP attacks. Seceretmem makes "simple" ROP insufficient to perform exfiltration, which increases the required complexity of the attack. Along with other protections like the kernel stack size limit and address space layout randomization which make finding gadgets is really hard, absence of any in-kernel primitive for accessing secret memory means the one gadget ROP attack can't work. Since the only way to access secret memory is to reconstruct the missing mapping entry, the attacker has to recover the physical page and insert a PTE pointing to it in the kernel and then retrieve the contents. That takes at least three gadgets which is a level of difficulty beyond most standard attacks. * Prevent cross-process secret userspace memory exposures. Once the secret memory is allocated, the user can't accidentally pass it into the kernel to be transmitted somewhere. The secreremem pages cannot be accessed via the direct map and they are disallowed in GUP. * Harden against exploited kernel flaws. In order to access secretmem, a kernel-side attack would need to either walk the page tables and create new ones, or spawn a new privileged uiserspace process to perform secrets exfiltration using ptrace. The file descriptor based memory has several advantages over the "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). File descriptor approach allows explicit and controlled sharing of the memory areas, it allows to seal the operations. Besides, file descriptor based memory paves the way for VMMs to remove the secret memory range from the userspace hipervisor process, for instance QEMU. Andy Lutomirski says: "Getting fd-backed memory into a guest will take some possibly major work in the kernel, but getting vma-backed memory into a guest without mapping it in the host user address space seems much, much worse." memfd_secret() is made a dedicated system call rather than an extension to memfd_create() because it's purpose is to allow the user to create more secure memory mappings rather than to simply allow file based access to the memory. Nowadays a new system call cost is negligible while it is way simpler for userspace to deal with a clear-cut system calls than with a multiplexer or an overloaded syscall. Moreover, the initial implementation of memfd_secret() is completely distinct from memfd_create() so there is no much sense in overloading memfd_create() to begin with. If there will be a need for code sharing between these implementation it can be easily achieved without a need to adjust user visible APIs. The secret memory remains accessible in the process context using uaccess primitives, but it is not exposed to the kernel otherwise; secret memory areas are removed from the direct map and functions in the follow_page()/get_user_page() family will refuse to return a page that belongs to the secret memory area. Once there will be a use case that will require exposing secretmem to the kernel it will be an opt-in request in the system call flags so that user would have to decide what data can be exposed to the kernel. Removing of the pages from the direct map may cause its fragmentation on architectures that use large pages to map the physical memory which affects the system performance. However, the original Kconfig text for CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736 ("x86: add gbpages switches")) and the recent report [1] showed that "... although 1G mappings are a good default choice, there is no compelling evidence that it must be the only choice". Hence, it is sufficient to have secretmem disabled by default with the ability of a system administrator to enable it at boot time. Pages in the secretmem regions are unevictable and unmovable to avoid accidental exposure of the sensitive data via swap or during page migration. Since the secretmem mappings are locked in memory they cannot exceed RLIMIT_MEMLOCK. Since these mappings are already locked independently from mlock(), an attempt to mlock()/munlock() secretmem range would fail and mlockall()/munlockall() will ignore secretmem mappings. However, unlike mlock()ed memory, secretmem currently behaves more like long-term GUP: secretmem mappings are unmovable mappings directly consumed by user space. With default limits, there is no excessive use of secretmem and it poses no real problem in combination with ZONE_MOVABLE/CMA, but in the future this should be addressed to allow balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA. A page that was a part of the secret memory area is cleared when it is freed to ensure the data is not exposed to the next user of that page. The following example demonstrates creation of a secret mapping (error handling is omitted): fd = memfd_secret(0); ftruncate(fd, MAP_SIZE); ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ [akpm@linux-foundation.org: suppress Kconfig whine] Link: https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Hagen Paul Pfeifer <hagen@jauu.net> Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <jejb@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Will Deacon <will@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-07Merge tag 'x86-fpu-2021-07-07' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu updates from Thomas Gleixner: "Fixes and improvements for FPU handling on x86: - Prevent sigaltstack out of bounds writes. The kernel unconditionally writes the FPU state to the alternate stack without checking whether the stack is large enough to accomodate it. Check the alternate stack size before doing so and in case it's too small force a SIGSEGV instead of silently corrupting user space data. - MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never been updated despite the fact that the FPU state which is stored on the signal stack has grown over time which causes trouble in the field when AVX512 is available on a CPU. The kernel does not expose the minimum requirements for the alternate stack size depending on the available and enabled CPU features. ARM already added an aux vector AT_MINSIGSTKSZ for the same reason. Add it to x86 as well. - A major cleanup of the x86 FPU code. The recent discoveries of XSTATE related issues unearthed quite some inconsistencies, duplicated code and other issues. The fine granular overhaul addresses this, makes the code more robust and maintainable, which allows to integrate upcoming XSTATE related features in sane ways" * tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again x86/fpu/signal: Let xrstor handle the features to init x86/fpu/signal: Handle #PF in the direct restore path x86/fpu: Return proper error codes from user access functions x86/fpu/signal: Split out the direct restore code x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing() x86/fpu/signal: Sanitize the xstate check on sigframe x86/fpu/signal: Remove the legacy alignment check x86/fpu/signal: Move initial checks into fpu__restore_sig() x86/fpu: Mark init_fpstate __ro_after_init x86/pkru: Remove xstate fiddling from write_pkru() x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate() x86/fpu: Remove PKRU handling from switch_fpu_finish() x86/fpu: Mask PKRU from kernel XRSTOR[S] operations x86/fpu: Hook up PKRU into ptrace() x86/fpu: Add PKRU storage outside of task XSAVE buffer x86/fpu: Dont restore PKRU in fpregs_restore_userspace() x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi() x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs() x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs() ...
2021-07-07netfilter: uapi: refer to nfnetlink_conntrack.h, not nf_conntrack_netlink.hDuncan Roe2-3/+3
nf_conntrack_netlink.h does not exist, refer to nfnetlink_conntrack.h instead. Signed-off-by: Duncan Roe <duncan_roe@optusnet.com.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-06Merge tag 'fuse-update-5.14' of ↵Linus Torvalds1-1/+9
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Fixes for virtiofs submounts - Misc fixes and cleanups * tag 'fuse-update-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: virtiofs: Fix spelling mistakes fuse: use DIV_ROUND_UP helper macro for calculations fuse: fix illegal access to inode with reused nodeid fuse: allow fallocate(FALLOC_FL_ZERO_RANGE) fuse: Make fuse_fill_super_submount() static fuse: Switch to fc_mount() for submounts fuse: Call vfs_get_tree() for submounts fuse: add dedicated filesystem context ops for submounts virtiofs: propagate sync() to file server fuse: reject internal errno fuse: check connected before queueing on fpq->io fuse: ignore PG_workingset after stealing fuse: Fix infinite loop in sget_fc() fuse: Fix crash if superblock of submount gets killed early fuse: Fix crash in fuse_dentry_automount() error path
2021-07-06PCI: Fix kernel-doc formattingKrzysztof Wilczyński1-1/+1
Fix kernel-doc formatting throughout drivers/pci and related include files. No change to functionality intended. Check for warnings: $ find include drivers/pci -type f -path "*pci*.[ch]" | xargs scripts/kernel-doc -none [bhelgaas: squashed to one commit] Link: https://lore.kernel.org/r/20210509030237.368540-1-kw@linux.com Link: https://lore.kernel.org/r/20210703151306.1922450-1-kw@linux.com Link: https://lore.kernel.org/r/20210703151306.1922450-2-kw@linux.com Link: https://lore.kernel.org/r/20210703151306.1922450-3-kw@linux.com Link: https://lore.kernel.org/r/20210703151306.1922450-4-kw@linux.com Link: https://lore.kernel.org/r/20210703151306.1922450-5-kw@linux.com Signed-off-by: Krzysztof Wilczyński <kw@linux.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-07-06Merge tag 'tty-5.14-rc1' of ↵Linus Torvalds1-99/+0
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty / serial updates from Greg KH: "Here is the big set of tty and serial driver patches for 5.14-rc1. A bit more than normal, but nothing major, lots of cleanups. Highlights are: - lots of tty api cleanups and mxser driver cleanups from Jiri - build warning fixes - various serial driver updates - coding style cleanups - various tty driver minor fixes and updates - removal of broken and disable r3964 line discipline (finally!) All of these have been in linux-next for a while with no reported issues" * tag 'tty-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (227 commits) serial: mvebu-uart: remove unused member nb from struct mvebu_uart arm64: dts: marvell: armada-37xx: Fix reg for standard variant of UART dt-bindings: mvebu-uart: fix documentation serial: mvebu-uart: correctly calculate minimal possible baudrate serial: mvebu-uart: do not allow changing baudrate when uartclk is not available serial: mvebu-uart: fix calculation of clock divisor tty: make linux/tty_flip.h self-contained serial: Prefer unsigned int to bare use of unsigned serial: 8250: 8250_omap: Fix possible interrupt storm on K3 SoCs serial: qcom_geni_serial: use DT aliases according to DT bindings Revert "tty: serial: Add UART driver for Cortina-Access platform" tty: serial: Add UART driver for Cortina-Access platform MAINTAINERS: add me back as mxser maintainer mxser: Documentation, fix typos mxser: Documentation, make the docs up-to-date mxser: Documentation, remove traces of callout device mxser: introduce mxser_16550A_or_MUST helper mxser: rename flags to old_speed in mxser_set_serial_info mxser: use port variable in mxser_set_serial_info mxser: access info->MCR under info->slock ...
2021-07-05Merge tag 'char-misc-5.14-rc1' of ↵Linus Torvalds2-17/+13
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char / misc driver updates from Greg KH: "Here is the big set of char / misc and other driver subsystem updates for 5.14-rc1. Included in here are: - habanalabs driver updates - fsl-mc driver updates - comedi driver updates - fpga driver updates - extcon driver updates - interconnect driver updates - mei driver updates - nvmem driver updates - phy driver updates - pnp driver updates - soundwire driver updates - lots of other tiny driver updates for char and misc drivers This is looking more and more like the "various driver subsystems mushed together" tree... All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (292 commits) mcb: Use DEFINE_RES_MEM() helper macro and fix the end address PNP: moved EXPORT_SYMBOL so that it immediately followed its function/variable bus: mhi: pci-generic: Add missing 'pci_disable_pcie_error_reporting()' calls bus: mhi: Wait for M2 state during system resume bus: mhi: core: Fix power down latency intel_th: Wait until port is in reset before programming it intel_th: msu: Make contiguous buffers uncached intel_th: Remove an unused exit point from intel_th_remove() stm class: Spelling fix nitro_enclaves: Set Bus Master for the NE PCI device misc: ibmasm: Modify matricies to matrices misc: vmw_vmci: return the correct errno code siox: Simplify error handling via dev_err_probe() fpga: machxo2-spi: Address warning about unused variable lkdtm/heap: Add init_on_alloc tests selftests/lkdtm: Enable various testable CONFIGs lkdtm: Add CONFIG hints in errors where possible lkdtm: Enable DOUBLE_FAULT on all architectures lkdtm/heap: Add vmalloc linear overflow test lkdtm/bugs: XFAIL UNALIGNED_LOAD_STORE_WRITE ...
2021-07-04Merge tag 'cxl-for-5.14' of ↵Linus Torvalds1-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull CXL (Compute Express Link) updates from Dan Williams: "This subsystem is still in the build-out phase as the bulk of the update is improvements to enumeration and fleshing out the device model. In terms of new features, more mailbox commands have been added to the allowed-list in support of persistent memory provisioning support targeting v5.15. The critical update from an enumeration perspective is support for the CXL Fixed Memory Window Structure that indicates to Linux which system physical address ranges decode to the CXL Host Bridges in the system. This allows the driver to detect which address ranges have been mapped by firmware and what address ranges are available for future hotplug. So, again, mostly skeleton this round, with more meat targeting v5.15. Summary: - Add support for the CXL Fixed Memory Window Structure, a recent extension of the ACPI CEDT (CXL Early Discovery Table) - Add infrastructure for component registers - Add HDM (Host-managed device memory) decoder definitions - Define a device model for an HDM decoder tree - Bridge CXL persistent memory capabilities to an NVDIMM bus / device-model - Switch to fine grained mapping of CXL MMIO registers to allow different drivers / system software to own individual register blocks - Enable media provisioning commands, and publish the label storage area size in sysfs - Miscellaneous cleanups and fixes" * tag 'cxl-for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (34 commits) cxl/pci: Rename CXL REGLOC ID cxl/acpi: Use the ACPI CFMWS to create static decoder objects cxl/acpi: Add the Host Bridge base address to CXL port objects cxl/pmem: Register 'pmem' / cxl_nvdimm devices libnvdimm: Drop unused device power management support libnvdimm: Export nvdimm shutdown helper, nvdimm_delete() cxl/pmem: Add initial infrastructure for pmem support cxl/core: Add cxl-bus driver infrastructure cxl/pci: Add media provisioning required commands cxl/component_regs: Fix offset cxl/hdm: Fix decoder count calculation cxl/acpi: Introduce cxl_decoder objects cxl/acpi: Enumerate host bridge root ports cxl/acpi: Add downstream port data to cxl_port instances cxl/Kconfig: Default drivers to CONFIG_CXL_BUS cxl/acpi: Introduce the root of a cxl_port topology cxl/pci: Fixup devm_cxl_iomap_block() to take a 'struct device *' cxl/pci: Add HDM decoder capabilities cxl/pci: Reserve individual register block regions cxl/pci: Map registers based on capabilities ...
2021-07-03virtio: update virtio id table, add transitional idsZhu Lingshan1-0/+12
This commit updates virtio id table by adding transitional device ids Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Link: https://lore.kernel.org/r/20210510081015.4212-2-lingshan.zhu@intel.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-03Merge tag 'sound-5.14-rc1' of ↵Linus Torvalds1-2/+28
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound updates from Takashi Iwai: "As the diffstat scatters over the tree, we've got many tree-wide small changes, but also got quite a few intrusive changes in the core side. The only ABI-visible core change is the new rawmidi framing mode support while others are kernel-internal, mostly code refactoring and/or nice improvements. Here are some highlights: Core: - A new framing access mode for rawmidi to get timestamps - Cleanup / refactoring of buffer memory management helper code - Support for automatic negotiation of ASoC DAI formats - Revival of software suspend for PCM and control core, as a preliminary work for PCI BAR rescan support ASoC: - Accessory detection support for several Qualcomm parts - Support for IEC958 control with hdmi-codec - Merging of Tegra machine drivers into a single driver - Support for AmLogic SM1 TOACODEC, Intel AlderLake-M, several NXP i.MX8 variants, NXP TFA1 and TDF9897, Rockchip RK817, Qualcomm Quinary MI2S, Texas Instruments TAS2505 USB-audio: - Reduction of latency at playback start - Code cleanup / fixes of usx2y driver - Scarlett2 mixer code fixes and enhancements - Quirks for Ozone and Denon devices HD-audio: - A few quirks for HP and ASUS machines - Display power management fixes Others: - FireWire code refactoring and enhancements - Tree-wide trivial coding-style fixes" * tag 'sound-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (594 commits) ALSA: usb-audio: scarlett2: Fix for loop increment in scarlett2_usb_get_config ALSA: hda/realtek: fix mute/micmute LEDs for HP ProBook 630 G8 ALSA: hda/realtek: fix mute/micmute LEDs for HP ProBook 445 G8 ALSA: hda/realtek: fix mute/micmute LEDs for HP ProBook 450 G8 ALSA: hda/realtek - Add ALC285 HP init procedure ALSA: hda/realtek - Add type for ALC287 ALSA: scarlett2: Fix scarlett2_*_ctl_put() return values again ALSA: scarlett2: Fix pad count for 18i8 Gen 3 ALSA: hda/realtek: fix mute/micmute LEDs for HP EliteBook 830 G8 Notebook PC ALSA: firewire-lib: Fix 'amdtp_domain_start()' when no AMDTP_OUT_STREAM stream is found ASoC: qcom: lpass-cpu: mark IRQ_CLEAR register as volatile and readable ALSA: hda: Release codec display power during shutdown/reboot ALSA: hda: Release controller display power during shutdown/reboot ALSA: hda/realtek: Apply LED fixup for HP Dragonfly G1, too ASoC: fsl: remove unnecessary oom message ASoC: tlv320aic32x4: dt-bindings: add TAS2505 to compatible ASoC: tlv320aic32x4: add support for TAS2505 ASoC: tlv320aic32x4: add type to device private data struct ASoC: tegra30: ahub: Use devm_platform_get_and_ioremap_resource() ASoC: tegra: tegra210_admaif: Use devm_platform_get_and_ioremap_resource() ...
2021-07-02Merge branch 'akpm' (patches from Andrew)Linus Torvalds3-2/+9
Merge more updates from Andrew Morton: "190 patches. Subsystems affected by this patch series: mm (hugetlb, userfaultfd, vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock, migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap, zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc, core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs, signals, exec, kcov, selftests, compress/decompress, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) ipc/util.c: use binary search for max_idx ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock ipc: use kmalloc for msg_queue and shmid_kernel ipc sem: use kvmalloc for sem_undo allocation lib/decompressors: remove set but not used variabled 'level' selftests/vm/pkeys: exercise x86 XSAVE init state selftests/vm/pkeys: refill shadow register after implicit kernel write selftests/vm/pkeys: handle negative sys_pkey_alloc() return code selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random kcov: add __no_sanitize_coverage to fix noinstr for all architectures exec: remove checks in __register_bimfmt() x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned hfsplus: report create_date to kstat.btime hfsplus: remove unnecessary oom message nilfs2: remove redundant continue statement in a while-loop kprobes: remove duplicated strong free_insn_page in x86 and s390 init: print out unknown kernel parameters checkpatch: do not complain about positive return values starting with EPOLL checkpatch: improve the indented label test checkpatch: scripts/spdxcheck.py now requires python3 ...
2021-07-02Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds7-109/+136
Pull rdma updates from Jason Gunthorpe: "This contains a replacement driver for Intel iWarp hardware. This new driver supports the old ethernet hardware and also newer chips that can do ROCE. Other than that, this contains the typical mix of patches: - Driver updates and cleanups for bnxt_re, cxgb4, mlx4, and mlx5 - Many static checker driven code clean ups, including a wide refcount_t conversion - Several series for the hns driver, more HIP09 HW capabilities, migration to new HW register manipulators, and code cleanups - Minor fixes and improvements in srp, rts, and cm - Improvements throughout for sysfs related code to use DEVICE_ATTR_*, make the ib_port sysfs first-class, and overall use sysfs APIs properly - Intel's new irdma driver replacing i40iw - rxe general clean ups and Memory Window support" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (211 commits) RDMA/core: Always release restrack object RDMA/mlx5: Don't access NULL-cleared mpi pointer RDMA/irdma: Fix potential overflow expression in irdma_prm_get_pbles RDMA/irdma: Check contents of user-space irdma_mem_reg_req object RDMA/rxe: Missing unlock on error in get_srq_wqe() RDMA/cma: Fix rdma_resolve_route() memory leak RDMA/core/sa_query: Remove unused argument RDMA/cma: Fix incorrect Packet Lifetime calculation RDMA/cma: Protect RMW with qp_mutex RDMA/cma: Remove unnecessary INIT->INIT transition RDMA/hns: Add window selection field of congestion control RDMA/hfi1: Remove use of kmap() RDMA/irdma: Remove use of kmap() RDMA/bnxt_re: Fix uninitialized struct bit field rsvd1 IB/isert: Align target max I/O size to initiator size RDMA/hns: Fix incorrect vlan enable bit in QPC MAINTAINERS: Update Broadcom RDMA maintainers RDMA/irdma: Use the queried port attributes RDMA/rxe: Fix redundant skb_put_zero RDMA/rxe: Fix extra copy in prepare_ack_packet ...
2021-07-01net: sock: extend SO_TIMESTAMPING for PHC bindingYangbo Lu1-2/+15
Since PTP virtual clock support is added, there can be several PTP virtual clocks based on one PTP physical clock for timestamping. This patch is to extend SO_TIMESTAMPING API to support PHC (PTP Hardware Clock) binding by adding a new flag SOF_TIMESTAMPING_BIND_PHC. When PTP virtual clocks are in use, user space can configure to bind one for timestamping, but PTP physical clock is not supported and not needed to bind. This patch is preparation for timestamp conversion from raw timestamp to a specific PTP virtual clock time in core net. Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01ethtool: add a new command for getting PHC virtual clocksYangbo Lu1-0/+15
Add an interface for getting PHC (PTP Hardware Clock) virtual clocks, which are based on PHC physical clock providing hardware timestamp to network packets. Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drmLinus Torvalds9-49/+586
Pull drm updates from Dave Airlie: "Highlights: - AMD enables two more GPUs, with resulting header files - i915 has started to move to TTM for discrete GPU and enable DG1 discrete GPU support (not by default yet) - new HyperV drm driver - vmwgfx adds arm64 support - TTM refactoring ongoing - 16bpc display support for AMD hw Otherwise it's just the usual insane amounts of work all over the place in lots of drivers and the core, as mostly summarised below: Core: - mark AGP ioctls as legacy - disable force probing for non-master clients - HDR metadata property helpers - HDMI infoframe signal colorimetry support - remove drm_device.pdev pointer - remove DRM_KMS_FB_HELPER config option - remove drm_pci_alloc/free - drm_err_*/drm_dbg_* helpers - use drm driver names for fbdev - leaked DMA handle fix - 16bpc fixed point format fourcc - add prefetching memcpy for WC - Documentation fixes aperture: - add aperture ownership helpers dp: - aux fixes - downstream 0 port handling - use extended base receiver capability DPCD - Rename DP_PSR_SELECTIVE_UPDATE to better mach eDP spec - mst: use khz as link rate during init - VCPI fixes for StarTech hub ttm: - provide tt_shrink file via debugfs - warn about freeing pinned BOs - fix swapping error handling - move page alignment into BO - cleanup ttm_agp_backend - add ttm_sys_manager - don't override vm_ops - ttm_bo_mmap removed - make ttm_resource base of all managers - remove VM_MIXEDMAP usage panel: - sysfs_emit support - simple: runtime PM support - simple: power up panel when reading EDID + caching bridge: - MHDP8546: HDCP support + DT bindings - MHDP8546: Register DP AUX channel with userspace - TI SN65DSI83 + SN65DSI84: add driver - Sil8620: Fix module dependencies - dw-hdmi: make CEC driver loading optional - Ti-sn65dsi86: refclk fixes, subdrivers, runtime pm - It66121: Add driver + DT bindings - Adv7511: Support I2S IEC958 encoding - Anx7625: fix power-on delay - Nwi-dsi: Modesetting fixes; Cleanups - lt6911: add missing MODULE_DEVICE_TABLE - cdns: fix PM reference leak hyperv: - add new DRM driver for HyperV graphics efifb: - non-PCI device handling fixes i915: - refactor IP/device versioning - XeLPD Display IP preperation work - ADL-P enablement patches - DG1 uAPI behind BROKEN - disable mmap ioctl for discerte GPUs - start enabling HuC loading for Gen12+ - major GuC backend rework for new platforms - initial TTM support for Discrete GPUs - locking rework for TTM prep - use correct max source link rate for eDP - %p4cc format printing - GLK display fixes - VLV DSI panel power fixes - PSR2 disabled for RKL and ADL-S - ACPI _DSM invalid access fixed - DMC FW path abstraction - ADL-S PCI ID update - uAPI headers converted to kerneldoc - initial LMEM support for DG1 - x86/gpu: add Jasperlake to gen11 early quirks amdgpu: - Aldebaran updates + initial SR-IOV - new GPU: Beige Goby and Yellow Carp support - more LTTPR display work - Vangogh updates - SDMA 5.x GCR fixes - PCIe ASPM support - Renoir TMZ enablement - initial multiple eDP panel support - use fdinfo to track devices/process info - pin/unpin TTM fixes - free resource on fence usage query - fix fence calculation - fix hotunplug/suspend issues - GC/MM register access macro cleanup for SR-IOV - W=1 fixes - ACPI ATCS/ATIF handling rework - 16bpc fixed point format support - Initial smartshift support - RV/PCO power tuning fixes - new INFO query for additional vbios info amdkfd: - SR-IOV aldebaran support - HMM SVM support radeon: - SMU regression fixes - Oland flickering fix vmwgfx: - enable console with fbdev emulation - fix cpu updates of coherent multisample surfaces - remove reservation semaphore - add initial SVGA3 support - support arm64 msm: - devcoredump support for display errors - dpu/dsi: yaml bindings conversion - mdp5: alpha/blend_mode/zpos support - a6xx: cached coherent buffer support - gpu iova fault improvement - a660 support rockchip: - RK3036 win1 scaling support - RK3066/3188 missing register support - RK3036/3066/3126/3188 alpha support mediatek: - MT8167 HDMI support - MT8183 DPI dual edge support tegra: - fixed YUV support/scaling on Tegra186+ ast: - use pcim_iomap - fix DP501 EDID bochs: - screen blanking support etnaviv: - export more GPU ID values to userspace - add HWDB entry for GPU on i.MX8MP - rework linear window calcs exynos: - pm runtime changes imx: - Annotate dma_fence critical section - fix PRG modifiers after drmm conversion - Add 8 pixel alignment fix for 1366x768 - fix YUV advertising - add color properties ingenic: - IPU planes fix panfrost: - Mediatek MT8183 support + DT bindings - export AFBC_FEATURES register to userspace simpledrm: - %pr for printing resources nouveau: - pin/unpin TTM fixes qxl: - unpin shadow BO virtio: - create dumb BOs as guest blob vkms: - drmm_universal_plane_alloc - add XRGB plane composition - overlay support" * tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm: (1570 commits) drm/i915: Reinstate the mmap ioctl for some platforms drm/i915/dsc: abstract helpers to get bigjoiner primary/secondary crtc Revert "drm/msm/mdp5: provide dynamic bandwidth management" drm/msm/mdp5: provide dynamic bandwidth management drm/msm/mdp5: add perf blocks for holding fudge factors drm/msm/mdp5: switch to standard zpos property drm/msm/mdp5: add support for alpha/blend_mode properties drm/msm/mdp5: use drm_plane_state for pixel blend mode drm/msm/mdp5: use drm_plane_state for storing alpha value drm/msm/mdp5: use drm atomic helpers to handle base drm plane state drm/msm/dsi: do not enable PHYs when called for the slave DSI interface drm/msm: Add debugfs to trigger shrinker drm/msm/dpu: Avoid ABBA deadlock between IRQ modules drm/msm: devcoredump iommu fault support iommu/arm-smmu-qcom: Add stall support drm/msm: Improve the a6xx page fault handler iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info iommu/arm-smmu: Add support for driver IOMMU fault handlers drm/msm: export hangcheck_period in debugfs drm/msm/a6xx: add support for Adreno 660 GPU ...
2021-07-01Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-blockLinus Torvalds1-14/+14
Pull io_uring updates from Jens Axboe: - Multi-queue iopoll improvement (Fam) - Allow configurable io-wq CPU masks (me) - renameat/linkat tightening (me) - poll re-arm improvement (Olivier) - SQPOLL race fix (Olivier) - Cancelation unification (Pavel) - SQPOLL cleanups (Pavel) - Enable file backed buffers for shmem/memfd (Pavel) - A ton of cleanups and performance improvements (Pavel) - Followup and misc fixes (Colin, Fam, Hao, Olivier) * tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block: (83 commits) io_uring: code clean for kiocb_done() io_uring: spin in iopoll() only when reqs are in a single queue io_uring: pre-initialise some of req fields io_uring: refactor io_submit_flush_completions io_uring: optimise hot path restricted checks io_uring: remove not needed PF_EXITING check io_uring: mainstream sqpoll task_work running io_uring: refactor io_arm_poll_handler() io_uring: reduce latency by reissueing the operation io_uring: add IOPOLL and reserved field checks to IORING_OP_UNLINKAT io_uring: add IOPOLL and reserved field checks to IORING_OP_RENAMEAT io_uring: refactor io_openat2() io_uring: simplify struct io_uring_sqe layout io_uring: update sqe layout build checks io_uring: fix code style problems io_uring: refactor io_sq_thread() io_uring: don't change sqpoll creds if not needed io_uring: Create define to modify a SQPOLL parameter io_uring: Fix race condition when sqp thread goes to sleep io_uring: improve in tctx_task_work() resubmission ...
2021-07-01Merge tag 'fs_for_v5.14-rc1' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull misc fs updates from Jan Kara: "The new quotactl_fd() syscall (remake of quotactl_path() syscall that got introduced & disabled in 5.13 cycle), and couple of udf, reiserfs, isofs, and writeback fixes and cleanups" * tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: writeback: fix obtain a reference to a freeing memcg css quota: remove unnecessary oom message isofs: remove redundant continue statement quota: Wire up quotactl_fd syscall quota: Change quotactl_path() systcall to an fd-based one reiserfs: Remove unneed check in reiserfs_write_full_page() udf: Fix NULL pointer dereference in udf_symlink function reiserfs: add check for invalid 1st journal block
2021-07-01Merge branch 'for-next' into for-linusTakashi Iwai1-2/+28
2021-07-01mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tablesDavid Hildenbrand1-0/+3
I. Background: Sparse Memory Mappings When we manage sparse memory mappings dynamically in user space - also sometimes involving MAP_NORESERVE - we want to dynamically populate/ discard memory inside such a sparse memory region. Example users are hypervisors (especially implementing memory ballooning or similar technologies like virtio-mem) and memory allocators. In addition, we want to fail in a nice way (instead of generating SIGBUS) if populating does not succeed because we are out of backend memory (which can happen easily with file-based mappings, especially tmpfs and hugetlbfs). While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for reliably discarding memory for most mapping types, there is no generic approach to populate page tables and preallocate memory. Although mmap() supports MAP_POPULATE, it is not applicable to the concept of sparse memory mappings, where we want to populate/discard dynamically and avoid expensive/problematic remappings. In addition, we never actually report errors during the final populate phase - it is best-effort only. fallocate() can be used to preallocate file-based memory and fail in a safe way. However, it cannot really be used for any private mappings on anonymous files via memfd due to COW semantics. In addition, fallocate() does not actually populate page tables, so we still always get pagefaults on first access - which is sometimes undesired (i.e., real-time workloads) and requires real prefaulting of page tables, not just a preallocation of backend storage. There might be interesting use cases for sparse memory regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy as it does not prefault page tables. II. On preallcoation/prefaulting from user space Because we don't have a proper interface, what applications (like QEMU and databases) end up doing is touching (i.e., reading+writing one byte to not overwrite existing data) all individual pages. However, that approach 1) Can result in wear on storage backing, because we end up reading/writing each page; this is especially a problem for dax/pmem. 2) Can result in mmap_sem contention when prefaulting via multiple threads. 3) Requires expensive signal handling, especially to catch SIGBUS in case of hugetlbfs/shmem/file-backed memory. For example, this is problematic in hypervisors like QEMU where SIGBUS handlers might already be used by other subsystems concurrently to e.g, handle hardware errors. "Simply" doing preallocation concurrently from other thread is not that easy. III. On MADV_WILLNEED Extending MADV_WILLNEED is not an option because 1. It would change the semantics: "Expect access in the near future." and "might be a good idea to read some pages" vs. "Definitely populate/ preallocate all memory and definitely fail on errors.". 2. Existing users (like virtio-balloon in QEMU when deflating the balloon) don't want populate/prealloc semantics. They treat this rather as a hint to give a little performance boost without too much overhead - and don't expect that a lot of memory might get consumed or a lot of time might be spent. IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by MAP_POPULATE, with the following semantics: 1. MADV_POPULATE_READ can be used to prefault page tables just like manually reading each individual page. This will not break any COW mappings. The shared zero page might get mapped and no backend storage might get preallocated -- allocation might be deferred to write-fault time. Especially shared file mappings require an explicit fallocate() upfront to actually preallocate backend memory (blocks in the file system) in case the file might have holes. 2. If MADV_POPULATE_READ succeeds, all page tables have been populated (prefaulted) readable once. 3. MADV_POPULATE_WRITE can be used to preallocate backend memory and prefault page tables just like manually writing (or reading+writing) each individual page. This will break any COW mappings -- e.g., the shared zeropage is never populated. 4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated (prefaulted) writable once. 5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special mappings marked with VM_PFNMAP and VM_IO. Also, proper access permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such mapping is encountered, madvise() fails with -EINVAL. 6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables might have been populated. 7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON when encountering a HW poisoned page in the range. 8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot protect from the OOM (Out Of Memory) handler killing the process. While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e., preallocate memory and prefault page tables for VMs), one issue is that whenever we prefault pages writable, the pages have to be marked dirty, because the CPU could dirty them any time. while not a real problem for hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each page will be marked dirty and has to be written back later when evicting. MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole mapping from backend storage without marking it dirty, such that eviction won't have to write it back. As discussed above, shared file mappings might require an explciit fallocate() upfront to achieve preallcoation+prepopulation. Although sparse memory mappings are the primary use case, this will also be useful for other preallocate/prefault use cases where MAP_POPULATE is not desired or the semantics of MAP_POPULATE are not sufficient: as one example, QEMU users can trigger preallocation/prefaulting of guest RAM after the mapping was created -- and don't want errors to be silently suppressed. Looking at the history, MADV_POPULATE was already proposed in 2013 [1], however, the main motivation back than was performance improvements -- which should also still be the case. V. Single-threaded performance comparison I did a short experiment, prefaulting page tables on completely *empty mappings/files* and repeated the experiment 10 times. The results correspond to the shortest execution time. In general, the performance benefit for huge pages is negligible with small mappings. V.1: Private mappings POPULATE_READ and POPULATE_WRITE is fastest. Note that Reading/POPULATE_READ will populate the shared zeropage where applicable -- which result in short population times. The fastest way to allocate backend storage (here: swap or huge pages) and prefault page tables is POPULATE_WRITE. V.2: Shared mappings fallocate() is fastest, however, doesn't prefault page tables. POPULATE_WRITE is faster than simple writes and read/writes. POPULATE_READ is faster than simple reads. Without a fd, the fastest way to allocate backend storage and prefault page tables is POPULATE_WRITE. With an fd, the fastest way is usually FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one exception are actual files: FALLOCATE+Read is slightly faster than FALLOCATE+POPULATE_READ. The fastest way to allocate backend storage prefault page tables is FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then, FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as dirty. v.3: Detailed results ================================================== 2 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 0.119 ms Anon 4 KiB : Write : 0.222 ms Anon 4 KiB : Read/Write : 0.380 ms Anon 4 KiB : POPULATE_READ : 0.060 ms Anon 4 KiB : POPULATE_WRITE : 0.158 ms Memfd 4 KiB : Read : 0.034 ms Memfd 4 KiB : Write : 0.310 ms Memfd 4 KiB : Read/Write : 0.362 ms Memfd 4 KiB : POPULATE_READ : 0.039 ms Memfd 4 KiB : POPULATE_WRITE : 0.229 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.033 ms tmpfs : Write : 0.313 ms tmpfs : Read/Write : 0.406 ms tmpfs : POPULATE_READ : 0.039 ms tmpfs : POPULATE_WRITE : 0.285 ms file : Read : 0.033 ms file : Write : 0.351 ms file : Read/Write : 0.408 ms file : POPULATE_READ : 0.039 ms file : POPULATE_WRITE : 0.290 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 237.940 ms Anon 4 KiB : Write : 708.409 ms Anon 4 KiB : Read/Write : 1054.041 ms Anon 4 KiB : POPULATE_READ : 124.310 ms Anon 4 KiB : POPULATE_WRITE : 572.582 ms Memfd 4 KiB : Read : 136.928 ms Memfd 4 KiB : Write : 963.898 ms Memfd 4 KiB : Read/Write : 1106.561 ms Memfd 4 KiB : POPULATE_READ : 78.450 ms Memfd 4 KiB : POPULATE_WRITE : 805.881 ms Memfd 2 MiB : Read : 357.116 ms Memfd 2 MiB : Write : 357.210 ms Memfd 2 MiB : Read/Write : 357.606 ms Memfd 2 MiB : POPULATE_READ : 356.094 ms Memfd 2 MiB : POPULATE_WRITE : 356.937 ms tmpfs : Read : 137.536 ms tmpfs : Write : 954.362 ms tmpfs : Read/Write : 1105.954 ms tmpfs : POPULATE_READ : 80.289 ms tmpfs : POPULATE_WRITE : 822.826 ms file : Read : 137.874 ms file : Write : 987.025 ms file : Read/Write : 1107.439 ms file : POPULATE_READ : 80.413 ms file : POPULATE_WRITE : 857.622 ms hugetlbfs : Read : 355.607 ms hugetlbfs : Write : 355.729 ms hugetlbfs : Read/Write : 356.127 ms hugetlbfs : POPULATE_READ : 354.585 ms hugetlbfs : POPULATE_WRITE : 355.138 ms ************************************************** 2 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 0.394 ms Anon 4 KiB : Write : 0.348 ms Anon 4 KiB : Read/Write : 0.400 ms Anon 4 KiB : POPULATE_READ : 0.326 ms Anon 4 KiB : POPULATE_WRITE : 0.273 ms Anon 2 MiB : Read : 0.030 ms Anon 2 MiB : Write : 0.030 ms Anon 2 MiB : Read/Write : 0.030 ms Anon 2 MiB : POPULATE_READ : 0.030 ms Anon 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 4 KiB : Read : 0.412 ms Memfd 4 KiB : Write : 0.372 ms Memfd 4 KiB : Read/Write : 0.419 ms Memfd 4 KiB : POPULATE_READ : 0.343 ms Memfd 4 KiB : POPULATE_WRITE : 0.288 ms Memfd 4 KiB : FALLOCATE : 0.137 ms Memfd 4 KiB : FALLOCATE+Read : 0.446 ms Memfd 4 KiB : FALLOCATE+Write : 0.330 ms Memfd 4 KiB : FALLOCATE+Read/Write : 0.454 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 0.379 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 0.268 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 2 MiB : FALLOCATE : 0.030 ms Memfd 2 MiB : FALLOCATE+Read : 0.031 ms Memfd 2 MiB : FALLOCATE+Write : 0.031 ms Memfd 2 MiB : FALLOCATE+Read/Write : 0.031 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 0.030 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.416 ms tmpfs : Write : 0.369 ms tmpfs : Read/Write : 0.425 ms tmpfs : POPULATE_READ : 0.346 ms tmpfs : POPULATE_WRITE : 0.295 ms tmpfs : FALLOCATE : 0.139 ms tmpfs : FALLOCATE+Read : 0.447 ms tmpfs : FALLOCATE+Write : 0.333 ms tmpfs : FALLOCATE+Read/Write : 0.454 ms tmpfs : FALLOCATE+POPULATE_READ : 0.380 ms tmpfs : FALLOCATE+POPULATE_WRITE : 0.272 ms file : Read : 0.191 ms file : Write : 0.511 ms file : Read/Write : 0.524 ms file : POPULATE_READ : 0.196 ms file : POPULATE_WRITE : 0.434 ms file : FALLOCATE : 0.004 ms file : FALLOCATE+Read : 0.197 ms file : FALLOCATE+Write : 0.554 ms file : FALLOCATE+Read/Write : 0.480 ms file : FALLOCATE+POPULATE_READ : 0.201 ms file : FALLOCATE+POPULATE_WRITE : 0.381 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms hugetlbfs : FALLOCATE : 0.030 ms hugetlbfs : FALLOCATE+Read : 0.031 ms hugetlbfs : FALLOCATE+Write : 0.031 ms hugetlbfs : FALLOCATE+Read/Write : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_READ : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 1053.090 ms Anon 4 KiB : Write : 913.642 ms Anon 4 KiB : Read/Write : 1060.350 ms Anon 4 KiB : POPULATE_READ : 893.691 ms Anon 4 KiB : POPULATE_WRITE : 782.885 ms Anon 2 MiB : Read : 358.553 ms Anon 2 MiB : Write : 358.419 ms Anon 2 MiB : Read/Write : 357.992 ms Anon 2 MiB : POPULATE_READ : 357.533 ms Anon 2 MiB : POPULATE_WRITE : 357.808 ms Memfd 4 KiB : Read : 1078.144 ms Memfd 4 KiB : Write : 942.036 ms Memfd 4 KiB : Read/Write : 1100.391 ms Memfd 4 KiB : POPULATE_READ : 925.829 ms Memfd 4 KiB : POPULATE_WRITE : 804.394 ms Memfd 4 KiB : FALLOCATE : 304.632 ms Memfd 4 KiB : FALLOCATE+Read : 1163.359 ms Memfd 4 KiB : FALLOCATE+Write : 933.186 ms Memfd 4 KiB : FALLOCATE+Read/Write : 1187.304 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 1013.660 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 794.560 ms Memfd 2 MiB : Read : 358.131 ms Memfd 2 MiB : Write : 358.099 ms Memfd 2 MiB : Read/Write : 358.250 ms Memfd 2 MiB : POPULATE_READ : 357.563 ms Memfd 2 MiB : POPULATE_WRITE : 357.334 ms Memfd 2 MiB : FALLOCATE : 356.735 ms Memfd 2 MiB : FALLOCATE+Read : 358.152 ms Memfd 2 MiB : FALLOCATE+Write : 358.331 ms Memfd 2 MiB : FALLOCATE+Read/Write : 358.018 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 357.286 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 357.523 ms tmpfs : Read : 1087.265 ms tmpfs : Write : 950.840 ms tmpfs : Read/Write : 1107.567 ms tmpfs : POPULATE_READ : 922.605 ms tmpfs : POPULATE_WRITE : 810.094 ms tmpfs : FALLOCATE : 306.320 ms tmpfs : FALLOCATE+Read : 1169.796 ms tmpfs : FALLOCATE+Write : 933.730 ms tmpfs : FALLOCATE+Read/Write : 1191.610 ms tmpfs : FALLOCATE+POPULATE_READ : 1020.474 ms tmpfs : FALLOCATE+POPULATE_WRITE : 798.945 ms file : Read : 654.101 ms file : Write : 1259.142 ms file : Read/Write : 1289.509 ms file : POPULATE_READ : 661.642 ms file : POPULATE_WRITE : 1106.816 ms file : FALLOCATE : 1.864 ms file : FALLOCATE+Read : 656.328 ms file : FALLOCATE+Write : 1153.300 ms file : FALLOCATE+Read/Write : 1180.613 ms file : FALLOCATE+POPULATE_READ : 668.347 ms file : FALLOCATE+POPULATE_WRITE : 996.143 ms hugetlbfs : Read : 357.245 ms hugetlbfs : Write : 357.413 ms hugetlbfs : Read/Write : 357.120 ms hugetlbfs : POPULATE_READ : 356.321 ms hugetlbfs : POPULATE_WRITE : 356.693 ms hugetlbfs : FALLOCATE : 355.927 ms hugetlbfs : FALLOCATE+Read : 357.074 ms hugetlbfs : FALLOCATE+Write : 357.120 ms hugetlbfs : FALLOCATE+Read/Write : 356.983 ms hugetlbfs : FALLOCATE+POPULATE_READ : 356.413 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 356.266 ms ************************************************** [1] https://lkml.org/lkml/2013/6/27/698 [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policyFeng Tang1-1/+0
MPOL_LOCAL policy has been setup as a real policy, but it is still handled like a faked POL_PREFERRED policy with one internal MPOL_F_LOCAL flag bit set, and there are many places having to judge the real 'prefer' or the 'local' policy, which are quite confusing. In current code, there are 4 cases that MPOL_LOCAL are used: 1. user specifies 'local' policy 2. user specifies 'prefer' policy, but with empty nodemask 3. system 'default' policy is used 4. 'prefer' policy + valid 'preferred' node with MPOL_F_STATIC_NODES flag set, and when it is 'rebind' to a nodemask which doesn't contains the 'preferred' node, it will perform as 'local' policy So make 'local' a real policy instead of a fake 'prefer' one, and kill MPOL_F_LOCAL bit, which can greatly reduce the confusion for code reading. For case 4, the logic of mpol_rebind_preferred() is confusing, as Michal Hocko pointed out: : I do believe that rebinding preferred policy is just bogus and it should : be dropped altogether on the ground that a preference is a mere hint from : userspace where to start the allocation. Unless I am missing something : cpusets will be always authoritative for the final placement. The : preferred node just acts as a starting point and it should be really : preserved when cpusets changes. Otherwise we have a very subtle behavior : corner cases. So dump all the tricky transformation between 'prefer' and 'local', and just record the new nodemask of rebinding. [feng.tang@intel.com: fix a problem in mpol_set_nodemask(), per Michal Hocko] Link: https://lkml.kernel.org/r/1622560492-1294-3-git-send-email-feng.tang@intel.com [feng.tang@intel.com: refine code and comments of mpol_set_nodemask(), per Michal] Link: https://lkml.kernel.org/r/20210603081807.GE56979@shbuild999.sh.intel.com Link: https://lkml.kernel.org/r/1622469956-82897-3-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01userfaultfd/shmem: advertise shmem minor fault supportAxel Rasmussen1-1/+6
Now that the feature is fully implemented (the faulting path hooks exist so userspace is notified, and the ioctl to resolve such faults is available), advertise this as a supported feature. Link: https://lkml.kernel.org/r/20210503180737.2487560-6-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Brian Geffon <bgeffon@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Wang Qing <wangqing@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01Merge tag 'net-next-5.14' of ↵Linus Torvalds21-18/+326
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - BPF: - add syscall program type and libbpf support for generating instructions and bindings for in-kernel BPF loaders (BPF loaders for BPF), this is a stepping stone for signed BPF programs - infrastructure to migrate TCP child sockets from one listener to another in the same reuseport group/map to improve flexibility of service hand-off/restart - add broadcast support to XDP redirect - allow bypass of the lockless qdisc to improving performance (for pktgen: +23% with one thread, +44% with 2 threads) - add a simpler version of "DO_ONCE()" which does not require jump labels, intended for slow-path usage - virtio/vsock: introduce SOCK_SEQPACKET support - add getsocketopt to retrieve netns cookie - ip: treat lowest address of a IPv4 subnet as ordinary unicast address allowing reclaiming of precious IPv4 addresses - ipv6: use prandom_u32() for ID generation - ip: add support for more flexible field selection for hashing across multi-path routes (w/ offload to mlxsw) - icmp: add support for extended RFC 8335 PROBE (ping) - seg6: add support for SRv6 End.DT46 behavior - mptcp: - DSS checksum support (RFC 8684) to detect middlebox meddling - support Connection-time 'C' flag - time stamping support - sctp: packetization Layer Path MTU Discovery (RFC 8899) - xfrm: speed up state addition with seq set - WiFi: - hidden AP discovery on 6 GHz and other HE 6 GHz improvements - aggregation handling improvements for some drivers - minstrel improvements for no-ack frames - deferred rate control for TXQs to improve reaction times - switch from round robin to virtual time-based airtime scheduler - add trace points: - tcp checksum errors - openvswitch - action execution, upcalls - socket errors via sk_error_report Device APIs: - devlink: add rate API for hierarchical control of max egress rate of virtual devices (VFs, SFs etc.) - don't require RCU read lock to be held around BPF hooks in NAPI context - page_pool: generic buffer recycling New hardware/drivers: - mobile: - iosm: PCIe Driver for Intel M.2 Modem - support for Qualcomm MSM8998 (ipa) - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU) - NXP SJA1110 Automotive Ethernet 10-port switch - Qualcomm QCA8327 switch support (qca8k) - Mikrotik 10/25G NIC (atl1c) Driver changes: - ACPI support for some MDIO, MAC and PHY devices from Marvell and NXP (our first foray into MAC/PHY description via ACPI) - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx - Mellanox/Nvidia NIC (mlx5) - NIC VF offload of L2 bridging - support IRQ distribution to Sub-functions - Marvell (prestera): - add flower and match all - devlink trap - link aggregation - Netronome (nfp): connection tracking offload - Intel 1GE (igc): add AF_XDP support - Marvell DPU (octeontx2): ingress ratelimit offload - Google vNIC (gve): new ring/descriptor format support - Qualcomm mobile (rmnet & ipa): inline checksum offload support - MediaTek WiFi (mt76) - mt7915 MSI support - mt7915 Tx status reporting - mt7915 thermal sensors support - mt7921 decapsulation offload - mt7921 enable runtime pm and deep sleep - Realtek WiFi (rtw88) - beacon filter support - Tx antenna path diversity support - firmware crash information via devcoredump - Qualcomm WiFi (wcn36xx) - Wake-on-WLAN support with magic packets and GTK rekeying - Micrel PHY (ksz886x/ksz8081): add cable test support" * tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits) tcp: change ICSK_CA_PRIV_SIZE definition tcp_yeah: check struct yeah size at compile time gve: DQO: Fix off by one in gve_rx_dqo() stmmac: intel: set PCI_D3hot in suspend stmmac: intel: Enable PHY WOL option in EHL net: stmmac: option to enable PHY WOL with PMT enabled net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del} net: use netdev_info in ndo_dflt_fdb_{add,del} ptp: Set lookup cookie when creating a PTP PPS source. net: sock: add trace for socket errors net: sock: introduce sk_error_report net: dsa: replay the local bridge FDB entries pointing to the bridge dev too net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev net: dsa: include fdb entries pointing to bridge in the host fdb list net: dsa: include bridge addresses which are local in the host fdb list net: dsa: sync static FDB entries on foreign interfaces to hardware net: dsa: install the host MDB and FDB entries in the master's RX filter net: dsa: reference count the FDB addresses at the cross-chip notifier level net: dsa: introduce a separate cross-chip notifier type for host FDBs net: dsa: reference count the MDB entries at the cross-chip notifier level ...
2021-07-01Merge tag 'audit-pr-20210629' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Another merge window, another small audit pull request. Four patches in total: one is cosmetic, one removes an unnecessary initialization, one renames some enum values to prevent name collisions, and one converts list_del()/list_add() to list_move(). None of these are earth shattering and all pass the audit-testsuite tests while merging cleanly on top of your tree from earlier today" * tag 'audit-pr-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: remove unnecessary 'ret' initialization audit: remove trailing spaces and tabs audit: Use list_move instead of list_del/list_add audit: Rename enum audit_state constants to avoid AUDIT_DISABLED redefinition audit: add blank line after variable declarations
2021-07-01nbd: provide a way for userspace processes to identify device backendsPrasanna Kumar Kalever1-0/+1
Problem: On reconfigure of device, there is no way to defend if the backend storage is matching with the initial backend storage. Say, if an initial connect request for backend "pool1/image1" got mapped to /dev/nbd0 and the userspace process is terminated. A next reconfigure request within NBD_ATTR_DEAD_CONN_TIMEOUT is allowed to use /dev/nbd0 for a different backend "pool1/image2" For example, an operation like below could be dangerous: $ sudo rbd-nbd map --try-netlink rbd-pool/ext4-image /dev/nbd0 $ sudo blkid /dev/nbd0 /dev/nbd0: UUID="bfc444b4-64b1-418f-8b36-6e0d170cfc04" TYPE="ext4" $ sudo pkill -9 rbd-nbd $ sudo rbd-nbd attach --try-netlink --device /dev/nbd0 rbd-pool/xfs-image /dev/nbd0 $ sudo blkid /dev/nbd0 /dev/nbd0: UUID="d29bf343-6570-4069-a9ea-2fa156ced908" TYPE="xfs" Solution: Provide a way for userspace processes to keep some metadata to identify between the device and the backend, so that when a reconfigure request is made, we can compare and avoid such dangerous operations. With this solution, as part of the initial connect request, backend path can be stored in the sysfs per device config, so that on a reconfigure request it's easy to check if the backend path matches with the initial connect backend path. Please note, ioctl interface to nbd will not have these changes, as there won't be any reconfigure. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210429102828.31248-1-prasanna.kalever@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30io_uring: simplify struct io_uring_sqe layoutPavel Begunkov1-14/+10
Flatten struct io_uring_sqe, the last union is exactly 64B, so move them out of union { struct { ... }}, and decrease __pad2 size. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2e21ef7aed136293d654450bc3088973a8adc730.1624543113.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30Merge tag 'platform-drivers-x86-v5.14-1' of ↵Linus Torvalds1-2/+71
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver updates from Hans de Goede: "Highlights: - New think-lmi driver adding support for changing Lenovo Thinkpad BIOS settings from within Linux using the standard firmware- attributes class sysfs API - MS Surface aggregator-cdev now also supports forwarding events to user-space (for debugging / new driver development purposes only) - New intel_skl_int3472 driver this provides the necessary glue to translate ACPI table information to GPIOs, regulators, etc. for camera sensors on Intel devices with IPU3 attached MIPI cameras - A whole bunch of other fixes + device-specific quirk additions - New devm_work_autocancel() devm-helpers.h function" * tag 'platform-drivers-x86-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (83 commits) platform/x86: dell-wmi-sysman: Change user experience when Admin/System Password is modified platform/x86: intel_skl_int3472: Uninitialized variable in skl_int3472_handle_gpio_resources() platform/x86: think-lmi: Move kfree(setting->possible_values) to tlmi_attr_setting_release() platform/x86: think-lmi: Split current_value to reflect only the value platform/x86: think-lmi: Fix issues with duplicate attributes platform/x86: think-lmi: Return EINVAL when kbdlang gets set to a 0 length string platform/x86: intel_cht_int33fe: Move to its own subfolder platform/x86: intel_skl_int3472: Move to intel/ subfolder platform/x86: intel_skl_int3472: Provide skl_int3472_unregister_clock() platform/x86: intel_skl_int3472: Provide skl_int3472_unregister_regulator() platform/x86: intel_skl_int3472: Use ACPI GPIO resource directly platform/x86: intel_skl_int3472: Fix dependencies (drop CLKDEV_LOOKUP) platform/x86: intel_skl_int3472: Free ACPI device resources after use platform/x86: Remove "default n" entries platform/x86: ISST: Use numa node id for cpu pci dev mapping platform/x86: ISST: Optimize CPU to PCI device mapping tools/power/x86/intel-speed-select: v1.10 release tools/power/x86/intel-speed-select: Fix uncore memory frequency display extcon: extcon-max8997: Simplify driver using devm extcon: extcon-max8997: Fix IRQ freeing at error path ...
2021-06-30Merge tag 'fs.mount_setattr.nosymfollow.v5.14' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull mount_setattr updates from Christian Brauner: "A few releases ago the old mount API gained support for a mount options which prevents following symlinks on a given mount. This adds support for it in the new mount api through the MOUNT_ATTR_NOSYMFOLLOW flag via mount_setattr() and fsmount(). With mount_setattr() that flag can even be applied recursively. There's an additional ack from Ross Zwisler who originally authored the nosymfollow patch. As I've already had the patches in my for-next I didn't add his ack explicitly" * tag 'fs.mount_setattr.nosymfollow.v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: tests: test MOUNT_ATTR_NOSYMFOLLOW with mount_setattr() mount: Support "nosymfollow" in new mount api
2021-06-29Merge tag 'seccomp-v5.14-rc1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: - Add "atomic addfd + send reply" mode to SECCOMP_USER_NOTIF to better handle EINTR races visible to seccomp monitors. (Rodrigo Campos, Sargun Dhillon) - Improve seccomp selftests for readability in CI systems. (Kees Cook) * tag 'seccomp-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: selftests/seccomp: Avoid using "sysctl" for report selftests/seccomp: Flush benchmark output selftests/seccomp: More closely track fds being assigned selftests/seccomp: Add test for atomic addfd+send seccomp: Support atomic "addfd + send reply"
2021-06-29Merge tag 'for-5.14-tag' of ↵Linus Torvalds2-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "A normal mix of improvements, core changes and features that user have been missing or complaining about. User visible changes: - new sysfs exports: - add sysfs knob to limit scrub IO bandwidth per device - device stats are also available in /sys/fs/btrfs/FSID/devinfo/DEVID/error_stats - support cancellable resize and device delete ioctls - change how the empty value is interpreted when setting a property, so far we have only 'btrfs.compression' and we need to distinguish a reset to defaults and setting "do not compress", in general the empty value will always mean 'reset to defaults' for any other property, for compression it's either 'no' or 'none' to forbid compression Performance improvements: - no need for full sync when truncation does not touch extents, reported run time change is -12% - avoid unnecessary logging of xattrs during fast fsyncs (+17% throughput, -17% runtime on xattr stress workload) Core: - preemptive flushing improvements and fixes - adjust clamping logic on multi-threaded workloads to avoid flushing too soon - take into account global block reserve, may help on almost full filesystems - continue flushing when there are enough pending delalloc and ordered bytes - simplify logic around conditional transaction commit, a workaround used in the past for throttling that's been superseded by ticket reservations that manage the throttling in a better way - subpage blocksize preparation: - submit read time repair only for each corrupted sector - scrub repair now works with sectors and not pages - free space cache (v1) works with sectors and not pages - more fine grained bio tracking for extents - subpage support in page callbacks, extent callbacks, end io callbacks - simplify transaction abort logic and always abort and don't check various potentially unreliable stats tracked by the transaction - exclusive operations can do more checks when started and allow eg. cancellation of the same running operation - ensure relocation never runs while we have send operations running, e.g. when zoned background auto reclaim starts Fixes: - zoned: more sanity checks of write pointer - improve error handling in delayed inodes - send: - fix invalid path for unlink operations after parent orphanization - fix crash when memory allocations trigger reclaim - skip compression of we have only one page (can't make things better) - empty value of a property newly means reset to default Other: - lots of cleanups, comment updates, yearly typo fixing - disable build on platforms having page size 256K" * tag 'for-5.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (101 commits) btrfs: remove unused btrfs_fs_info::total_pinned btrfs: rip out btrfs_space_info::total_bytes_pinned btrfs: rip the first_ticket_bytes logic from fail_all_tickets btrfs: remove FLUSH_DELAYED_REFS from data ENOSPC flushing btrfs: rip out may_commit_transaction btrfs: send: fix crash when memory allocations trigger reclaim btrfs: ensure relocation never runs while we have send operations running btrfs: shorten integrity checker extent data mount option btrfs: switch mount option bits to enums and use wider type btrfs: props: change how empty value is interpreted btrfs: compression: don't try to compress if we don't have enough pages btrfs: fix unbalanced unlock in qgroup_account_snapshot() btrfs: sysfs: export dev stats in devinfo directory btrfs: fix typos in comments btrfs: remove a stale comment for btrfs_decompress_bio() btrfs: send: use list_move_tail instead of list_del/list_add_tail btrfs: disable build on platforms having page size 256K btrfs: send: fix invalid path for unlink operations after parent orphanization btrfs: inline wait_current_trans_commit_start in its caller btrfs: sink wait_for_unblock parameter to async commit ...
2021-06-29Merge tag 'media/v5.14-1' of ↵Linus Torvalds7-515/+132
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media updates from Mauro Carvalho Chehab: - V4L2 core control API was split into separate files - New RC maps: tango and tc-90405 - Hantro driver got support for G2/HEVC decoder - av7710 is moving to staging, together with some legacy APIs - several cleanups related to compat_ioctl32 code - Move the MPEG-2 stateless control type out of staging - Address several issues with RPM get logic on media drivers - Lots of cleanups, bug fixes and improvements. * tag 'media/v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (394 commits) media: s5p-mfc: Fix display delay control creation media: mtk-vpu: on suspend, read/write regs only if vpu is running media: video-mux: Skip dangling endpoints media: Fix Media Controller API config checks media: i2c: rdacm20: Re-work ov10635 reset media: i2c: rdacm20: Check return values media: i2c: rdacm20: Report camera module name media: i2c: rdacm20: Enable noise immunity media: i2c: rdacm20: Embed 'serializer' field media: i2c: rdacm21: Power up OV10640 before OV490 media: i2c: rdacm21: Fix OV10640 powerup media: i2c: rdacm21: Add delay after OV490 reset media: i2c: max9271: Introduce wake_up() function media: i2c: max9271: Check max9271_write() return media: i2c: max9286: Rework comments in .bound() media: i2c: max9286: Define high channel amplitude media: i2c: max9286: Cache channel amplitude media: i2c: max9286: Rename reverse_channel_mv media: i2c: max9286: Adjust parameters indent media: hantro: add support for Rockchip RK3036 ...
2021-06-29Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-0/+106
Pull kvm updates from Paolo Bonzini: "This covers all architectures (except MIPS) so I don't expect any other feature pull requests this merge window. ARM: - Add MTE support in guests, complete with tag save/restore interface - Reduce the impact of CMOs by moving them in the page-table code - Allow device block mappings at stage-2 - Reduce the footprint of the vmemmap in protected mode - Support the vGIC on dumb systems such as the Apple M1 - Add selftest infrastructure to support multiple configuration and apply that to PMU/non-PMU setups - Add selftests for the debug architecture - The usual crop of PMU fixes PPC: - Support for the H_RPT_INVALIDATE hypercall - Conversion of Book3S entry/exit to C - Bug fixes S390: - new HW facilities for guests - make inline assembly more robust with KASAN and co x86: - Allow userspace to handle emulation errors (unknown instructions) - Lazy allocation of the rmap (host physical -> guest physical address) - Support for virtualizing TSC scaling on VMX machines - Optimizations to avoid shattering huge pages at the beginning of live migration - Support for initializing the PDPTRs without loading them from memory - Many TLB flushing cleanups - Refuse to load if two-stage paging is available but NX is not (this has been a requirement in practice for over a year) - A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from CR0/CR4/EFER, using the MMU mode everywhere once it is computed from the CPU registers - Use PM notifier to notify the guest about host suspend or hibernate - Support for passing arguments to Hyper-V hypercalls using XMM registers - Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap on AMD processors - Hide Hyper-V hypercalls that are not included in the guest CPUID - Fixes for live migration of virtual machines that use the Hyper-V "enlightened VMCS" optimization of nested virtualization - Bugfixes (not many) Generic: - Support for retrieving statistics without debugfs - Cleanups for the KVM selftests API" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits) KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled kvm: x86: disable the narrow guest module parameter on unload selftests: kvm: Allows userspace to handle emulation errors. kvm: x86: Allow userspace to handle emulation errors KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic KVM: x86: Enhance comments for MMU roles and nested transition trickiness KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU KVM: x86/mmu: Use MMU's role to determine PTTYPE KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers KVM: x86/mmu: Add a helper to calculate root from role_regs KVM: x86/mmu: Add helper to update paging metadata KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0 KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper KVM: x86/mmu: Get nested MMU's root level from the MMU's role ...
2021-06-28Merge tag 'mac80211-next-for-net-next-2021-06-25' of ↵David S. Miller1-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes berg says: ==================== Lots of changes: * aggregation handling improvements for some drivers * hidden AP discovery on 6 GHz and other HE 6 GHz improvements * minstrel improvements for no-ack frames * deferred rate control for TXQs to improve reaction times * virtual time-based airtime scheduler * along with various little cleanups/fixups ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28seccomp: Support atomic "addfd + send reply"Rodrigo Campos1-0/+1
Alban Crequy reported a race condition userspace faces when we want to add some fds and make the syscall return them[1] using seccomp notify. The problem is that currently two different ioctl() calls are needed by the process handling the syscalls (agent) for another userspace process (target): SECCOMP_IOCTL_NOTIF_ADDFD to allocate the fd and SECCOMP_IOCTL_NOTIF_SEND to return that value. Therefore, it is possible for the agent to do the first ioctl to add a file descriptor but the target is interrupted (EINTR) before the agent does the second ioctl() call. This patch adds a flag to the ADDFD ioctl() so it adds the fd and returns that value atomically to the target program, as suggested by Kees Cook[2]. This is done by simply allowing seccomp_do_user_notification() to add the fd and return it in this case. Therefore, in this case the target wakes up from the wait in seccomp_do_user_notification() either to interrupt the syscall or to add the fd and return it. This "allocate an fd and return" functionality is useful for syscalls that return a file descriptor only, like connect(2). Other syscalls that return a file descriptor but not as return value (or return more than one fd), like socketpair(), pipe(), recvmsg with SCM_RIGHTs, will not work with this flag. This effectively combines SECCOMP_IOCTL_NOTIF_ADDFD and SECCOMP_IOCTL_NOTIF_SEND into an atomic opteration. The notification's return value, nor error can be set by the user. Upon successful invocation of the SECCOMP_IOCTL_NOTIF_ADDFD ioctl with the SECCOMP_ADDFD_FLAG_SEND flag, the notifying process's errno will be 0, and the return value will be the file descriptor number that was installed. [1]: https://lore.kernel.org/lkml/CADZs7q4sw71iNHmV8EOOXhUKJMORPzF7thraxZYddTZsxta-KQ@mail.gmail.com/ [2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/ Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Signed-off-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Tycho Andersen <tycho@tycho.pizza> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210517193908.3113-4-sargun@sargun.me
2021-06-28Merge tag 'sched-core-2021-06-28' of ↵Linus Torvalds1-0/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler udpates from Ingo Molnar: - Changes to core scheduling facilities: - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables coordinated scheduling across SMT siblings. This is a much requested feature for cloud computing platforms, to allow the flexible utilization of SMT siblings, without exposing untrusted domains to information leaks & side channels, plus to ensure more deterministic computing performance on SMT systems used by heterogenous workloads. There are new prctls to set core scheduling groups, which allows more flexible management of workloads that can share siblings. - Fix task->state access anti-patterns that may result in missed wakeups and rename it to ->__state in the process to catch new abuses. - Load-balancing changes: - Tweak newidle_balance for fair-sched, to improve 'memcache'-like workloads. - "Age" (decay) average idle time, to better track & improve workloads such as 'tbench'. - Fix & improve energy-aware (EAS) balancing logic & metrics. - Fix & improve the uclamp metrics. - Fix task migration (taskset) corner case on !CONFIG_CPUSET. - Fix RT and deadline utilization tracking across policy changes - Introduce a "burstable" CFS controller via cgroups, which allows bursty CPU-bound workloads to borrow a bit against their future quota to improve overall latencies & batching. Can be tweaked via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us. - Rework assymetric topology/capacity detection & handling. - Scheduler statistics & tooling: - Disable delayacct by default, but add a sysctl to enable it at runtime if tooling needs it. Use static keys and other optimizations to make it more palatable. - Use sched_clock() in delayacct, instead of ktime_get_ns(). - Misc cleanups and fixes. * tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) sched/doc: Update the CPU capacity asymmetry bits sched/topology: Rework CPU capacity asymmetry detection sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag psi: Fix race between psi_trigger_create/destroy sched/fair: Introduce the burstable CFS controller sched/uclamp: Fix uclamp_tg_restrict() sched/rt: Fix Deadline utilization tracking during policy change sched/rt: Fix RT utilization tracking during policy change sched: Change task_struct::state sched,arch: Remove unused TASK_STATE offsets sched,timer: Use __set_current_state() sched: Add get_current_state() sched,perf,kvm: Fix preemption condition sched: Introduce task_is_running() sched: Unbreak wakeups sched/fair: Age the average idle time sched/cpufreq: Consider reduced CPU capacity in energy calculation sched/fair: Take thermal pressure into account while estimating energy thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure sched/fair: Return early from update_tg_cfs_load() if delta == 0 ...
2021-06-28Merge tag 'locking-core-2021-06-28' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Core locking & atomics: - Convert all architectures to ARCH_ATOMIC: move every architecture to ARCH_ATOMIC, then get rid of ARCH_ATOMIC and all the transitory facilities and #ifdefs. Much reduction in complexity from that series: 63 files changed, 756 insertions(+), 4094 deletions(-) - Self-test enhancements - Futexes: - Add the new FUTEX_LOCK_PI2 ABI, which is a variant that doesn't set FLAGS_CLOCKRT (.e. uses CLOCK_MONOTONIC). [ The temptation to repurpose FUTEX_LOCK_PI's implicit setting of FLAGS_CLOCKRT & invert the flag's meaning to avoid having to introduce a new variant was resisted successfully. ] - Enhance futex self-tests - Lockdep: - Fix dependency path printouts - Optimize trace saving - Broaden & fix wait-context checks - Misc cleanups and fixes. * tag 'locking-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) locking/lockdep: Correct the description error for check_redundant() futex: Provide FUTEX_LOCK_PI2 to support clock selection futex: Prepare futex_lock_pi() for runtime clock selection lockdep/selftest: Remove wait-type RCU_CALLBACK tests lockdep/selftests: Fix selftests vs PROVE_RAW_LOCK_NESTING lockdep: Fix wait-type for empty stack locking/selftests: Add a selftest for check_irq_usage() lockding/lockdep: Avoid to find wrong lock dep path in check_irq_usage() locking/lockdep: Remove the unnecessary trace saving locking/lockdep: Fix the dep path printing for backwards BFS selftests: futex: Add futex compare requeue test selftests: futex: Add futex wait test seqlock: Remove trailing semicolon in macros locking/lockdep: Reduce LOCKDEP dependency list locking/lockdep,doc: Improve readability of the block matrix locking/atomics: atomic-instrumented: simplify ifdeffery locking/atomic: delete !ARCH_ATOMIC remnants locking/atomic: xtensa: move to ARCH_ATOMIC locking/atomic: sparc: move to ARCH_ATOMIC locking/atomic: sh: move to ARCH_ATOMIC ...
2021-06-25userfaultfd: uapi: fix UFFDIO_CONTINUE ioctl request definitionGleb Fotengauer-Malinovskiy1-2/+2
This ioctl request reads from uffdio_continue structure written by userspace which justifies _IOC_WRITE flag. It also writes back to that structure which justifies _IOC_READ flag. See NOTEs in include/uapi/asm-generic/ioctl.h for more information. Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") Signed-off-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Dmitry V. Levin <ldv@altlinux.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-25Merge tag 'kvmarm-5.14' of ↵Paolo Bonzini5-11/+45
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for v5.14. - Add MTE support in guests, complete with tag save/restore interface - Reduce the impact of CMOs by moving them in the page-table code - Allow device block mappings at stage-2 - Reduce the footprint of the vmemmap in protected mode - Support the vGIC on dumb systems such as the Apple M1 - Add selftest infrastructure to support multiple configuration and apply that to PMU/non-PMU setups - Add selftests for the debug architecture - The usual crop of PMU fixes
2021-06-25kvm: x86: Allow userspace to handle emulation errorsAaron Lewis1-0/+23
Add a fallback mechanism to the in-kernel instruction emulator that allows userspace the opportunity to process an instruction the emulator was unable to. When the in-kernel instruction emulator fails to process an instruction it will either inject a #UD into the guest or exit to userspace with exit reason KVM_INTERNAL_ERROR. This is because it does not know how to proceed in an appropriate manner. This feature lets userspace get involved to see if it can figure out a better path forward. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210510144834.658457-2-aaronlewis@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24net: retrieve netns cookie via getsocketoptMartynas Pumputis1-0/+2
It's getting more common to run nested container environments for testing cloud software. One of such examples is Kind [1] which runs a Kubernetes cluster in Docker containers on a single host. Each container acts as a Kubernetes node, and thus can run any Pod (aka container) inside the former. This approach simplifies testing a lot, as it eliminates complicated VM setups. Unfortunately, such a setup breaks some functionality when cgroupv2 BPF programs are used for load-balancing. The load-balancer BPF program needs to detect whether a request originates from the host netns or a container netns in order to allow some access, e.g. to a service via a loopback IP address. Typically, the programs detect this by comparing netns cookies with the one of the init ns via a call to bpf_get_netns_cookie(NULL). However, in nested environments the latter cannot be used given the Kubernetes node's netns is outside the init ns. To fix this, we need to pass the Kubernetes node netns cookie to the program in a different way: by extending getsockopt() with a SO_NETNS_COOKIE option, the orchestrator which runs in the Kubernetes node netns can retrieve the cookie and pass it to the program instead. Thus, this is following up on Eric's commit 3d368ab87cf6 ("net: initialize net->net_cookie at netns setup") to allow retrieval via SO_NETNS_COOKIE. This is also in line in how we retrieve socket cookie via SO_COOKIE. [1] https://kind.sigs.k8s.io/ Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Martynas Pumputis <m@lambda.lt> Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24KVM: stats: Add fd-based API to read binary stats dataJing Zhang1-0/+73
This commit defines the API for userspace and prepare the common functionalities to support per VM/VCPU binary stats data readings. The KVM stats now is only accessible by debugfs, which has some shortcomings this change series are supposed to fix: 1. The current debugfs stats solution in KVM could be disabled when kernel Lockdown mode is enabled, which is a potential rick for production. 2. The current debugfs stats solution in KVM is organized as "one stats per file", it is good for debugging, but not efficient for production. 3. The stats read/clear in current debugfs solution in KVM are protected by the global kvm_lock. Besides that, there are some other benefits with this change: 1. All KVM VM/VCPU stats can be read out in a bulk by one copy to userspace. 2. A schema is used to describe KVM statistics. From userspace's perspective, the KVM statistics are self-describing. 3. With the fd-based solution, a separate telemetry would be able to read KVM stats in a less privileged environment. 4. After the initial setup by reading in stats descriptors, a telemetry only needs to read the stats data itself, no more parsing or setup is needed. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-3-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24Merge tag 'asoc-fix-v5.13-rc7' of ↵Takashi Iwai5-12/+44
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-next ASoC: Fixes for v5.13 A final batch of fixes for v5.13, this is larger than I'd like due to the fixes for a series of suspend issues that Intel turned up in their testing this week.
2021-06-24Merge tag 'drm-msm-next-2021-06-23b' of ↵Dave Airlie1-4/+3
https://gitlab.freedesktop.org/drm/msm into drm-next * devcoredump support for display errors * dpu: irq cleanup/refactor * dpu: dt bindings conversion to yaml * dsi: dt bindings conversion to yaml * mdp5: alpha/blend_mode/zpos support * a6xx: cached coherent buffer support * a660 support * gpu iova fault improvements: - info about which block triggered the fault, etc - generation of gpu devcoredump on fault * assortment of other cleanups and fixes Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rob Clark <robdclark@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGs4=qsGBBbyn-4JWqW4-YUSTKh67X3DsPQ=T2D9aXKqNA@mail.gmail.com
2021-06-23tcp: Add stats for socket migration.Kuniyuki Iwashima1-0/+2
This commit adds two stats for the socket migration feature to evaluate the effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE). If the migration fails because of the own_req race in receiving ACK and sending SYN+ACK paths, we do not increment the failure stat. Then another CPU is responsible for the req. Link: https://lore.kernel.org/bpf/CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com/ Suggested-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>