diff options
Diffstat (limited to 'Documentation/x86')
-rw-r--r-- | Documentation/x86/boot.rst | 6 | ||||
-rw-r--r-- | Documentation/x86/cpuinfo.rst | 155 | ||||
-rw-r--r-- | Documentation/x86/index.rst | 2 | ||||
-rw-r--r-- | Documentation/x86/sva.rst | 257 |
4 files changed, 417 insertions, 3 deletions
diff --git a/Documentation/x86/boot.rst b/Documentation/x86/boot.rst index 7fafc7ac00d7..abb9fc164657 100644 --- a/Documentation/x86/boot.rst +++ b/Documentation/x86/boot.rst @@ -1342,8 +1342,8 @@ follow:: In addition to read/modify/write the setup header of the struct boot_params as that of 16-bit boot protocol, the boot loader should -also fill the additional fields of the struct boot_params as that -described in zero-page.txt. +also fill the additional fields of the struct boot_params as +described in chapter :doc:`zero-page`. After setting up the struct boot_params, the boot loader can load the 32/64-bit kernel in the same way as that of 16-bit boot protocol. @@ -1379,7 +1379,7 @@ can be calculated as follows:: In addition to read/modify/write the setup header of the struct boot_params as that of 16-bit boot protocol, the boot loader should also fill the additional fields of the struct boot_params as described -in zero-page.txt. +in chapter :doc:`zero-page`. After setting up the struct boot_params, the boot loader can load 64-bit kernel in the same way as that of 16-bit boot protocol, but diff --git a/Documentation/x86/cpuinfo.rst b/Documentation/x86/cpuinfo.rst new file mode 100644 index 000000000000..5d54c39a063f --- /dev/null +++ b/Documentation/x86/cpuinfo.rst @@ -0,0 +1,155 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +x86 Feature Flags +================= + +Introduction +============ + +On x86, flags appearing in /proc/cpuinfo have an X86_FEATURE definition +in arch/x86/include/asm/cpufeatures.h. If the kernel cares about a feature +or KVM want to expose the feature to a KVM guest, it can and should have +an X86_FEATURE_* defined. These flags represent hardware features as +well as software features. + +If users want to know if a feature is available on a given system, they +try to find the flag in /proc/cpuinfo. If a given flag is present, it +means that the kernel supports it and is currently making it available. +If such flag represents a hardware feature, it also means that the +hardware supports it. + +If the expected flag does not appear in /proc/cpuinfo, things are murkier. +Users need to find out the reason why the flag is missing and find the way +how to enable it, which is not always easy. There are several factors that +can explain missing flags: the expected feature failed to enable, the feature +is missing in hardware, platform firmware did not enable it, the feature is +disabled at build or run time, an old kernel is in use, or the kernel does +not support the feature and thus has not enabled it. In general, /proc/cpuinfo +shows features which the kernel supports. For a full list of CPUID flags +which the CPU supports, use tools/arch/x86/kcpuid. + +How are feature flags created? +============================== + +a: Feature flags can be derived from the contents of CPUID leaves. +------------------------------------------------------------------ +These feature definitions are organized mirroring the layout of CPUID +leaves and grouped in words with offsets as mapped in enum cpuid_leafs +in cpufeatures.h (see arch/x86/include/asm/cpufeatures.h for details). +If a feature is defined with a X86_FEATURE_<name> definition in +cpufeatures.h, and if it is detected at run time, the flags will be +displayed accordingly in /proc/cpuinfo. For example, the flag "avx2" +comes from X86_FEATURE_AVX2 in cpufeatures.h. + +b: Flags can be from scattered CPUID-based features. +---------------------------------------------------- +Hardware features enumerated in sparsely populated CPUID leaves get +software-defined values. Still, CPUID needs to be queried to determine +if a given feature is present. This is done in init_scattered_cpuid_features(). +For instance, X86_FEATURE_CQM_LLC is defined as 11*32 + 0 and its presence is +checked at runtime in the respective CPUID leaf [EAX=f, ECX=0] bit EDX[1]. + +The intent of scattering CPUID leaves is to not bloat struct +cpuinfo_x86.x86_capability[] unnecessarily. For instance, the CPUID leaf +[EAX=7, ECX=0] has 30 features and is dense, but the CPUID leaf [EAX=7, EAX=1] +has only one feature and would waste 31 bits of space in the x86_capability[] +array. Since there is a struct cpuinfo_x86 for each possible CPU, the wasted +memory is not trivial. + +c: Flags can be created synthetically under certain conditions for hardware features. +------------------------------------------------------------------------------------- +Examples of conditions include whether certain features are present in +MSR_IA32_CORE_CAPS or specific CPU models are identified. If the needed +conditions are met, the features are enabled by the set_cpu_cap or +setup_force_cpu_cap macros. For example, if bit 5 is set in MSR_IA32_CORE_CAPS, +the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and +"split_lock_detect" will be displayed. The flag "ring3mwait" will be +displayed only when running on INTEL_FAM6_XEON_PHI_[KNL|KNM] processors. + +d: Flags can represent purely software features. +------------------------------------------------ +These flags do not represent hardware features. Instead, they represent a +software feature implemented in the kernel. For example, Kernel Page Table +Isolation is purely software feature and its feature flag X86_FEATURE_PTI is +also defined in cpufeatures.h. + +Naming of Flags +=============== + +The script arch/x86/kernel/cpu/mkcapflags.sh processes the +#define X86_FEATURE_<name> from cpufeatures.h and generates the +x86_cap/bug_flags[] arrays in kernel/cpu/capflags.c. The names in the +resulting x86_cap/bug_flags[] are used to populate /proc/cpuinfo. The naming +of flags in the x86_cap/bug_flags[] are as follows: + +a: The name of the flag is from the string in X86_FEATURE_<name> by default. +---------------------------------------------------------------------------- +By default, the flag <name> in /proc/cpuinfo is extracted from the respective +X86_FEATURE_<name> in cpufeatures.h. For example, the flag "avx2" is from +X86_FEATURE_AVX2. + +b: The naming can be overridden. +-------------------------------- +If the comment on the line for the #define X86_FEATURE_* starts with a +double-quote character (""), the string inside the double-quote characters +will be the name of the flags. For example, the flag "sse4_1" comes from +the comment "sse4_1" following the X86_FEATURE_XMM4_1 definition. + +There are situations in which overriding the displayed name of the flag is +needed. For instance, /proc/cpuinfo is a userspace interface and must remain +constant. If, for some reason, the naming of X86_FEATURE_<name> changes, one +shall override the new naming with the name already used in /proc/cpuinfo. + +c: The naming override can be "", which means it will not appear in /proc/cpuinfo. +---------------------------------------------------------------------------------- +The feature shall be omitted from /proc/cpuinfo if it does not make sense for +the feature to be exposed to userspace. For example, X86_FEATURE_ALWAYS is +defined in cpufeatures.h but that flag is an internal kernel feature used +in the alternative runtime patching functionality. So, its name is overridden +with "". Its flag will not appear in /proc/cpuinfo. + +Flags are missing when one or more of these happen +================================================== + +a: The hardware does not enumerate support for it. +-------------------------------------------------- +For example, when a new kernel is running on old hardware or the feature is +not enabled by boot firmware. Even if the hardware is new, there might be a +problem enabling the feature at run time, the flag will not be displayed. + +b: The kernel does not know about the flag. +------------------------------------------- +For example, when an old kernel is running on new hardware. + +c: The kernel disabled support for it at compile-time. +------------------------------------------------------ +For example, if 5-level-paging is not enabled when building (i.e., +CONFIG_X86_5LEVEL is not selected) the flag "la57" will not show up [#f1]_. +Even though the feature will still be detected via CPUID, the kernel disables +it by clearing via setup_clear_cpu_cap(X86_FEATURE_LA57). + +d: The feature is disabled at boot-time. +---------------------------------------- +A feature can be disabled either using a command-line parameter or because +it failed to be enabled. The command-line parameter clearcpuid= can be used +to disable features using the feature number as defined in +/arch/x86/include/asm/cpufeatures.h. For instance, User Mode Instruction +Protection can be disabled using clearcpuid=514. The number 514 is calculated +from #define X86_FEATURE_UMIP (16*32 + 2). + +In addition, there exists a variety of custom command-line parameters that +disable specific features. The list of parameters includes, but is not limited +to, nofsgsbase, nosmap, and nosmep. 5-level paging can also be disabled using +"no5lvl". SMAP and SMEP are disabled with the aforementioned parameters, +respectively. + +e: The feature was known to be non-functional. +---------------------------------------------- +The feature was known to be non-functional because a dependency was +missing at runtime. For example, AVX flags will not show up if XSAVE feature +is disabled since they depend on XSAVE feature. Another example would be broken +CPUs and them missing microcode patches. Due to that, the kernel decides not to +enable a feature. + +.. [#f1] 5-level paging uses linear address of 57 bits. diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 265d9e9a093b..740ee7f87898 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -9,6 +9,7 @@ x86-specific Documentation :numbered: boot + cpuinfo topology exception-tables kernel-stacks @@ -30,3 +31,4 @@ x86-specific Documentation usb-legacy-support i386/index x86_64/index + sva diff --git a/Documentation/x86/sva.rst b/Documentation/x86/sva.rst new file mode 100644 index 000000000000..076efd51ef1f --- /dev/null +++ b/Documentation/x86/sva.rst @@ -0,0 +1,257 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================================== +Shared Virtual Addressing (SVA) with ENQCMD +=========================================== + +Background +========== + +Shared Virtual Addressing (SVA) allows the processor and device to use the +same virtual addresses avoiding the need for software to translate virtual +addresses to physical addresses. SVA is what PCIe calls Shared Virtual +Memory (SVM). + +In addition to the convenience of using application virtual addresses +by the device, it also doesn't require pinning pages for DMA. +PCIe Address Translation Services (ATS) along with Page Request Interface +(PRI) allow devices to function much the same way as the CPU handling +application page-faults. For more information please refer to the PCIe +specification Chapter 10: ATS Specification. + +Use of SVA requires IOMMU support in the platform. IOMMU is also +required to support the PCIe features ATS and PRI. ATS allows devices +to cache translations for virtual addresses. The IOMMU driver uses the +mmu_notifier() support to keep the device TLB cache and the CPU cache in +sync. When an ATS lookup fails for a virtual address, the device should +use the PRI in order to request the virtual address to be paged into the +CPU page tables. The device must use ATS again in order the fetch the +translation before use. + +Shared Hardware Workqueues +========================== + +Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits +the use of Shared Work Queues (SWQ) by both applications and Virtual +Machines (VM's). This allows better hardware utilization vs. hard +partitioning resources that could result in under utilization. In order to +allow the hardware to distinguish the context for which work is being +executed in the hardware by SWQ interface, SIOV uses Process Address Space +ID (PASID), which is a 20-bit number defined by the PCIe SIG. + +PASID value is encoded in all transactions from the device. This allows the +IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe +Resource Identifier (RID) which is the Bus/Device/Function. + + +ENQCMD +====== + +ENQCMD is a new instruction on Intel platforms that atomically submits a +work descriptor to a device. The descriptor includes the operation to be +performed, virtual addresses of all parameters, virtual address of a completion +record, and the PASID (process address space ID) of the current process. + +ENQCMD works with non-posted semantics and carries a status back if the +command was accepted by hardware. This allows the submitter to know if the +submission needs to be retried or other device specific mechanisms to +implement fairness or ensure forward progress should be provided. + +ENQCMD is the glue that ensures applications can directly submit commands +to the hardware and also permits hardware to be aware of application context +to perform I/O operations via use of PASID. + +Process Address Space Tagging +============================= + +A new thread-scoped MSR (IA32_PASID) provides the connection between +user processes and the rest of the hardware. When an application first +accesses an SVA-capable device, this MSR is initialized with a newly +allocated PASID. The driver for the device calls an IOMMU-specific API +that sets up the routing for DMA and page-requests. + +For example, the Intel Data Streaming Accelerator (DSA) uses +iommu_sva_bind_device(), which will do the following: + +- Allocate the PASID, and program the process page-table (%cr3 register) in the + PASID context entries. +- Register for mmu_notifier() to track any page-table invalidations to keep + the device TLB in sync. For example, when a page-table entry is invalidated, + the IOMMU propagates the invalidation to the device TLB. This will force any + future access by the device to this virtual address to participate in + ATS. If the IOMMU responds with proper response that a page is not + present, the device would request the page to be paged in via the PCIe PRI + protocol before performing I/O. + +This MSR is managed with the XSAVE feature set as "supervisor state" to +ensure the MSR is updated during context switch. + +PASID Management +================ + +The kernel must allocate a PASID on behalf of each process which will use +ENQCMD and program it into the new MSR to communicate the process identity to +platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests +from this process. When a user submits a work descriptor to a device using the +ENQCMD instruction, the PASID field in the descriptor is auto-filled with the +value from MSR_IA32_PASID. Requests for DMA from the device are also tagged +with the same PASID. The platform IOMMU uses the PASID in the transaction to +perform address translation. The IOMMU APIs setup the corresponding PASID +entry in IOMMU with the process address used by the CPU (e.g. %cr3 register in +x86). + +The MSR must be configured on each logical CPU before any application +thread can interact with a device. Threads that belong to the same +process share the same page tables, thus the same MSR value. + +PASID is cleared when a process is created. The PASID allocation and MSR +programming may occur long after a process and its threads have been created. +One thread must call iommu_sva_bind_device() to allocate the PASID for the +process. If a thread uses ENQCMD without the MSR first being populated, a #GP +will be raised. The kernel will update the PASID MSR with the PASID for all +threads in the process. A single process PASID can be used simultaneously +with multiple devices since they all share the same address space. + +One thread can call iommu_sva_unbind_device() to free the allocated PASID. +The kernel will clear the PASID MSR for all threads belonging to the process. + +New threads inherit the MSR value from the parent. + +Relationships +============= + + * Each process has many threads, but only one PASID. + * Devices have a limited number (~10's to 1000's) of hardware workqueues. + The device driver manages allocating hardware workqueues. + * A single mmap() maps a single hardware workqueue as a "portal" and + each portal maps down to a single workqueue. + * For each device with which a process interacts, there must be + one or more mmap()'d portals. + * Many threads within a process can share a single portal to access + a single device. + * Multiple processes can separately mmap() the same portal, in + which case they still share one device hardware workqueue. + * The single process-wide PASID is used by all threads to interact + with all devices. There is not, for instance, a PASID for each + thread or each thread<->device pair. + +FAQ +=== + +* What is SVA/SVM? + +Shared Virtual Addressing (SVA) permits I/O hardware and the processor to +work in the same address space, i.e., to share it. Some call it Shared +Virtual Memory (SVM), but Linux community wanted to avoid confusing it with +POSIX Shared Memory and Secure Virtual Machines which were terms already in +circulation. + +* What is a PASID? + +A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet +(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS. +PASID is included in all transactions between the platform and the device. + +* How are shared workqueues different? + +Traditionally, in order for userspace applications to interact with hardware, +there is a separate hardware instance required per process. For example, +consider doorbells as a mechanism of informing hardware about work to process. +Each doorbell is required to be spaced 4k (or page-size) apart for process +isolation. This requires hardware to provision that space and reserve it in +MMIO. This doesn't scale as the number of threads becomes quite large. The +hardware also manages the queue depth for Shared Work Queues (SWQ), and +consumers don't need to track queue depth. If there is no space to accept +a command, the device will return an error indicating retry. + +A user should check Deferrable Memory Write (DMWr) capability on the device +and only submits ENQCMD when the device supports it. In the new DMWr PCIe +terminology, devices need to support DMWr completer capability. In addition, +it requires all switch ports to support DMWr routing and must be enabled by +the PCIe subsystem, much like how PCIe atomic operations are managed for +instance. + +SWQ allows hardware to provision just a single address in the device. When +used with ENQCMD to submit work, the device can distinguish the process +submitting the work since it will include the PASID assigned to that +process. This helps the device scale to a large number of processes. + +* Is this the same as a user space device driver? + +Communicating with the device via the shared workqueue is much simpler +than a full blown user space driver. The kernel driver does all the +initialization of the hardware. User space only needs to worry about +submitting work and processing completions. + +* Is this the same as SR-IOV? + +Single Root I/O Virtualization (SR-IOV) focuses on providing independent +hardware interfaces for virtualizing hardware. Hence, it's required to be +almost fully functional interface to software supporting the traditional +BARs, space for interrupts via MSI-X, its own register layout. +Virtual Functions (VFs) are assisted by the Physical Function (PF) +driver. + +Scalable I/O Virtualization builds on the PASID concept to create device +instances for virtualization. SIOV requires host software to assist in +creating virtual devices; each virtual device is represented by a PASID +along with the bus/device/function of the device. This allows device +hardware to optimize device resource creation and can grow dynamically on +demand. SR-IOV creation and management is very static in nature. Consult +references below for more details. + +* Why not just create a virtual function for each app? + +Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require +duplicated hardware for PCI config space and interrupts such as MSI-X. +Resources such as interrupts have to be hard partitioned between VFs at +creation time, and cannot scale dynamically on demand. The VFs are not +completely independent from the Physical Function (PF). Most VFs require +some communication and assistance from the PF driver. SIOV, in contrast, +creates a software-defined device where all the configuration and control +aspects are mediated via the slow path. The work submission and completion +happen without any mediation. + +* Does this support virtualization? + +ENQCMD can be used from within a guest VM. In these cases, the VMM helps +with setting up a translation table to translate from Guest PASID to Host +PASID. Please consult the ENQCMD instruction set reference for more +details. + +* Does memory need to be pinned? + +When devices support SVA along with platform hardware such as IOMMU +supporting such devices, there is no need to pin memory for DMA purposes. +Devices that support SVA also support other PCIe features that remove the +pinning requirement for memory. + +Device TLB support - Device requests the IOMMU to lookup an address before +use via Address Translation Service (ATS) requests. If the mapping exists +but there is no page allocated by the OS, IOMMU hardware returns that no +mapping exists. + +Device requests the virtual address to be mapped via Page Request +Interface (PRI). Once the OS has successfully completed the mapping, it +returns the response back to the device. The device requests again for +a translation and continues. + +IOMMU works with the OS in managing consistency of page-tables with the +device. When removing pages, it interacts with the device to remove any +device TLB entry that might have been cached before removing the mappings from +the OS. + +References +========== + +VT-D: +https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d + +SIOV: +https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux + +ENQCMD in ISE: +https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf + +DSA spec: +https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf |