diff options
Diffstat (limited to 'Documentation/admin-guide')
23 files changed, 1166 insertions, 40 deletions
diff --git a/Documentation/admin-guide/acpi/cppc_sysfs.rst b/Documentation/admin-guide/acpi/cppc_sysfs.rst new file mode 100644 index 000000000000..a4b99afbe331 --- /dev/null +++ b/Documentation/admin-guide/acpi/cppc_sysfs.rst @@ -0,0 +1,76 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================== +Collaborative Processor Performance Control (CPPC) +================================================== + +CPPC +==== + +CPPC defined in the ACPI spec describes a mechanism for the OS to manage the +performance of a logical processor on a contigious and abstract performance +scale. CPPC exposes a set of registers to describe abstract performance scale, +to request performance levels and to measure per-cpu delivered performance. + +For more details on CPPC please refer to the ACPI specification at: + +http://uefi.org/specifications + +Some of the CPPC registers are exposed via sysfs under:: + + /sys/devices/system/cpu/cpuX/acpi_cppc/ + +for each cpu X:: + + $ ls -lR /sys/devices/system/cpu/cpu0/acpi_cppc/ + /sys/devices/system/cpu/cpu0/acpi_cppc/: + total 0 + -r--r--r-- 1 root root 65536 Mar 5 19:38 feedback_ctrs + -r--r--r-- 1 root root 65536 Mar 5 19:38 highest_perf + -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_freq + -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_nonlinear_perf + -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_perf + -r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_freq + -r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_perf + -r--r--r-- 1 root root 65536 Mar 5 19:38 reference_perf + -r--r--r-- 1 root root 65536 Mar 5 19:38 wraparound_time + +* highest_perf : Highest performance of this processor (abstract scale). +* nominal_perf : Highest sustained performance of this processor + (abstract scale). +* lowest_nonlinear_perf : Lowest performance of this processor with nonlinear + power savings (abstract scale). +* lowest_perf : Lowest performance of this processor (abstract scale). + +* lowest_freq : CPU frequency corresponding to lowest_perf (in MHz). +* nominal_freq : CPU frequency corresponding to nominal_perf (in MHz). + The above frequencies should only be used to report processor performance in + freqency instead of abstract scale. These values should not be used for any + functional decisions. + +* feedback_ctrs : Includes both Reference and delivered performance counter. + Reference counter ticks up proportional to processor's reference performance. + Delivered counter ticks up proportional to processor's delivered performance. +* wraparound_time: Minimum time for the feedback counters to wraparound + (seconds). +* reference_perf : Performance level at which reference performance counter + accumulates (abstract scale). + + +Computing Average Delivered Performance +======================================= + +Below describes the steps to compute the average performance delivered by +taking two different snapshots of feedback counters at time T1 and T2. + + T1: Read feedback_ctrs as fbc_t1 + Wait or run some workload + + T2: Read feedback_ctrs as fbc_t2 + +:: + + delivered_counter_delta = fbc_t2[del] - fbc_t1[del] + reference_counter_delta = fbc_t2[ref] - fbc_t1[ref] + + delivered_perf = (refernce_perf x delivered_counter_delta) / reference_counter_delta diff --git a/Documentation/admin-guide/acpi/dsdt-override.rst b/Documentation/admin-guide/acpi/dsdt-override.rst new file mode 100644 index 000000000000..50bd7f194bf4 --- /dev/null +++ b/Documentation/admin-guide/acpi/dsdt-override.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +Overriding DSDT +=============== + +Linux supports a method of overriding the BIOS DSDT: + +CONFIG_ACPI_CUSTOM_DSDT - builds the image into the kernel. + +When to use this method is described in detail on the +Linux/ACPI home page: +https://01.org/linux-acpi/documentation/overriding-dsdt diff --git a/Documentation/admin-guide/acpi/index.rst b/Documentation/admin-guide/acpi/index.rst new file mode 100644 index 000000000000..4d13eeea1eca --- /dev/null +++ b/Documentation/admin-guide/acpi/index.rst @@ -0,0 +1,14 @@ +============ +ACPI Support +============ + +Here we document in detail how to interact with various mechanisms in +the Linux ACPI support. + +.. toctree:: + :maxdepth: 1 + + initrd_table_override + dsdt-override + ssdt-overlays + cppc_sysfs diff --git a/Documentation/admin-guide/acpi/initrd_table_override.rst b/Documentation/admin-guide/acpi/initrd_table_override.rst new file mode 100644 index 000000000000..cbd768207631 --- /dev/null +++ b/Documentation/admin-guide/acpi/initrd_table_override.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +Upgrading ACPI tables via initrd +================================ + +What is this about +================== + +If the ACPI_TABLE_UPGRADE compile option is true, it is possible to +upgrade the ACPI execution environment that is defined by the ACPI tables +via upgrading the ACPI tables provided by the BIOS with an instrumented, +modified, more recent version one, or installing brand new ACPI tables. + +When building initrd with kernel in a single image, option +ACPI_TABLE_OVERRIDE_VIA_BUILTIN_INITRD should also be true for this +feature to work. + +For a full list of ACPI tables that can be upgraded/installed, take a look +at the char `*table_sigs[MAX_ACPI_SIGNATURE];` definition in +drivers/acpi/tables.c. + +All ACPI tables iasl (Intel's ACPI compiler and disassembler) knows should +be overridable, except: + + - ACPI_SIG_RSDP (has a signature of 6 bytes) + - ACPI_SIG_FACS (does not have an ordinary ACPI table header) + +Both could get implemented as well. + + +What is this for +================ + +Complain to your platform/BIOS vendor if you find a bug which is so severe +that a workaround is not accepted in the Linux kernel. And this facility +allows you to upgrade the buggy tables before your platform/BIOS vendor +releases an upgraded BIOS binary. + +This facility can be used by platform/BIOS vendors to provide a Linux +compatible environment without modifying the underlying platform firmware. + +This facility also provides a powerful feature to easily debug and test +ACPI BIOS table compatibility with the Linux kernel by modifying old +platform provided ACPI tables or inserting new ACPI tables. + +It can and should be enabled in any kernel because there is no functional +change with not instrumented initrds. + + +How does it work +================ +:: + + # Extract the machine's ACPI tables: + cd /tmp + acpidump >acpidump + acpixtract -a acpidump + # Disassemble, modify and recompile them: + iasl -d *.dat + # For example add this statement into a _PRT (PCI Routing Table) function + # of the DSDT: + Store("HELLO WORLD", debug) + # And increase the OEM Revision. For example, before modification: + DefinitionBlock ("DSDT.aml", "DSDT", 2, "INTEL ", "TEMPLATE", 0x00000000) + # After modification: + DefinitionBlock ("DSDT.aml", "DSDT", 2, "INTEL ", "TEMPLATE", 0x00000001) + iasl -sa dsdt.dsl + # Add the raw ACPI tables to an uncompressed cpio archive. + # They must be put into a /kernel/firmware/acpi directory inside the cpio + # archive. Note that if the table put here matches a platform table + # (similar Table Signature, and similar OEMID, and similar OEM Table ID) + # with a more recent OEM Revision, the platform table will be upgraded by + # this table. If the table put here doesn't match a platform table + # (dissimilar Table Signature, or dissimilar OEMID, or dissimilar OEM Table + # ID), this table will be appended. + mkdir -p kernel/firmware/acpi + cp dsdt.aml kernel/firmware/acpi + # A maximum of "NR_ACPI_INITRD_TABLES (64)" tables are currently allowed + # (see osl.c): + iasl -sa facp.dsl + iasl -sa ssdt1.dsl + cp facp.aml kernel/firmware/acpi + cp ssdt1.aml kernel/firmware/acpi + # The uncompressed cpio archive must be the first. Other, typically + # compressed cpio archives, must be concatenated on top of the uncompressed + # one. Following command creates the uncompressed cpio archive and + # concatenates the original initrd on top: + find kernel | cpio -H newc --create > /boot/instrumented_initrd + cat /boot/initrd >>/boot/instrumented_initrd + # reboot with increased acpi debug level, e.g. boot params: + acpi.debug_level=0x2 acpi.debug_layer=0xFFFFFFFF + # and check your syslog: + [ 1.268089] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT] + [ 1.272091] [ACPI Debug] String [0x0B] "HELLO WORLD" + +iasl is able to disassemble and recompile quite a lot different, +also static ACPI tables. + + +Where to retrieve userspace tools +================================= + +iasl and acpixtract are part of Intel's ACPICA project: +http://acpica.org/ + +and should be packaged by distributions (for example in the acpica package +on SUSE). + +acpidump can be found in Len Browns pmtools: +ftp://kernel.org/pub/linux/kernel/people/lenb/acpi/utils/pmtools/acpidump + +This tool is also part of the acpica package on SUSE. +Alternatively, used ACPI tables can be retrieved via sysfs in latest kernels: +/sys/firmware/acpi/tables diff --git a/Documentation/admin-guide/acpi/ssdt-overlays.rst b/Documentation/admin-guide/acpi/ssdt-overlays.rst new file mode 100644 index 000000000000..da37455f96c9 --- /dev/null +++ b/Documentation/admin-guide/acpi/ssdt-overlays.rst @@ -0,0 +1,180 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +SSDT Overlays +============= + +In order to support ACPI open-ended hardware configurations (e.g. development +boards) we need a way to augment the ACPI configuration provided by the firmware +image. A common example is connecting sensors on I2C / SPI buses on development +boards. + +Although this can be accomplished by creating a kernel platform driver or +recompiling the firmware image with updated ACPI tables, neither is practical: +the former proliferates board specific kernel code while the latter requires +access to firmware tools which are often not publicly available. + +Because ACPI supports external references in AML code a more practical +way to augment firmware ACPI configuration is by dynamically loading +user defined SSDT tables that contain the board specific information. + +For example, to enumerate a Bosch BMA222E accelerometer on the I2C bus of the +Minnowboard MAX development board exposed via the LSE connector [1], the +following ASL code can be used:: + + DefinitionBlock ("minnowmax.aml", "SSDT", 1, "Vendor", "Accel", 0x00000003) + { + External (\_SB.I2C6, DeviceObj) + + Scope (\_SB.I2C6) + { + Device (STAC) + { + Name (_ADR, Zero) + Name (_HID, "BMA222E") + + Method (_CRS, 0, Serialized) + { + Name (RBUF, ResourceTemplate () + { + I2cSerialBus (0x0018, ControllerInitiated, 0x00061A80, + AddressingMode7Bit, "\\_SB.I2C6", 0x00, + ResourceConsumer, ,) + GpioInt (Edge, ActiveHigh, Exclusive, PullDown, 0x0000, + "\\_SB.GPO2", 0x00, ResourceConsumer, , ) + { // Pin list + 0 + } + }) + Return (RBUF) + } + } + } + } + +which can then be compiled to AML binary format:: + + $ iasl minnowmax.asl + + Intel ACPI Component Architecture + ASL Optimizing Compiler version 20140214-64 [Mar 29 2014] + Copyright (c) 2000 - 2014 Intel Corporation + + ASL Input: minnomax.asl - 30 lines, 614 bytes, 7 keywords + AML Output: minnowmax.aml - 165 bytes, 6 named objects, 1 executable opcodes + +[1] http://wiki.minnowboard.org/MinnowBoard_MAX#Low_Speed_Expansion_Connector_.28Top.29 + +The resulting AML code can then be loaded by the kernel using one of the methods +below. + +Loading ACPI SSDTs from initrd +============================== + +This option allows loading of user defined SSDTs from initrd and it is useful +when the system does not support EFI or when there is not enough EFI storage. + +It works in a similar way with initrd based ACPI tables override/upgrade: SSDT +aml code must be placed in the first, uncompressed, initrd under the +"kernel/firmware/acpi" path. Multiple files can be used and this will translate +in loading multiple tables. Only SSDT and OEM tables are allowed. See +initrd_table_override.txt for more details. + +Here is an example:: + + # Add the raw ACPI tables to an uncompressed cpio archive. + # They must be put into a /kernel/firmware/acpi directory inside the + # cpio archive. + # The uncompressed cpio archive must be the first. + # Other, typically compressed cpio archives, must be + # concatenated on top of the uncompressed one. + mkdir -p kernel/firmware/acpi + cp ssdt.aml kernel/firmware/acpi + + # Create the uncompressed cpio archive and concatenate the original initrd + # on top: + find kernel | cpio -H newc --create > /boot/instrumented_initrd + cat /boot/initrd >>/boot/instrumented_initrd + +Loading ACPI SSDTs from EFI variables +===================================== + +This is the preferred method, when EFI is supported on the platform, because it +allows a persistent, OS independent way of storing the user defined SSDTs. There +is also work underway to implement EFI support for loading user defined SSDTs +and using this method will make it easier to convert to the EFI loading +mechanism when that will arrive. + +In order to load SSDTs from an EFI variable the efivar_ssdt kernel command line +parameter can be used. The argument for the option is the variable name to +use. If there are multiple variables with the same name but with different +vendor GUIDs, all of them will be loaded. + +In order to store the AML code in an EFI variable the efivarfs filesystem can be +used. It is enabled and mounted by default in /sys/firmware/efi/efivars in all +recent distribution. + +Creating a new file in /sys/firmware/efi/efivars will automatically create a new +EFI variable. Updating a file in /sys/firmware/efi/efivars will update the EFI +variable. Please note that the file name needs to be specially formatted as +"Name-GUID" and that the first 4 bytes in the file (little-endian format) +represent the attributes of the EFI variable (see EFI_VARIABLE_MASK in +include/linux/efi.h). Writing to the file must also be done with one write +operation. + +For example, you can use the following bash script to create/update an EFI +variable with the content from a given file:: + + #!/bin/sh -e + + while ! [ -z "$1" ]; do + case "$1" in + "-f") filename="$2"; shift;; + "-g") guid="$2"; shift;; + *) name="$1";; + esac + shift + done + + usage() + { + echo "Syntax: ${0##*/} -f filename [ -g guid ] name" + exit 1 + } + + [ -n "$name" -a -f "$filename" ] || usage + + EFIVARFS="/sys/firmware/efi/efivars" + + [ -d "$EFIVARFS" ] || exit 2 + + if stat -tf $EFIVARFS | grep -q -v de5e81e4; then + mount -t efivarfs none $EFIVARFS + fi + + # try to pick up an existing GUID + [ -n "$guid" ] || guid=$(find "$EFIVARFS" -name "$name-*" | head -n1 | cut -f2- -d-) + + # use a randomly generated GUID + [ -n "$guid" ] || guid="$(cat /proc/sys/kernel/random/uuid)" + + # efivarfs expects all of the data in one write + tmp=$(mktemp) + /bin/echo -ne "\007\000\000\000" | cat - $filename > $tmp + dd if=$tmp of="$EFIVARFS/$name-$guid" bs=$(stat -c %s $tmp) + rm $tmp + +Loading ACPI SSDTs from configfs +================================ + +This option allows loading of user defined SSDTs from userspace via the configfs +interface. The CONFIG_ACPI_CONFIGFS option must be select and configfs must be +mounted. In the following examples, we assume that configfs has been mounted in +/config. + +New tables can be loading by creating new directories in /config/acpi/table/ and +writing the SSDT aml code in the aml attribute:: + + cd /config/acpi/table + mkdir my_ssdt + cat ~/ssdt.aml > my_ssdt/aml diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 20f92c16ffbf..88e746074252 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -864,6 +864,8 @@ All cgroup core files are prefixed with "cgroup." populated 1 if the cgroup or its descendants contains any live processes; otherwise, 0. + frozen + 1 if the cgroup is frozen; otherwise, 0. cgroup.max.descendants A read-write single value files. The default is "max". @@ -897,6 +899,31 @@ All cgroup core files are prefixed with "cgroup." A dying cgroup can consume system resources not exceeding limits, which were active at the moment of cgroup deletion. + cgroup.freeze + A read-write single value file which exists on non-root cgroups. + Allowed values are "0" and "1". The default is "0". + + Writing "1" to the file causes freezing of the cgroup and all + descendant cgroups. This means that all belonging processes will + be stopped and will not run until the cgroup will be explicitly + unfrozen. Freezing of the cgroup may take some time; when this action + is completed, the "frozen" value in the cgroup.events control file + will be updated to "1" and the corresponding notification will be + issued. + + A cgroup can be frozen either by its own settings, or by settings + of any ancestor cgroups. If any of ancestor cgroups is frozen, the + cgroup will remain frozen. + + Processes in the frozen cgroup can be killed by a fatal signal. + They also can enter and leave a frozen cgroup: either by an explicit + move by a user, or if freezing of the cgroup races with fork(). + If a process is moved to a frozen cgroup, it stops. If a process is + moved out of a frozen cgroup, it becomes running. + + Frozen status of a cgroup doesn't affect any cgroup tree operations: + it's possible to delete a frozen (and empty) cgroup, as well as + create new sub-cgroups. Controllers =========== diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index e506d3dae510..059ddcbe769d 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -91,10 +91,48 @@ Currently Available * large block (up to pagesize) support * efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force the ordering) +* Case-insensitive file name lookups [1] Filesystems with a block size of 1k may see a limit imposed by the directory hash tree having a maximum depth of two. +case-insensitive file name lookups +====================================================== + +The case-insensitive file name lookup feature is supported on a +per-directory basis, allowing the user to mix case-insensitive and +case-sensitive directories in the same filesystem. It is enabled by +flipping the +F inode attribute of an empty directory. The +case-insensitive string match operation is only defined when we know how +text in encoded in a byte sequence. For that reason, in order to enable +case-insensitive directories, the filesystem must have the +casefold feature, which stores the filesystem-wide encoding +model used. By default, the charset adopted is the latest version of +Unicode (12.1.0, by the time of this writing), encoded in the UTF-8 +form. The comparison algorithm is implemented by normalizing the +strings to the Canonical decomposition form, as defined by Unicode, +followed by a byte per byte comparison. + +The case-awareness is name-preserving on the disk, meaning that the file +name provided by userspace is a byte-per-byte match to what is actually +written in the disk. The Unicode normalization format used by the +kernel is thus an internal representation, and not exposed to the +userspace nor to the disk, with the important exception of disk hashes, +used on large case-insensitive directories with DX feature. On DX +directories, the hash must be calculated using the casefolded version of +the filename, meaning that the normalization format used actually has an +impact on where the directory entry is stored. + +When we change from viewing filenames as opaque byte sequences to seeing +them as encoded strings we need to address what happens when a program +tries to create a file with an invalid name. The Unicode subsystem +within the kernel leaves the decision of what to do in this case to the +filesystem, which select its preferred behavior by enabling/disabling +the strict mode. When Ext4 encounters one of those strings and the +filesystem did not require strict mode, it falls back to considering the +entire string as an opaque byte sequence, which still allows the user to +operate on that file, but the case-insensitive lookups won't work. + Options ======= diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst new file mode 100644 index 000000000000..ffc064c1ec68 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -0,0 +1,13 @@ +======================== +Hardware vulnerabilities +======================== + +This section describes CPU vulnerabilities and provides an overview of the +possible mitigations along with guidance for selecting mitigations if they +are configurable at compile, boot or run time. + +.. toctree:: + :maxdepth: 1 + + l1tf + mds diff --git a/Documentation/admin-guide/l1tf.rst b/Documentation/admin-guide/hw-vuln/l1tf.rst index 9af977384168..31653a9f0e1b 100644 --- a/Documentation/admin-guide/l1tf.rst +++ b/Documentation/admin-guide/hw-vuln/l1tf.rst @@ -445,6 +445,7 @@ The default is 'cond'. If 'l1tf=full,force' is given on the kernel command line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush module parameter is ignored and writes to the sysfs file are rejected. +.. _mitigation_selection: Mitigation selection guide -------------------------- diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst new file mode 100644 index 000000000000..e3a796c0d3a2 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/mds.rst @@ -0,0 +1,308 @@ +MDS - Microarchitectural Data Sampling +====================================== + +Microarchitectural Data Sampling is a hardware vulnerability which allows +unprivileged speculative access to data which is available in various CPU +internal buffers. + +Affected processors +------------------- + +This vulnerability affects a wide range of Intel processors. The +vulnerability is not present on: + + - Processors from AMD, Centaur and other non Intel vendors + + - Older processor models, where the CPU family is < 6 + + - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus) + + - Intel processors which have the ARCH_CAP_MDS_NO bit set in the + IA32_ARCH_CAPABILITIES MSR. + +Whether a processor is affected or not can be read out from the MDS +vulnerability file in sysfs. See :ref:`mds_sys_info`. + +Not all processors are affected by all variants of MDS, but the mitigation +is identical for all of them so the kernel treats them as a single +vulnerability. + +Related CVEs +------------ + +The following CVE entries are related to the MDS vulnerability: + + ============== ===== =================================================== + CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling + CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling + CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling + CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory + ============== ===== =================================================== + +Problem +------- + +When performing store, load, L1 refill operations, processors write data +into temporary microarchitectural structures (buffers). The data in the +buffer can be forwarded to load operations as an optimization. + +Under certain conditions, usually a fault/assist caused by a load +operation, data unrelated to the load memory address can be speculatively +forwarded from the buffers. Because the load operation causes a fault or +assist and its result will be discarded, the forwarded data will not cause +incorrect program execution or state changes. But a malicious operation +may be able to forward this speculative data to a disclosure gadget which +allows in turn to infer the value via a cache side channel attack. + +Because the buffers are potentially shared between Hyper-Threads cross +Hyper-Thread attacks are possible. + +Deeper technical information is available in the MDS specific x86 +architecture section: :ref:`Documentation/x86/mds.rst <mds>`. + + +Attack scenarios +---------------- + +Attacks against the MDS vulnerabilities can be mounted from malicious non +priviledged user space applications running on hosts or guest. Malicious +guest OSes can obviously mount attacks as well. + +Contrary to other speculation based vulnerabilities the MDS vulnerability +does not allow the attacker to control the memory target address. As a +consequence the attacks are purely sampling based, but as demonstrated with +the TLBleed attack samples can be postprocessed successfully. + +Web-Browsers +^^^^^^^^^^^^ + + It's unclear whether attacks through Web-Browsers are possible at + all. The exploitation through Java-Script is considered very unlikely, + but other widely used web technologies like Webassembly could possibly be + abused. + + +.. _mds_sys_info: + +MDS system information +----------------------- + +The Linux kernel provides a sysfs interface to enumerate the current MDS +status of the system: whether the system is vulnerable, and which +mitigations are active. The relevant sysfs file is: + +/sys/devices/system/cpu/vulnerabilities/mds + +The possible values in this file are: + + .. list-table:: + + * - 'Not affected' + - The processor is not vulnerable + * - 'Vulnerable' + - The processor is vulnerable, but no mitigation enabled + * - 'Vulnerable: Clear CPU buffers attempted, no microcode' + - The processor is vulnerable but microcode is not updated. + + The mitigation is enabled on a best effort basis. See :ref:`vmwerv` + * - 'Mitigation: Clear CPU buffers' + - The processor is vulnerable and the CPU buffer clearing mitigation is + enabled. + +If the processor is vulnerable then the following information is appended +to the above information: + + ======================== ============================================ + 'SMT vulnerable' SMT is enabled + 'SMT mitigated' SMT is enabled and mitigated + 'SMT disabled' SMT is disabled + 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown + ======================== ============================================ + +.. _vmwerv: + +Best effort mitigation mode +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + If the processor is vulnerable, but the availability of the microcode based + mitigation mechanism is not advertised via CPUID the kernel selects a best + effort mitigation mode. This mode invokes the mitigation instructions + without a guarantee that they clear the CPU buffers. + + This is done to address virtualization scenarios where the host has the + microcode update applied, but the hypervisor is not yet updated to expose + the CPUID to the guest. If the host has updated microcode the protection + takes effect otherwise a few cpu cycles are wasted pointlessly. + + The state in the mds sysfs file reflects this situation accordingly. + + +Mitigation mechanism +------------------------- + +The kernel detects the affected CPUs and the presence of the microcode +which is required. + +If a CPU is affected and the microcode is available, then the kernel +enables the mitigation by default. The mitigation can be controlled at boot +time via a kernel command line option. See +:ref:`mds_mitigation_control_command_line`. + +.. _cpu_buffer_clear: + +CPU buffer clearing +^^^^^^^^^^^^^^^^^^^ + + The mitigation for MDS clears the affected CPU buffers on return to user + space and when entering a guest. + + If SMT is enabled it also clears the buffers on idle entry when the CPU + is only affected by MSBDS and not any other MDS variant, because the + other variants cannot be protected against cross Hyper-Thread attacks. + + For CPUs which are only affected by MSBDS the user space, guest and idle + transition mitigations are sufficient and SMT is not affected. + +.. _virt_mechanism: + +Virtualization mitigation +^^^^^^^^^^^^^^^^^^^^^^^^^ + + The protection for host to guest transition depends on the L1TF + vulnerability of the CPU: + + - CPU is affected by L1TF: + + If the L1D flush mitigation is enabled and up to date microcode is + available, the L1D flush mitigation is automatically protecting the + guest transition. + + If the L1D flush mitigation is disabled then the MDS mitigation is + invoked explicit when the host MDS mitigation is enabled. + + For details on L1TF and virtualization see: + :ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <mitigation_control_kvm>`. + + - CPU is not affected by L1TF: + + CPU buffers are flushed before entering the guest when the host MDS + mitigation is enabled. + + The resulting MDS protection matrix for the host to guest transition: + + ============ ===== ============= ============ ================= + L1TF MDS VMX-L1FLUSH Host MDS MDS-State + + Don't care No Don't care N/A Not affected + + Yes Yes Disabled Off Vulnerable + + Yes Yes Disabled Full Mitigated + + Yes Yes Enabled Don't care Mitigated + + No Yes N/A Off Vulnerable + + No Yes N/A Full Mitigated + ============ ===== ============= ============ ================= + + This only covers the host to guest transition, i.e. prevents leakage from + host to guest, but does not protect the guest internally. Guests need to + have their own protections. + +.. _xeon_phi: + +XEON PHI specific considerations +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The XEON PHI processor family is affected by MSBDS which can be exploited + cross Hyper-Threads when entering idle states. Some XEON PHI variants allow + to use MWAIT in user space (Ring 3) which opens an potential attack vector + for malicious user space. The exposure can be disabled on the kernel + command line with the 'ring3mwait=disable' command line option. + + XEON PHI is not affected by the other MDS variants and MSBDS is mitigated + before the CPU enters a idle state. As XEON PHI is not affected by L1TF + either disabling SMT is not required for full protection. + +.. _mds_smt_control: + +SMT control +^^^^^^^^^^^ + + All MDS variants except MSBDS can be attacked cross Hyper-Threads. That + means on CPUs which are affected by MFBDS or MLPDS it is necessary to + disable SMT for full protection. These are most of the affected CPUs; the + exception is XEON PHI, see :ref:`xeon_phi`. + + Disabling SMT can have a significant performance impact, but the impact + depends on the type of workloads. + + See the relevant chapter in the L1TF mitigation documentation for details: + :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`. + + +.. _mds_mitigation_control_command_line: + +Mitigation control on the kernel command line +--------------------------------------------- + +The kernel command line allows to control the MDS mitigations at boot +time with the option "mds=". The valid arguments for this option are: + + ============ ============================================================= + full If the CPU is vulnerable, enable all available mitigations + for the MDS vulnerability, CPU buffer clearing on exit to + userspace and when entering a VM. Idle transitions are + protected as well if SMT is enabled. + + It does not automatically disable SMT. + + full,nosmt The same as mds=full, with SMT disabled on vulnerable + CPUs. This is the complete mitigation. + + off Disables MDS mitigations completely. + + ============ ============================================================= + +Not specifying this option is equivalent to "mds=full". + + +Mitigation selection guide +-------------------------- + +1. Trusted userspace +^^^^^^^^^^^^^^^^^^^^ + + If all userspace applications are from a trusted source and do not + execute untrusted code which is supplied externally, then the mitigation + can be disabled. + + +2. Virtualization with trusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The same considerations as above versus trusted user space apply. + +3. Virtualization with untrusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The protection depends on the state of the L1TF mitigations. + See :ref:`virt_mechanism`. + + If the MDS mitigation is enabled and SMT is disabled, guest to host and + guest to guest attacks are prevented. + +.. _mds_default_mitigations: + +Default mitigations +------------------- + + The kernel default mitigations for vulnerable processors are: + + - Enable CPU buffer clearing + + The kernel does not by default enforce the disabling of SMT, which leaves + SMT systems vulnerable when running untrusted code. The same rationale as + for L1TF applies. + See :ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <default_mitigations>`. diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 0a491676685e..8001917ee012 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -17,14 +17,12 @@ etc. kernel-parameters devices -This section describes CPU vulnerabilities and provides an overview of the -possible mitigations along with guidance for selecting mitigations if they -are configurable at compile, boot or run time. +This section describes CPU vulnerabilities and their mitigations. .. toctree:: :maxdepth: 1 - l1tf + hw-vuln/index Here is a set of documents aimed at users who are trying to track down problems and bugs in particular. @@ -77,6 +75,7 @@ configure specific aspects of kernel behavior to your liking. LSM/index mm/index perf-security + acpi/index .. only:: subproject and html diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index b8d0bc07ed0a..0124980dca2d 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -88,6 +88,7 @@ parameter is applicable:: APIC APIC support is enabled. APM Advanced Power Management support is enabled. ARM ARM architecture is enabled. + ARM64 ARM64 architecture is enabled. AX25 Appropriate AX.25 support is enabled. CLK Common clock infrastructure is enabled. CMA Contiguous Memory Area support is enabled. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 2b8ee90bb644..52e6fbb042cc 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -704,8 +704,11 @@ upon panic. This parameter reserves the physical memory region [offset, offset + size] for that kernel image. If '@offset' is omitted, then a suitable offset - is selected automatically. Check - Documentation/kdump/kdump.txt for further details. + is selected automatically. + [KNL, x86_64] select a region under 4G first, and + fall back to reserve region above 4G when '@offset' + hasn't been specified. + See Documentation/kdump/kdump.txt for further details. crashkernel=range1:size1[,range2:size2,...][@offset] [KNL] Same as above, but depends on the memory @@ -1585,7 +1588,7 @@ Format: { "off" | "enforce" | "fix" | "log" } default: "enforce" - ima_appraise_tcb [IMA] + ima_appraise_tcb [IMA] Deprecated. Use ima_policy= instead. The builtin appraise policy appraises all files owned by uid=0. @@ -1612,8 +1615,7 @@ uid=0. The "appraise_tcb" policy appraises the integrity of - all files owned by root. (This is the equivalent - of ima_appraise_tcb.) + all files owned by root. The "secure_boot" policy appraises the integrity of files (eg. kexec kernel image, kernel modules, @@ -1828,6 +1830,9 @@ ip= [IP_PNP] See Documentation/filesystems/nfs/nfsroot.txt. + ipcmni_extend [KNL] Extend the maximum number of unique System V + IPC identifiers from 32,768 to 16,777,216. + irqaffinity= [SMP] Set the default irq affinity mask The argument is a cpu list, as described above. @@ -2141,7 +2146,7 @@ Default is 'flush'. - For details see: Documentation/admin-guide/l1tf.rst + For details see: Documentation/admin-guide/hw-vuln/l1tf.rst l2cr= [PPC] @@ -2387,6 +2392,32 @@ Format: <first>,<last> Specifies range of consoles to be captured by the MDA. + mds= [X86,INTEL] + Control mitigation for the Micro-architectural Data + Sampling (MDS) vulnerability. + + Certain CPUs are vulnerable to an exploit against CPU + internal buffers which can forward information to a + disclosure gadget under certain conditions. + + In vulnerable processors, the speculatively + forwarded data can be used in a cache side channel + attack, to access data to which the attacker does + not have direct access. + + This parameter controls the MDS mitigation. The + options are: + + full - Enable MDS mitigation on vulnerable CPUs + full,nosmt - Enable MDS mitigation and disable + SMT on vulnerable CPUs + off - Unconditionally disable MDS mitigation + + Not specifying this option is equivalent to + mds=full. + + For details see: Documentation/admin-guide/hw-vuln/mds.rst + mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory Amount of memory to be used when the kernel is not able to see the whole system memory or for test. @@ -2544,6 +2575,42 @@ in the "bleeding edge" mini2440 support kernel at http://repo.or.cz/w/linux-2.6/mini2440.git + mitigations= + [X86,PPC,S390,ARM64] Control optional mitigations for + CPU vulnerabilities. This is a set of curated, + arch-independent options, each of which is an + aggregation of existing arch-specific options. + + off + Disable all optional CPU mitigations. This + improves system performance, but it may also + expose users to several CPU vulnerabilities. + Equivalent to: nopti [X86,PPC] + kpti=0 [ARM64] + nospectre_v1 [PPC] + nobp=0 [S390] + nospectre_v2 [X86,PPC,S390,ARM64] + spectre_v2_user=off [X86] + spec_store_bypass_disable=off [X86,PPC] + ssbd=force-off [ARM64] + l1tf=off [X86] + mds=off [X86] + + auto (default) + Mitigate all CPU vulnerabilities, but leave SMT + enabled, even if it's vulnerable. This is for + users who don't want to be surprised by SMT + getting disabled across kernel upgrades, or who + have other ways of avoiding SMT-based attacks. + Equivalent to: (default behavior) + + auto,nosmt + Mitigate all CPU vulnerabilities, disabling SMT + if needed. This is for users who always want to + be fully mitigated, even if it means losing SMT. + Equivalent to: l1tf=flush,nosmt [X86] + mds=full,nosmt [X86] + mminit_loglevel= [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this parameter allows control of the logging verbosity for @@ -2839,11 +2906,11 @@ noexec=on: enable non-executable mappings (default) noexec=off: disable non-executable mappings - nosmap [X86] + nosmap [X86,PPC] Disable SMAP (Supervisor Mode Access Prevention) even if it is supported by processor. - nosmep [X86] + nosmep [X86,PPC] Disable SMEP (Supervisor Mode Execution Prevention) even if it is supported by processor. @@ -2873,10 +2940,10 @@ check bypass). With this option data leaks are possible in the system. - nospectre_v2 [X86,PPC_FSL_BOOK3E] Disable all mitigations for the Spectre variant 2 - (indirect branch prediction) vulnerability. System may - allow data leaks with this option, which is equivalent - to spectre_v2=off. + nospectre_v2 [X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for + the Spectre variant 2 (indirect branch prediction) + vulnerability. System may allow data leaks with this + option. nospec_store_bypass_disable [HW] Disable all mitigations for the Speculative Store Bypass vulnerability @@ -3110,6 +3177,16 @@ This will also cause panics on machine check exceptions. Useful together with panic=30 to trigger a reboot. + page_alloc.shuffle= + [KNL] Boolean flag to control whether the page allocator + should randomize its free lists. The randomization may + be automatically enabled if the kernel detects it is + running on a platform with a direct-mapped memory-side + cache, and this parameter can be used to + override/disable that behavior. The state of the flag + can be read from sysfs at: + /sys/module/page_alloc/parameters/shuffle. + page_owner= [KNL] Boot-time page_owner enabling option. Storage of the information about who allocated each page is disabled in default. With this switch, @@ -3394,6 +3471,8 @@ bridges without forcing it upstream. Note: this removes isolation between devices and may put more devices in an IOMMU group. + force_floating [S390] Force usage of floating interrupts. + nomio [S390] Do not use MIO instructions. pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power Management. @@ -3623,7 +3702,9 @@ see CONFIG_RAS_CEC help text. rcu_nocbs= [KNL] - The argument is a cpu list, as described above. + The argument is a cpu list, as described above, + except that the string "all" can be used to + specify every CPU on the system. In kernels built with CONFIG_RCU_NOCB_CPU=y, set the specified list of CPUs to be no-callback CPUs. @@ -3986,7 +4067,9 @@ [[,]s[mp]#### \ [[,]b[ios] | a[cpi] | k[bd] | t[riple] | e[fi] | p[ci]] \ [[,]f[orce] - Where reboot_mode is one of warm (soft) or cold (hard) or gpio, + Where reboot_mode is one of warm (soft) or cold (hard) or gpio + (prefix with 'panic_' to set mode for panic + reboot only), reboot_type is one of bios, acpi, kbd, triple, efi, or pci, reboot_force is either force or not specified, reboot_cpu is s[mp]#### with #### being the processor @@ -4703,6 +4786,10 @@ [x86] unstable: mark the TSC clocksource as unstable, this marks the TSC unconditionally unstable at bootup and avoids any further wobbles once the TSC watchdog notices. + [x86] nowatchdog: disable clocksource watchdog. Used + in situations with strict latency requirements (where + interruptions from clocksource watchdog are not + acceptable). turbografx.map[2|3]= [HW,JOY] TurboGraFX parallel port interface @@ -5173,6 +5260,13 @@ with /sys/devices/system/xen_memory/xen_memory0/scrub_pages. Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT. + xen_timer_slop= [X86-64,XEN] + Set the timer slop (in nanoseconds) for the virtual Xen + timers (default is 100000). This adjusts the minimum + delta of virtualized Xen timers, where lower values + improve timer resolution at the expense of processing + more timer interrupts. + xirc2ps_cs= [NET,PCMCIA] Format: <irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]] diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst new file mode 100644 index 000000000000..b79f70c04397 --- /dev/null +++ b/Documentation/admin-guide/mm/numaperf.rst @@ -0,0 +1,169 @@ +.. _numaperf: + +============= +NUMA Locality +============= + +Some platforms may have multiple types of memory attached to a compute +node. These disparate memory ranges may share some characteristics, such +as CPU cache coherence, but may have different performance. For example, +different media types and buses affect bandwidth and latency. + +A system supports such heterogeneous memory by grouping each memory type +under different domains, or "nodes", based on locality and performance +characteristics. Some memory may share the same node as a CPU, and others +are provided as memory only nodes. While memory only nodes do not provide +CPUs, they may still be local to one or more compute nodes relative to +other nodes. The following diagram shows one such example of two compute +nodes with local memory and a memory only node for each of compute node: + + +------------------+ +------------------+ + | Compute Node 0 +-----+ Compute Node 1 | + | Local Node0 Mem | | Local Node1 Mem | + +--------+---------+ +--------+---------+ + | | + +--------+---------+ +--------+---------+ + | Slower Node2 Mem | | Slower Node3 Mem | + +------------------+ +--------+---------+ + +A "memory initiator" is a node containing one or more devices such as +CPUs or separate memory I/O devices that can initiate memory requests. +A "memory target" is a node containing one or more physical address +ranges accessible from one or more memory initiators. + +When multiple memory initiators exist, they may not all have the same +performance when accessing a given memory target. Each initiator-target +pair may be organized into different ranked access classes to represent +this relationship. The highest performing initiator to a given target +is considered to be one of that target's local initiators, and given +the highest access class, 0. Any given target may have one or more +local initiators, and any given initiator may have multiple local +memory targets. + +To aid applications matching memory targets with their initiators, the +kernel provides symlinks to each other. The following example lists the +relationship for the access class "0" memory initiators and targets:: + + # symlinks -v /sys/devices/system/node/nodeX/access0/targets/ + relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY + + # symlinks -v /sys/devices/system/node/nodeY/access0/initiators/ + relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX + +A memory initiator may have multiple memory targets in the same access +class. The target memory's initiators in a given class indicate the +nodes' access characteristics share the same performance relative to other +linked initiator nodes. Each target within an initiator's access class, +though, do not necessarily perform the same as each other. + +================ +NUMA Performance +================ + +Applications may wish to consider which node they want their memory to +be allocated from based on the node's performance characteristics. If +the system provides these attributes, the kernel exports them under the +node sysfs hierarchy by appending the attributes directory under the +memory node's access class 0 initiators as follows:: + + /sys/devices/system/node/nodeY/access0/initiators/ + +These attributes apply only when accessed from nodes that have the +are linked under the this access's inititiators. + +The performance characteristics the kernel provides for the local initiators +are exported are as follows:: + + # tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/ + /sys/devices/system/node/nodeY/access0/initiators/ + |-- read_bandwidth + |-- read_latency + |-- write_bandwidth + `-- write_latency + +The bandwidth attributes are provided in MiB/second. + +The latency attributes are provided in nanoseconds. + +The values reported here correspond to the rated latency and bandwidth +for the platform. + +========== +NUMA Cache +========== + +System memory may be constructed in a hierarchy of elements with various +performance characteristics in order to provide large address space of +slower performing memory cached by a smaller higher performing memory. The +system physical addresses memory initiators are aware of are provided +by the last memory level in the hierarchy. The system meanwhile uses +higher performing memory to transparently cache access to progressively +slower levels. + +The term "far memory" is used to denote the last level memory in the +hierarchy. Each increasing cache level provides higher performing +initiator access, and the term "near memory" represents the fastest +cache provided by the system. + +This numbering is different than CPU caches where the cache level (ex: +L1, L2, L3) uses the CPU-side view where each increased level is lower +performing. In contrast, the memory cache level is centric to the last +level memory, so the higher numbered cache level corresponds to memory +nearer to the CPU, and further from far memory. + +The memory-side caches are not directly addressable by software. When +software accesses a system address, the system will return it from the +near memory cache if it is present. If it is not present, the system +accesses the next level of memory until there is either a hit in that +cache level, or it reaches far memory. + +An application does not need to know about caching attributes in order +to use the system. Software may optionally query the memory cache +attributes in order to maximize the performance out of such a setup. +If the system provides a way for the kernel to discover this information, +for example with ACPI HMAT (Heterogeneous Memory Attribute Table), +the kernel will append these attributes to the NUMA node memory target. + +When the kernel first registers a memory cache with a node, the kernel +will create the following directory:: + + /sys/devices/system/node/nodeX/memory_side_cache/ + +If that directory is not present, the system either does not not provide +a memory-side cache, or that information is not accessible to the kernel. + +The attributes for each level of cache is provided under its cache +level index:: + + /sys/devices/system/node/nodeX/memory_side_cache/indexA/ + /sys/devices/system/node/nodeX/memory_side_cache/indexB/ + /sys/devices/system/node/nodeX/memory_side_cache/indexC/ + +Each cache level's directory provides its attributes. For example, the +following shows a single cache level and the attributes available for +software to query:: + + # tree sys/devices/system/node/node0/memory_side_cache/ + /sys/devices/system/node/node0/memory_side_cache/ + |-- index1 + | |-- indexing + | |-- line_size + | |-- size + | `-- write_policy + +The "indexing" will be 0 if it is a direct-mapped cache, and non-zero +for any other indexed based, multi-way associativity. + +The "line_size" is the number of bytes accessed from the next cache +level on a miss. + +The "size" is the number of bytes provided by this cache level. + +The "write_policy" will be 0 for write-back, and non-zero for +write-through caching. + +======== +See Also +======== +.. [1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf + Section 5.2.27 diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index 7eca9026a9ed..0c74a7784964 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + .. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>` .. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>` @@ -5,9 +8,10 @@ CPU Performance Scaling ======================= -:: +:Copyright: |copy| 2017 Intel Corporation + +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> - Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> The Concept of CPU Performance Scaling ====================================== @@ -396,8 +400,8 @@ RT or deadline scheduling classes, the governor will increase the frequency to the allowed maximum (that is, the ``scaling_max_freq`` policy limit). In turn, if it is invoked by the CFS scheduling class, the governor will use the Per-Entity Load Tracking (PELT) metric for the root control group of the -given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_ -LWN.net article for a description of the PELT mechanism). Then, the new +given CPU as the CPU utilization estimate (see the *Per-entity load tracking* +LWN.net article [1]_ for a description of the PELT mechanism). Then, the new CPU frequency to apply is computed in accordance with the formula f = 1.25 * ``f_0`` * ``util`` / ``max`` @@ -698,4 +702,8 @@ hardware feature (e.g. all Intel ones), even if the :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set. -.. _Per-entity load tracking: https://lwn.net/Articles/531853/ +References +========== + +.. [1] Jonathan Corbet, *Per-entity load tracking*, + https://lwn.net/Articles/531853/ diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index 9c58b35a81cb..e70b365dbc60 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + .. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>` .. |cpufreq| replace:: :doc:`CPU Performance Scaling <cpufreq>` @@ -5,9 +8,10 @@ CPU Idle Time Management ======================== -:: +:Copyright: |copy| 2018 Intel Corporation + +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> - Copyright (c) 2018 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> Concepts ======== diff --git a/Documentation/admin-guide/pm/index.rst b/Documentation/admin-guide/pm/index.rst index 49237ac73442..39f8f9f81e7a 100644 --- a/Documentation/admin-guide/pm/index.rst +++ b/Documentation/admin-guide/pm/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ================ Power Management ================ diff --git a/Documentation/admin-guide/pm/intel_epb.rst b/Documentation/admin-guide/pm/intel_epb.rst new file mode 100644 index 000000000000..005121167af7 --- /dev/null +++ b/Documentation/admin-guide/pm/intel_epb.rst @@ -0,0 +1,41 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +====================================== +Intel Performance and Energy Bias Hint +====================================== + +:Copyright: |copy| 2019 Intel Corporation + +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + + +.. kernel-doc:: arch/x86/kernel/cpu/intel_epb.c + :doc: overview + +Intel Performance and Energy Bias Attribute in ``sysfs`` +======================================================== + +The Intel Performance and Energy Bias Hint (EPB) value for a given (logical) CPU +can be checked or updated through a ``sysfs`` attribute (file) under +:file:`/sys/devices/system/cpu/cpu<N>/power/`, where the CPU number ``<N>`` +is allocated at the system initialization time: + +``energy_perf_bias`` + Shows the current EPB value for the CPU in a sliding scale 0 - 15, where + a value of 0 corresponds to a hint preference for highest performance + and a value of 15 corresponds to the maximum energy savings. + + In order to update the EPB value for the CPU, this attribute can be + written to, either with a number in the 0 - 15 sliding scale above, or + with one of the strings: "performance", "balance-performance", "normal", + "balance-power", "power" that represent values reflected by their + meaning. + + This attribute is present for all online CPUs supporting the EPB + feature. + +Note that while the EPB interface to the processor is defined at the logical CPU +level, the physical register backing it may be shared by multiple CPUs (for +example, SMT siblings or cores in one package). For this reason, updating the +EPB value for one CPU may cause the EPB values for other CPUs to change. diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index ec0f7c111f65..67e414e34f37 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -1,10 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + =============================================== ``intel_pstate`` CPU Performance Scaling Driver =============================================== -:: +:Copyright: |copy| 2017 Intel Corporation - Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> General Information @@ -20,11 +23,10 @@ you have not done that yet.] For the processors supported by ``intel_pstate``, the P-state concept is broader than just an operating frequency or an operating performance point (see the -`LinuxCon Europe 2015 presentation by Kristen Accardi <LCEU2015_>`_ for more +LinuxCon Europe 2015 presentation by Kristen Accardi [1]_ for more information about that). For this reason, the representation of P-states used by ``intel_pstate`` internally follows the hardware specification (for details -refer to `Intel® 64 and IA-32 Architectures Software Developer’s Manual -Volume 3: System Programming Guide <SDM_>`_). However, the ``CPUFreq`` core +refer to Intel Software Developer’s Manual [2]_). However, the ``CPUFreq`` core uses frequencies for identifying operating performance points of CPUs and frequencies are involved in the user space interface exposed by it, so ``intel_pstate`` maps its internal representation of P-states to frequencies too @@ -561,9 +563,9 @@ or to pin every task potentially sensitive to them to a specific CPU.] On the majority of systems supported by ``intel_pstate``, the ACPI tables provided by the platform firmware contain ``_PSS`` objects returning information -that can be used for CPU performance scaling (refer to the `ACPI specification`_ -for details on the ``_PSS`` objects and the format of the information returned -by them). +that can be used for CPU performance scaling (refer to the ACPI specification +[3]_ for details on the ``_PSS`` objects and the format of the information +returned by them). The information returned by the ACPI ``_PSS`` objects is used by the ``acpi-cpufreq`` scaling driver. On systems supported by ``intel_pstate`` @@ -728,6 +730,14 @@ P-state is called, the ``ftrace`` filter can be set to to <idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func -.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf -.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html -.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf +References +========== + +.. [1] Kristen Accardi, *Balancing Power and Performance in the Linux Kernel*, + http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf + +.. [2] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide*, + http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html + +.. [3] *Advanced Configuration and Power Interface Specification*, + https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf diff --git a/Documentation/admin-guide/pm/sleep-states.rst b/Documentation/admin-guide/pm/sleep-states.rst index dbf5acd49f35..cd3a28cb81f4 100644 --- a/Documentation/admin-guide/pm/sleep-states.rst +++ b/Documentation/admin-guide/pm/sleep-states.rst @@ -1,10 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + =================== System Sleep States =================== -:: +:Copyright: |copy| 2017 Intel Corporation + +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> - Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> Sleep states are global low-power states of the entire system in which user space code cannot be executed and the overall system activity is significantly diff --git a/Documentation/admin-guide/pm/strategies.rst b/Documentation/admin-guide/pm/strategies.rst index afe4d3f831fe..dd0362e32fa5 100644 --- a/Documentation/admin-guide/pm/strategies.rst +++ b/Documentation/admin-guide/pm/strategies.rst @@ -1,10 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + =========================== Power Management Strategies =========================== -:: +:Copyright: |copy| 2017 Intel Corporation + +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> - Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> The Linux kernel supports two major high-level power management strategies. diff --git a/Documentation/admin-guide/pm/system-wide.rst b/Documentation/admin-guide/pm/system-wide.rst index 0c81e4c5de39..2b1f987b34f0 100644 --- a/Documentation/admin-guide/pm/system-wide.rst +++ b/Documentation/admin-guide/pm/system-wide.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ============================ System-Wide Power Management ============================ diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst index b6cef9b5e961..fc298eb1234b 100644 --- a/Documentation/admin-guide/pm/working-state.rst +++ b/Documentation/admin-guide/pm/working-state.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ============================== Working-State Power Management ============================== @@ -8,3 +10,4 @@ Working-State Power Management cpuidle cpufreq intel_pstate + intel_epb |