summaryrefslogtreecommitdiff
path: root/Documentation/powerpc
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/powerpc')
-rw-r--r--Documentation/powerpc/bootwrapper.rst (renamed from Documentation/powerpc/bootwrapper.txt)28
-rw-r--r--Documentation/powerpc/cpu_families.rst (renamed from Documentation/powerpc/cpu_families.txt)23
-rw-r--r--Documentation/powerpc/cpu_features.rst (renamed from Documentation/powerpc/cpu_features.txt)6
-rw-r--r--Documentation/powerpc/cxl.rst (renamed from Documentation/powerpc/cxl.txt)46
-rw-r--r--Documentation/powerpc/cxlflash.rst (renamed from Documentation/powerpc/cxlflash.txt)10
-rw-r--r--Documentation/powerpc/dawr-power9.rst (renamed from Documentation/powerpc/DAWR-POWER9.txt)15
-rw-r--r--Documentation/powerpc/dscr.rst (renamed from Documentation/powerpc/dscr.txt)18
-rw-r--r--Documentation/powerpc/eeh-pci-error-recovery.rst (renamed from Documentation/powerpc/eeh-pci-error-recovery.txt)108
-rw-r--r--Documentation/powerpc/elfnote.rst41
-rw-r--r--Documentation/powerpc/firmware-assisted-dump.rst361
-rw-r--r--Documentation/powerpc/firmware-assisted-dump.txt292
-rw-r--r--Documentation/powerpc/hvcs.rst (renamed from Documentation/powerpc/hvcs.txt)108
-rw-r--r--Documentation/powerpc/index.rst36
-rw-r--r--Documentation/powerpc/isa-versions.rst13
-rw-r--r--Documentation/powerpc/mpc52xx.rst (renamed from Documentation/powerpc/mpc52xx.txt)12
-rw-r--r--Documentation/powerpc/pci_iov_resource_on_powernv.rst (renamed from Documentation/powerpc/pci_iov_resource_on_powernv.txt)15
-rw-r--r--Documentation/powerpc/pmu-ebb.rst (renamed from Documentation/powerpc/pmu-ebb.txt)1
-rw-r--r--Documentation/powerpc/ptrace.rst156
-rw-r--r--Documentation/powerpc/ptrace.txt151
-rw-r--r--Documentation/powerpc/qe_firmware.rst (renamed from Documentation/powerpc/qe_firmware.txt)37
-rw-r--r--Documentation/powerpc/syscall64-abi.rst (renamed from Documentation/powerpc/syscall64-abi.txt)29
-rw-r--r--Documentation/powerpc/transactional_memory.rst (renamed from Documentation/powerpc/transactional_memory.txt)45
-rw-r--r--Documentation/powerpc/ultravisor.rst1054
-rw-r--r--Documentation/powerpc/vcpudispatch_stats.txt68
24 files changed, 2018 insertions, 655 deletions
diff --git a/Documentation/powerpc/bootwrapper.txt b/Documentation/powerpc/bootwrapper.rst
index d60fced5e1cc..a6292afba573 100644
--- a/Documentation/powerpc/bootwrapper.txt
+++ b/Documentation/powerpc/bootwrapper.rst
@@ -1,5 +1,7 @@
+========================
The PowerPC boot wrapper
-------------------------
+========================
+
Copyright (C) Secret Lab Technologies Ltd.
PowerPC image targets compresses and wraps the kernel image (vmlinux) with
@@ -21,6 +23,7 @@ it uses the wrapper script (arch/powerpc/boot/wrapper) to generate target
image. The details of the build system is discussed in the next section.
Currently, the following image format targets exist:
+ ==================== ========================================================
cuImage.%: Backwards compatible uImage for older version of
U-Boot (for versions that don't understand the device
tree). This image embeds a device tree blob inside
@@ -29,31 +32,36 @@ Currently, the following image format targets exist:
with boot wrapper code that extracts data from the old
bd_info structure and loads the data into the device
tree before jumping into the kernel.
- Because of the series of #ifdefs found in the
+
+ Because of the series of #ifdefs found in the
bd_info structure used in the old U-Boot interfaces,
cuImages are platform specific. Each specific
U-Boot platform has a different platform init file
which populates the embedded device tree with data
from the platform specific bd_info file. The platform
specific cuImage platform init code can be found in
- arch/powerpc/boot/cuboot.*.c. Selection of the correct
+ `arch/powerpc/boot/cuboot.*.c`. Selection of the correct
cuImage init code for a specific board can be found in
the wrapper structure.
+
dtbImage.%: Similar to zImage, except device tree blob is embedded
inside the image instead of provided by firmware. The
output image file can be either an elf file or a flat
binary depending on the platform.
- dtbImages are used on systems which do not have an
+
+ dtbImages are used on systems which do not have an
interface for passing a device tree directly.
dtbImages are similar to simpleImages except that
dtbImages have platform specific code for extracting
data from the board firmware, but simpleImages do not
talk to the firmware at all.
- PlayStation 3 support uses dtbImage. So do Embedded
+
+ PlayStation 3 support uses dtbImage. So do Embedded
Planet boards using the PlanetCore firmware. Board
specific initialization code is typically found in a
file named arch/powerpc/boot/<platform>.c; but this
can be overridden by the wrapper script.
+
simpleImage.%: Firmware independent compressed image that does not
depend on any particular firmware interface and embeds
a device tree blob. This image is a flat binary that
@@ -61,14 +69,16 @@ Currently, the following image format targets exist:
Firmware cannot pass any configuration data to the
kernel with this image type and it depends entirely on
the embedded device tree for all information.
- The simpleImage is useful for booting systems with
+
+ The simpleImage is useful for booting systems with
an unknown firmware interface or for booting from
a debugger when no firmware is present (such as on
the Xilinx Virtex platform). The only assumption that
simpleImage makes is that RAM is correctly initialized
and that the MMU is either off or has RAM mapped to
base address 0.
- simpleImage also supports inserting special platform
+
+ simpleImage also supports inserting special platform
specific initialization code to the start of the bootup
sequence. The virtex405 platform uses this feature to
ensure that the cache is invalidated before caching
@@ -81,9 +91,11 @@ Currently, the following image format targets exist:
named (virtex405-<board>.dts). Search the wrapper
script for 'virtex405' and see the file
arch/powerpc/boot/virtex405-head.S for details.
+
treeImage.%; Image format for used with OpenBIOS firmware found
on some ppc4xx hardware. This image embeds a device
tree blob inside the image.
+
uImage: Native image format used by U-Boot. The uImage target
does not add any boot code. It just wraps a compressed
vmlinux in the uImage data structure. This image
@@ -91,12 +103,14 @@ Currently, the following image format targets exist:
a device tree to the kernel at boot. If using an older
version of U-Boot, then you need to use a cuImage
instead.
+
zImage.%: Image format which does not embed a device tree.
Used by OpenFirmware and other firmware interfaces
which are able to supply a device tree. This image
expects firmware to provide the device tree at boot.
Typically, if you have general purpose PowerPC
hardware then you want this image format.
+ ==================== ========================================================
Image types which embed a device tree blob (simpleImage, dtbImage, treeImage,
and cuImage) all generate the device tree blob from a file in the
diff --git a/Documentation/powerpc/cpu_families.txt b/Documentation/powerpc/cpu_families.rst
index fc08e22feb1a..1e063c5440c3 100644
--- a/Documentation/powerpc/cpu_families.txt
+++ b/Documentation/powerpc/cpu_families.rst
@@ -1,3 +1,4 @@
+============
CPU Families
============
@@ -8,8 +9,8 @@ and are supported by arch/powerpc.
Book3S (aka sPAPR)
------------------
- - Hash MMU
- - Mix of 32 & 64 bit
+- Hash MMU
+- Mix of 32 & 64 bit::
+--------------+ +----------------+
| Old POWER | --------------> | RS64 (threads) |
@@ -108,8 +109,8 @@ Book3S (aka sPAPR)
IBM BookE
---------
- - Software loaded TLB.
- - All 32 bit
+- Software loaded TLB.
+- All 32 bit::
+--------------+
| 401 |
@@ -155,8 +156,8 @@ IBM BookE
Motorola/Freescale 8xx
----------------------
- - Software loaded with hardware assist.
- - All 32 bit
+- Software loaded with hardware assist.
+- All 32 bit::
+-------------+
| MPC8xx Core |
@@ -166,9 +167,9 @@ Motorola/Freescale 8xx
Freescale BookE
---------------
- - Software loaded TLB.
- - e6500 adds HW loaded indirect TLB entries.
- - Mix of 32 & 64 bit
+- Software loaded TLB.
+- e6500 adds HW loaded indirect TLB entries.
+- Mix of 32 & 64 bit::
+--------------+
| e200 |
@@ -207,8 +208,8 @@ Freescale BookE
IBM A2 core
-----------
- - Book3E, software loaded TLB + HW loaded indirect TLB entries.
- - 64 bit
+- Book3E, software loaded TLB + HW loaded indirect TLB entries.
+- 64 bit::
+--------------+ +----------------+
| A2 core | --> | WSP |
diff --git a/Documentation/powerpc/cpu_features.txt b/Documentation/powerpc/cpu_features.rst
index ae09df8722c8..b7bcdd2f41bb 100644
--- a/Documentation/powerpc/cpu_features.txt
+++ b/Documentation/powerpc/cpu_features.rst
@@ -1,3 +1,7 @@
+============
+CPU Features
+============
+
Hollis Blanchard <hollis@austin.ibm.com>
5 Jun 2002
@@ -32,7 +36,7 @@ anyways).
After detecting the processor type, the kernel patches out sections of code
that shouldn't be used by writing nop's over it. Using cpufeatures requires
just 2 macros (found in arch/powerpc/include/asm/cputable.h), as seen in head.S
-transfer_to_handler:
+transfer_to_handler::
#ifdef CONFIG_ALTIVEC
BEGIN_FTR_SECTION
diff --git a/Documentation/powerpc/cxl.txt b/Documentation/powerpc/cxl.rst
index c5e8d5098ed3..920546d81326 100644
--- a/Documentation/powerpc/cxl.txt
+++ b/Documentation/powerpc/cxl.rst
@@ -1,3 +1,4 @@
+====================================
Coherent Accelerator Interface (CXL)
====================================
@@ -21,6 +22,8 @@ Introduction
Hardware overview
=================
+ ::
+
POWER8/9 FPGA
+----------+ +---------+
| | | |
@@ -59,14 +62,16 @@ Hardware overview
the fault. The context to which this fault is serviced is based on
who owns that acceleration function.
- POWER8 <-----> PSL Version 8 is compliant to the CAIA Version 1.0.
- POWER9 <-----> PSL Version 9 is compliant to the CAIA Version 2.0.
+ - POWER8 and PSL Version 8 are compliant to the CAIA Version 1.0.
+ - POWER9 and PSL Version 9 are compliant to the CAIA Version 2.0.
+
This PSL Version 9 provides new features such as:
+
* Interaction with the nest MMU on the P9 chip.
* Native DMA support.
* Supports sending ASB_Notify messages for host thread wakeup.
* Supports Atomic operations.
- * ....
+ * etc.
Cards with a PSL9 won't work on a POWER8 system and cards with a
PSL8 won't work on a POWER9 system.
@@ -147,7 +152,9 @@ User API
master devices.
A userspace library libcxl is available here:
+
https://github.com/ibm-capi/libcxl
+
This provides a C interface to this kernel API.
open
@@ -165,7 +172,8 @@ open
When all available contexts are allocated the open call will fail
and return -ENOSPC.
- Note: IRQs need to be allocated for each context, which may limit
+ Note:
+ IRQs need to be allocated for each context, which may limit
the number of contexts that can be created, and therefore
how many times the device can be opened. The POWER8 CAPP
supports 2040 IRQs and 3 are used by the kernel, so 2037 are
@@ -186,7 +194,9 @@ ioctl
updated as userspace allocates and frees memory. This ioctl
returns once the AFU context is started.
- Takes a pointer to a struct cxl_ioctl_start_work:
+ Takes a pointer to a struct cxl_ioctl_start_work
+
+ ::
struct cxl_ioctl_start_work {
__u64 flags;
@@ -269,7 +279,7 @@ read
The buffer passed to read() must be at least 4K bytes.
The result of the read will be a buffer of one or more events,
- each event is of type struct cxl_event, of varying size.
+ each event is of type struct cxl_event, of varying size::
struct cxl_event {
struct cxl_event_header header;
@@ -280,7 +290,9 @@ read
};
};
- The struct cxl_event_header is defined as:
+ The struct cxl_event_header is defined as
+
+ ::
struct cxl_event_header {
__u16 type;
@@ -307,7 +319,9 @@ read
For future extensions and padding.
If the event type is CXL_EVENT_AFU_INTERRUPT then the event
- structure is defined as:
+ structure is defined as
+
+ ::
struct cxl_event_afu_interrupt {
__u16 flags;
@@ -326,7 +340,9 @@ read
For future extensions and padding.
If the event type is CXL_EVENT_DATA_STORAGE then the event
- structure is defined as:
+ structure is defined as
+
+ ::
struct cxl_event_data_storage {
__u16 flags;
@@ -356,7 +372,9 @@ read
For future extensions
If the event type is CXL_EVENT_AFU_ERROR then the event structure
- is defined as:
+ is defined as
+
+ ::
struct cxl_event_afu_error {
__u16 flags;
@@ -393,15 +411,15 @@ open
ioctl
-----
-CXL_IOCTL_DOWNLOAD_IMAGE:
-CXL_IOCTL_VALIDATE_IMAGE:
+CXL_IOCTL_DOWNLOAD_IMAGE / CXL_IOCTL_VALIDATE_IMAGE:
Starts and controls flashing a new FPGA image. Partial
reconfiguration is not supported (yet), so the image must contain
a copy of the PSL and AFU(s). Since an image can be quite large,
the caller may have to iterate, splitting the image in smaller
chunks.
- Takes a pointer to a struct cxl_adapter_image:
+ Takes a pointer to a struct cxl_adapter_image::
+
struct cxl_adapter_image {
__u64 flags;
__u64 data;
@@ -442,7 +460,7 @@ Udev rules
The following udev rules could be used to create a symlink to the
most logical chardev to use in any programming mode (afuX.Yd for
dedicated, afuX.Ys for afu directed), since the API is virtually
- identical for each:
+ identical for each::
SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b"
SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \
diff --git a/Documentation/powerpc/cxlflash.txt b/Documentation/powerpc/cxlflash.rst
index a64bdaa0a1cf..cea67931b3b9 100644
--- a/Documentation/powerpc/cxlflash.txt
+++ b/Documentation/powerpc/cxlflash.rst
@@ -1,3 +1,7 @@
+================================
+Coherent Accelerator (CXL) Flash
+================================
+
Introduction
============
@@ -28,7 +32,7 @@ Introduction
responsible for the initialization of the adapter, setting up the
special path for user space access, and performing error recovery. It
communicates directly the Flash Accelerator Functional Unit (AFU)
- as described in Documentation/powerpc/cxl.txt.
+ as described in Documentation/powerpc/cxl.rst.
The cxlflash driver supports two, mutually exclusive, modes of
operation at the device (LUN) level:
@@ -58,7 +62,7 @@ Overview
The CXL Flash Adapter Driver establishes a master context with the
AFU. It uses memory mapped I/O (MMIO) for this control and setup. The
- Adapter Problem Space Memory Map looks like this:
+ Adapter Problem Space Memory Map looks like this::
+-------------------------------+
| 512 * 64 KB User MMIO |
@@ -375,7 +379,7 @@ CXL Flash Driver Host IOCTLs
Each host adapter instance that is supported by the cxlflash driver
has a special character device associated with it to enable a set of
host management function. These character devices are hosted in a
- class dedicated for cxlflash and can be accessed via /dev/cxlflash/*.
+ class dedicated for cxlflash and can be accessed via `/dev/cxlflash/*`.
Applications can be written to perform various functions using the
host ioctl APIs below.
diff --git a/Documentation/powerpc/DAWR-POWER9.txt b/Documentation/powerpc/dawr-power9.rst
index ecdbb076438c..c96ab6befd9c 100644
--- a/Documentation/powerpc/DAWR-POWER9.txt
+++ b/Documentation/powerpc/dawr-power9.rst
@@ -1,10 +1,11 @@
+=====================
DAWR issues on POWER9
-============================
+=====================
On POWER9 the Data Address Watchpoint Register (DAWR) can cause a checkstop
if it points to cache inhibited (CI) memory. Currently Linux has no way to
disinguish CI memory when configuring the DAWR, so (for now) the DAWR is
-disabled by this commit:
+disabled by this commit::
commit 9654153158d3e0684a1bdb76dbababdb7111d5a0
Author: Michael Neuling <mikey@neuling.org>
@@ -12,7 +13,7 @@ disabled by this commit:
powerpc: Disable DAWR in the base POWER9 CPU features
Technical Details:
-============================
+==================
DAWR has 6 different ways of being set.
1) ptrace
@@ -37,7 +38,7 @@ DAWR on the migration.
For xmon, the 'bd' command will return an error on P9.
Consequences for users
-============================
+======================
For GDB watchpoints (ie 'watch' command) on POWER9 bare metal , GDB
will accept the command. Unfortunately since there is no hardware
@@ -57,8 +58,8 @@ trapped in GDB. The watchpoint is remembered, so if the guest is
migrated back to the POWER8 host, it will start working again.
Force enabling the DAWR
-=============================
-Kernels (since ~v5.2) have an option to force enable the DAWR via:
+=======================
+Kernels (since ~v5.2) have an option to force enable the DAWR via::
echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous
@@ -86,5 +87,7 @@ dawr_enable_dangerous file will fail if the hypervisor doesn't support
writing the DAWR.
To double check the DAWR is working, run this kernel selftest:
+
tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
+
Any errors/failures/skips mean something is wrong.
diff --git a/Documentation/powerpc/dscr.txt b/Documentation/powerpc/dscr.rst
index ece300c64f76..2ab99006014c 100644
--- a/Documentation/powerpc/dscr.txt
+++ b/Documentation/powerpc/dscr.rst
@@ -1,5 +1,6 @@
- DSCR (Data Stream Control Register)
- ================================================
+===================================
+DSCR (Data Stream Control Register)
+===================================
DSCR register in powerpc allows user to have some control of prefetch of data
stream in the processor. Please refer to the ISA documents or related manual
@@ -10,14 +11,17 @@ user interface.
(A) Data Structures:
- (1) thread_struct:
+ (1) thread_struct::
+
dscr /* Thread DSCR value */
dscr_inherit /* Thread has changed default DSCR */
- (2) PACA:
+ (2) PACA::
+
dscr_default /* per-CPU DSCR default value */
- (3) sysfs.c:
+ (3) sysfs.c::
+
dscr_default /* System DSCR default value */
(B) Scheduler Changes:
@@ -35,8 +39,8 @@ user interface.
(C) SYSFS Interface:
- Global DSCR default: /sys/devices/system/cpu/dscr_default
- CPU specific DSCR default: /sys/devices/system/cpu/cpuN/dscr
+ - Global DSCR default: /sys/devices/system/cpu/dscr_default
+ - CPU specific DSCR default: /sys/devices/system/cpu/cpuN/dscr
Changing the global DSCR default in the sysfs will change all the CPU
specific DSCR defaults immediately in their PACA structures. Again if
diff --git a/Documentation/powerpc/eeh-pci-error-recovery.txt b/Documentation/powerpc/eeh-pci-error-recovery.rst
index 678189280bb4..438a87ebc095 100644
--- a/Documentation/powerpc/eeh-pci-error-recovery.txt
+++ b/Documentation/powerpc/eeh-pci-error-recovery.rst
@@ -1,10 +1,10 @@
+==========================
+PCI Bus EEH Error Recovery
+==========================
+Linas Vepstas <linas@austin.ibm.com>
- PCI Bus EEH Error Recovery
- --------------------------
- Linas Vepstas
- <linas@austin.ibm.com>
- 12 January 2005
+12 January 2005
Overview:
@@ -143,17 +143,17 @@ seen in /proc/ppc64/eeh (subject to change). Normally, almost
all of these occur during boot, when the PCI bus is scanned, where
a large number of 0xff reads are part of the bus scan procedure.
-If a frozen slot is detected, code in
-arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
-syslog (/var/log/messages). This stack trace has proven to be very
-useful to device-driver authors for finding out at what point the EEH
-error was detected, as the error itself usually occurs slightly
+If a frozen slot is detected, code in
+arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
+syslog (/var/log/messages). This stack trace has proven to be very
+useful to device-driver authors for finding out at what point the EEH
+error was detected, as the error itself usually occurs slightly
beforehand.
Next, it uses the Linux kernel notifier chain/work queue mechanism to
allow any interested parties to find out about the failure. Device
drivers, or other parts of the kernel, can use
-eeh_register_notifier(struct notifier_block *) to find out about EEH
+`eeh_register_notifier(struct notifier_block *)` to find out about EEH
events. The event will include a pointer to the pci device, the
device node and some state info. Receivers of the event can "do as
they wish"; the default handler will be described further in this
@@ -162,10 +162,13 @@ section.
To assist in the recovery of the device, eeh.c exports the
following functions:
-rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second
-rtas_configure_bridge() -- ask firmware to configure any PCI bridges
+rtas_set_slot_reset()
+ assert the PCI #RST line for 1/8th of a second
+rtas_configure_bridge()
+ ask firmware to configure any PCI bridges
located topologically under the pci slot.
-eeh_save_bars() and eeh_restore_bars(): save and restore the PCI
+eeh_save_bars() and eeh_restore_bars():
+ save and restore the PCI
config-space info for a device and any devices under it.
@@ -191,7 +194,7 @@ events get delivered to user-space scripts.
Following is an example sequence of events that cause a device driver
close function to be called during the first phase of an EEH reset.
-The following sequence is an example of the pcnet32 device driver.
+The following sequence is an example of the pcnet32 device driver::
rpa_php_unconfig_pci_adapter (struct slot *) // in rpaphp_pci.c
{
@@ -241,53 +244,54 @@ The following sequence is an example of the pcnet32 device driver.
}}}}}}
- in drivers/pci/pci_driver.c,
- struct device_driver->remove() is just pci_device_remove()
- which calls struct pci_driver->remove() which is pcnet32_remove_one()
- which calls unregister_netdev() (in net/core/dev.c)
- which calls dev_close() (in net/core/dev.c)
- which calls dev->stop() which is pcnet32_close()
- which then does the appropriate shutdown.
+in drivers/pci/pci_driver.c,
+struct device_driver->remove() is just pci_device_remove()
+which calls struct pci_driver->remove() which is pcnet32_remove_one()
+which calls unregister_netdev() (in net/core/dev.c)
+which calls dev_close() (in net/core/dev.c)
+which calls dev->stop() which is pcnet32_close()
+which then does the appropriate shutdown.
---
+
Following is the analogous stack trace for events sent to user-space
-when the pci device is unconfigured.
+when the pci device is unconfigured::
-rpa_php_unconfig_pci_adapter() { // in rpaphp_pci.c
- calls
- pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
+ rpa_php_unconfig_pci_adapter() { // in rpaphp_pci.c
calls
- pci_destroy_dev (struct pci_dev *) {
+ pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
calls
- device_unregister (&dev->dev) { // in /drivers/base/core.c
+ pci_destroy_dev (struct pci_dev *) {
calls
- device_del(struct device * dev) { // in /drivers/base/core.c
+ device_unregister (&dev->dev) { // in /drivers/base/core.c
calls
- kobject_del() { //in /libs/kobject.c
+ device_del(struct device * dev) { // in /drivers/base/core.c
calls
- kobject_uevent() { // in /libs/kobject.c
+ kobject_del() { //in /libs/kobject.c
calls
- kset_uevent() { // in /lib/kobject.c
+ kobject_uevent() { // in /libs/kobject.c
calls
- kset->uevent_ops->uevent() // which is really just
- a call to
- dev_uevent() { // in /drivers/base/core.c
+ kset_uevent() { // in /lib/kobject.c
calls
- dev->bus->uevent() which is really just a call to
- pci_uevent () { // in drivers/pci/hotplug.c
- which prints device name, etc....
+ kset->uevent_ops->uevent() // which is really just
+ a call to
+ dev_uevent() { // in /drivers/base/core.c
+ calls
+ dev->bus->uevent() which is really just a call to
+ pci_uevent () { // in drivers/pci/hotplug.c
+ which prints device name, etc....
+ }
}
- }
- then kobject_uevent() sends a netlink uevent to userspace
- --> userspace uevent
- (during early boot, nobody listens to netlink events and
- kobject_uevent() executes uevent_helper[], which runs the
- event process /sbin/hotplug)
+ then kobject_uevent() sends a netlink uevent to userspace
+ --> userspace uevent
+ (during early boot, nobody listens to netlink events and
+ kobject_uevent() executes uevent_helper[], which runs the
+ event process /sbin/hotplug)
+ }
}
- }
- kobject_del() then calls sysfs_remove_dir(), which would
- trigger any user-space daemon that was watching /sysfs,
- and notice the delete event.
+ kobject_del() then calls sysfs_remove_dir(), which would
+ trigger any user-space daemon that was watching /sysfs,
+ and notice the delete event.
Pro's and Con's of the Current Design
@@ -299,12 +303,12 @@ individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.
--- A minor complaint is that resetting the network card causes
+- A minor complaint is that resetting the network card causes
user-space back-to-back ifdown/ifup burps that potentially disturb
network daemons, that didn't need to even know that the pci
card was being rebooted.
--- A more serious concern is that the same reset, for SCSI devices,
+- A more serious concern is that the same reset, for SCSI devices,
causes havoc to mounted file systems. Scripts cannot post-facto
unmount a file system without flushing pending buffers, but this
is impossible, because I/O has already been stopped. Thus,
@@ -322,7 +326,7 @@ network daemons and file systems that didn't need to be disturbed.
from the block layer. It would be very natural to add an EEH
reset into this chain of events.
--- If a SCSI error occurs for the root device, all is lost unless
+- If a SCSI error occurs for the root device, all is lost unless
the sysadmin had the foresight to run /bin, /sbin, /etc, /var
and so on, out of ramdisk/tmpfs.
@@ -330,5 +334,3 @@ network daemons and file systems that didn't need to be disturbed.
Conclusions
-----------
There's forward progress ...
-
-
diff --git a/Documentation/powerpc/elfnote.rst b/Documentation/powerpc/elfnote.rst
new file mode 100644
index 000000000000..06602248621c
--- /dev/null
+++ b/Documentation/powerpc/elfnote.rst
@@ -0,0 +1,41 @@
+==========================
+ELF Note PowerPC Namespace
+==========================
+
+The PowerPC namespace in an ELF Note of the kernel binary is used to store
+capabilities and information which can be used by a bootloader or userland.
+
+Types and Descriptors
+---------------------
+
+The types to be used with the "PowerPC" namesapce are defined in [#f1]_.
+
+ 1) PPC_ELFNOTE_CAPABILITIES
+
+Define the capabilities supported/required by the kernel. This type uses a
+bitmap as "descriptor" field. Each bit is described below:
+
+- Ultravisor-capable bit (PowerNV only).
+
+.. code-block:: c
+
+ #define PPCCAP_ULTRAVISOR_BIT (1 << 0)
+
+Indicate that the powerpc kernel binary knows how to run in an
+ultravisor-enabled system.
+
+In an ultravisor-enabled system, some machine resources are now controlled
+by the ultravisor. If the kernel is not ultravisor-capable, but it ends up
+being run on a machine with ultravisor, the kernel will probably crash
+trying to access ultravisor resources. For instance, it may crash in early
+boot trying to set the partition table entry 0.
+
+In an ultravisor-enabled system, a bootloader could warn the user or prevent
+the kernel from being run if the PowerPC ultravisor capability doesn't exist
+or the Ultravisor-capable bit is not set.
+
+References
+----------
+
+.. [#f1] arch/powerpc/include/asm/elfnote.h
+
diff --git a/Documentation/powerpc/firmware-assisted-dump.rst b/Documentation/powerpc/firmware-assisted-dump.rst
new file mode 100644
index 000000000000..0455a78486d5
--- /dev/null
+++ b/Documentation/powerpc/firmware-assisted-dump.rst
@@ -0,0 +1,361 @@
+======================
+Firmware-Assisted Dump
+======================
+
+July 2011
+
+The goal of firmware-assisted dump is to enable the dump of
+a crashed system, and to do so from a fully-reset system, and
+to minimize the total elapsed time until the system is back
+in production use.
+
+- Firmware-Assisted Dump (FADump) infrastructure is intended to replace
+ the existing phyp assisted dump.
+- Fadump uses the same firmware interfaces and memory reservation model
+ as phyp assisted dump.
+- Unlike phyp dump, FADump exports the memory dump through /proc/vmcore
+ in the ELF format in the same way as kdump. This helps us reuse the
+ kdump infrastructure for dump capture and filtering.
+- Unlike phyp dump, userspace tool does not need to refer any sysfs
+ interface while reading /proc/vmcore.
+- Unlike phyp dump, FADump allows user to release all the memory reserved
+ for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem.
+- Once enabled through kernel boot parameter, FADump can be
+ started/stopped through /sys/kernel/fadump_registered interface (see
+ sysfs files section below) and can be easily integrated with kdump
+ service start/stop init scripts.
+
+Comparing with kdump or other strategies, firmware-assisted
+dump offers several strong, practical advantages:
+
+- Unlike kdump, the system has been reset, and loaded
+ with a fresh copy of the kernel. In particular,
+ PCI and I/O devices have been reinitialized and are
+ in a clean, consistent state.
+- Once the dump is copied out, the memory that held the dump
+ is immediately available to the running kernel. And therefore,
+ unlike kdump, FADump doesn't need a 2nd reboot to get back
+ the system to the production configuration.
+
+The above can only be accomplished by coordination with,
+and assistance from the Power firmware. The procedure is
+as follows:
+
+- The first kernel registers the sections of memory with the
+ Power firmware for dump preservation during OS initialization.
+ These registered sections of memory are reserved by the first
+ kernel during early boot.
+
+- When system crashes, the Power firmware will copy the registered
+ low memory regions (boot memory) from source to destination area.
+ It will also save hardware PTE's.
+
+ NOTE:
+ The term 'boot memory' means size of the low memory chunk
+ that is required for a kernel to boot successfully when
+ booted with restricted memory. By default, the boot memory
+ size will be the larger of 5% of system RAM or 256MB.
+ Alternatively, user can also specify boot memory size
+ through boot parameter 'crashkernel=' which will override
+ the default calculated size. Use this option if default
+ boot memory size is not sufficient for second kernel to
+ boot successfully. For syntax of crashkernel= parameter,
+ refer to Documentation/admin-guide/kdump/kdump.rst. If any
+ offset is provided in crashkernel= parameter, it will be
+ ignored as FADump uses a predefined offset to reserve memory
+ for boot memory dump preservation in case of a crash.
+
+- After the low memory (boot memory) area has been saved, the
+ firmware will reset PCI and other hardware state. It will
+ *not* clear the RAM. It will then launch the bootloader, as
+ normal.
+
+- The freshly booted kernel will notice that there is a new node
+ (rtas/ibm,kernel-dump on pSeries or ibm,opal/dump/mpipl-boot
+ on OPAL platform) in the device tree, indicating that
+ there is crash data available from a previous boot. During
+ the early boot OS will reserve rest of the memory above
+ boot memory size effectively booting with restricted memory
+ size. This will make sure that this kernel (also, referred
+ to as second kernel or capture kernel) will not touch any
+ of the dump memory area.
+
+- User-space tools will read /proc/vmcore to obtain the contents
+ of memory, which holds the previous crashed kernel dump in ELF
+ format. The userspace tools may copy this info to disk, or
+ network, nas, san, iscsi, etc. as desired.
+
+- Once the userspace tool is done saving dump, it will echo
+ '1' to /sys/kernel/fadump_release_mem to release the reserved
+ memory back to general use, except the memory required for
+ next firmware-assisted dump registration.
+
+ e.g.::
+
+ # echo 1 > /sys/kernel/fadump_release_mem
+
+Please note that the firmware-assisted dump feature
+is only available on POWER6 and above systems on pSeries
+(PowerVM) platform and POWER9 and above systems with OP940
+or later firmware versions on PowerNV (OPAL) platform.
+Note that, OPAL firmware exports ibm,opal/dump node when
+FADump is supported on PowerNV platform.
+
+On OPAL based machines, system first boots into an intermittent
+kernel (referred to as petitboot kernel) before booting into the
+capture kernel. This kernel would have minimal kernel and/or
+userspace support to process crash data. Such kernel needs to
+preserve previously crash'ed kernel's memory for the subsequent
+capture kernel boot to process this crash data. Kernel config
+option CONFIG_PRESERVE_FA_DUMP has to be enabled on such kernel
+to ensure that crash data is preserved to process later.
+
+-- On OPAL based machines (PowerNV), if the kernel is build with
+ CONFIG_OPAL_CORE=y, OPAL memory at the time of crash is also
+ exported as /sys/firmware/opal/core file. This procfs file is
+ helpful in debugging OPAL crashes with GDB. The kernel memory
+ used for exporting this procfs file can be released by echo'ing
+ '1' to /sys/kernel/fadump_release_opalcore node.
+
+ e.g.
+ # echo 1 > /sys/kernel/fadump_release_opalcore
+
+Implementation details:
+-----------------------
+
+During boot, a check is made to see if firmware supports
+this feature on that particular machine. If it does, then
+we check to see if an active dump is waiting for us. If yes
+then everything but boot memory size of RAM is reserved during
+early boot (See Fig. 2). This area is released once we finish
+collecting the dump from user land scripts (e.g. kdump scripts)
+that are run. If there is dump data, then the
+/sys/kernel/fadump_release_mem file is created, and the reserved
+memory is held.
+
+If there is no waiting dump data, then only the memory required to
+hold CPU state, HPTE region, boot memory dump, FADump header and
+elfcore header, is usually reserved at an offset greater than boot
+memory size (see Fig. 1). This area is *not* released: this region
+will be kept permanently reserved, so that it can act as a receptacle
+for a copy of the boot memory content in addition to CPU state and
+HPTE region, in the case a crash does occur.
+
+Since this reserved memory area is used only after the system crash,
+there is no point in blocking this significant chunk of memory from
+production kernel. Hence, the implementation uses the Linux kernel's
+Contiguous Memory Allocator (CMA) for memory reservation if CMA is
+configured for kernel. With CMA reservation this memory will be
+available for applications to use it, while kernel is prevented from
+using it. With this FADump will still be able to capture all of the
+kernel memory and most of the user space memory except the user pages
+that were present in CMA region::
+
+ o Memory Reservation during first kernel
+
+ Low memory Top of memory
+ 0 boot memory size |<--- Reserved dump area --->| |
+ | | | Permanent Reservation | |
+ V V | | V
+ +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
+ | | |///|////| DUMP | HDR | ELF |////| |
+ +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
+ | ^ ^ ^ ^ ^
+ | | | | | |
+ \ CPU HPTE / | |
+ ------------------------------ | |
+ Boot memory content gets transferred | |
+ to reserved area by firmware at the | |
+ time of crash. | |
+ FADump Header |
+ (meta area) |
+ |
+ |
+ Metadata: This area holds a metadata struture whose
+ address is registered with f/w and retrieved in the
+ second kernel after crash, on platforms that support
+ tags (OPAL). Having such structure with info needed
+ to process the crashdump eases dump capture process.
+
+ Fig. 1
+
+
+ o Memory Reservation during second kernel after crash
+
+ Low memory Top of memory
+ 0 boot memory size |
+ | |<------------ Crash preserved area ------------>|
+ V V |<--- Reserved dump area --->| |
+ +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
+ | | |///|////| DUMP | HDR | ELF |////| |
+ +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
+ | |
+ V V
+ Used by second /proc/vmcore
+ kernel to boot
+
+ +---+
+ |///| -> Regions (CPU, HPTE & Metadata) marked like this in the above
+ +---+ figures are not always present. For example, OPAL platform
+ does not have CPU & HPTE regions while Metadata region is
+ not supported on pSeries currently.
+
+ Fig. 2
+
+
+Currently the dump will be copied from /proc/vmcore to a new file upon
+user intervention. The dump data available through /proc/vmcore will be
+in ELF format. Hence the existing kdump infrastructure (kdump scripts)
+to save the dump works fine with minor modifications. KDump scripts on
+major Distro releases have already been modified to work seemlessly (no
+user intervention in saving the dump) when FADump is used, instead of
+KDump, as dump mechanism.
+
+The tools to examine the dump will be same as the ones
+used for kdump.
+
+How to enable firmware-assisted dump (FADump):
+----------------------------------------------
+
+1. Set config option CONFIG_FA_DUMP=y and build kernel.
+2. Boot into linux kernel with 'fadump=on' kernel cmdline option.
+ By default, FADump reserved memory will be initialized as CMA area.
+ Alternatively, user can boot linux kernel with 'fadump=nocma' to
+ prevent FADump to use CMA.
+3. Optionally, user can also set 'crashkernel=' kernel cmdline
+ to specify size of the memory to reserve for boot memory dump
+ preservation.
+
+NOTE:
+ 1. 'fadump_reserve_mem=' parameter has been deprecated. Instead
+ use 'crashkernel=' to specify size of the memory to reserve
+ for boot memory dump preservation.
+ 2. If firmware-assisted dump fails to reserve memory then it
+ will fallback to existing kdump mechanism if 'crashkernel='
+ option is set at kernel cmdline.
+ 3. if user wants to capture all of user space memory and ok with
+ reserved memory not available to production system, then
+ 'fadump=nocma' kernel parameter can be used to fallback to
+ old behaviour.
+
+Sysfs/debugfs files:
+--------------------
+
+Firmware-assisted dump feature uses sysfs file system to hold
+the control files and debugfs file to display memory reserved region.
+
+Here is the list of files under kernel sysfs:
+
+ /sys/kernel/fadump_enabled
+ This is used to display the FADump status.
+
+ - 0 = FADump is disabled
+ - 1 = FADump is enabled
+
+ This interface can be used by kdump init scripts to identify if
+ FADump is enabled in the kernel and act accordingly.
+
+ /sys/kernel/fadump_registered
+ This is used to display the FADump registration status as well
+ as to control (start/stop) the FADump registration.
+
+ - 0 = FADump is not registered.
+ - 1 = FADump is registered and ready to handle system crash.
+
+ To register FADump echo 1 > /sys/kernel/fadump_registered and
+ echo 0 > /sys/kernel/fadump_registered for un-register and stop the
+ FADump. Once the FADump is un-registered, the system crash will not
+ be handled and vmcore will not be captured. This interface can be
+ easily integrated with kdump service start/stop.
+
+ /sys/kernel/fadump_release_mem
+ This file is available only when FADump is active during
+ second kernel. This is used to release the reserved memory
+ region that are held for saving crash dump. To release the
+ reserved memory echo 1 to it::
+
+ echo 1 > /sys/kernel/fadump_release_mem
+
+ After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
+ file will change to reflect the new memory reservations.
+
+ The existing userspace tools (kdump infrastructure) can be easily
+ enhanced to use this interface to release the memory reserved for
+ dump and continue without 2nd reboot.
+
+ /sys/kernel/fadump_release_opalcore
+
+ This file is available only on OPAL based machines when FADump is
+ active during capture kernel. This is used to release the memory
+ used by the kernel to export /sys/firmware/opal/core file. To
+ release this memory, echo '1' to it:
+
+ echo 1 > /sys/kernel/fadump_release_opalcore
+
+Here is the list of files under powerpc debugfs:
+(Assuming debugfs is mounted on /sys/kernel/debug directory.)
+
+ /sys/kernel/debug/powerpc/fadump_region
+ This file shows the reserved memory regions if FADump is
+ enabled otherwise this file is empty. The output format
+ is::
+
+ <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
+
+ and for kernel DUMP region is:
+
+ DUMP: Src: <src-addr>, Dest: <dest-addr>, Size: <size>, Dumped: # bytes
+
+ e.g.
+ Contents when FADump is registered during first kernel::
+
+ # cat /sys/kernel/debug/powerpc/fadump_region
+ CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
+ HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
+ DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
+
+ Contents when FADump is active during second kernel::
+
+ # cat /sys/kernel/debug/powerpc/fadump_region
+ CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
+ HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
+ DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
+ : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
+
+
+NOTE:
+ Please refer to Documentation/filesystems/debugfs.txt on
+ how to mount the debugfs filesystem.
+
+
+TODO:
+-----
+ - Need to come up with the better approach to find out more
+ accurate boot memory size that is required for a kernel to
+ boot successfully when booted with restricted memory.
+ - The FADump implementation introduces a FADump crash info structure
+ in the scratch area before the ELF core header. The idea of introducing
+ this structure is to pass some important crash info data to the second
+ kernel which will help second kernel to populate ELF core header with
+ correct data before it gets exported through /proc/vmcore. The current
+ design implementation does not address a possibility of introducing
+ additional fields (in future) to this structure without affecting
+ compatibility. Need to come up with the better approach to address this.
+
+ The possible approaches are:
+
+ 1. Introduce version field for version tracking, bump up the version
+ whenever a new field is added to the structure in future. The version
+ field can be used to find out what fields are valid for the current
+ version of the structure.
+ 2. Reserve the area of predefined size (say PAGE_SIZE) for this
+ structure and have unused area as reserved (initialized to zero)
+ for future field additions.
+
+ The advantage of approach 1 over 2 is we don't need to reserve extra space.
+
+Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
+
+This document is based on the original documentation written for phyp
+
+assisted dump by Linas Vepstas and Manish Ahuja.
diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
deleted file mode 100644
index 18c5feef2577..000000000000
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ /dev/null
@@ -1,292 +0,0 @@
-
- Firmware-Assisted Dump
- ------------------------
- July 2011
-
-The goal of firmware-assisted dump is to enable the dump of
-a crashed system, and to do so from a fully-reset system, and
-to minimize the total elapsed time until the system is back
-in production use.
-
-- Firmware assisted dump (fadump) infrastructure is intended to replace
- the existing phyp assisted dump.
-- Fadump uses the same firmware interfaces and memory reservation model
- as phyp assisted dump.
-- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore
- in the ELF format in the same way as kdump. This helps us reuse the
- kdump infrastructure for dump capture and filtering.
-- Unlike phyp dump, userspace tool does not need to refer any sysfs
- interface while reading /proc/vmcore.
-- Unlike phyp dump, fadump allows user to release all the memory reserved
- for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem.
-- Once enabled through kernel boot parameter, fadump can be
- started/stopped through /sys/kernel/fadump_registered interface (see
- sysfs files section below) and can be easily integrated with kdump
- service start/stop init scripts.
-
-Comparing with kdump or other strategies, firmware-assisted
-dump offers several strong, practical advantages:
-
--- Unlike kdump, the system has been reset, and loaded
- with a fresh copy of the kernel. In particular,
- PCI and I/O devices have been reinitialized and are
- in a clean, consistent state.
--- Once the dump is copied out, the memory that held the dump
- is immediately available to the running kernel. And therefore,
- unlike kdump, fadump doesn't need a 2nd reboot to get back
- the system to the production configuration.
-
-The above can only be accomplished by coordination with,
-and assistance from the Power firmware. The procedure is
-as follows:
-
--- The first kernel registers the sections of memory with the
- Power firmware for dump preservation during OS initialization.
- These registered sections of memory are reserved by the first
- kernel during early boot.
-
--- When a system crashes, the Power firmware will save
- the low memory (boot memory of size larger of 5% of system RAM
- or 256MB) of RAM to the previous registered region. It will
- also save system registers, and hardware PTE's.
-
- NOTE: The term 'boot memory' means size of the low memory chunk
- that is required for a kernel to boot successfully when
- booted with restricted memory. By default, the boot memory
- size will be the larger of 5% of system RAM or 256MB.
- Alternatively, user can also specify boot memory size
- through boot parameter 'crashkernel=' which will override
- the default calculated size. Use this option if default
- boot memory size is not sufficient for second kernel to
- boot successfully. For syntax of crashkernel= parameter,
- refer to Documentation/kdump/kdump.txt. If any offset is
- provided in crashkernel= parameter, it will be ignored
- as fadump uses a predefined offset to reserve memory
- for boot memory dump preservation in case of a crash.
-
--- After the low memory (boot memory) area has been saved, the
- firmware will reset PCI and other hardware state. It will
- *not* clear the RAM. It will then launch the bootloader, as
- normal.
-
--- The freshly booted kernel will notice that there is a new
- node (ibm,dump-kernel) in the device tree, indicating that
- there is crash data available from a previous boot. During
- the early boot OS will reserve rest of the memory above
- boot memory size effectively booting with restricted memory
- size. This will make sure that the second kernel will not
- touch any of the dump memory area.
-
--- User-space tools will read /proc/vmcore to obtain the contents
- of memory, which holds the previous crashed kernel dump in ELF
- format. The userspace tools may copy this info to disk, or
- network, nas, san, iscsi, etc. as desired.
-
--- Once the userspace tool is done saving dump, it will echo
- '1' to /sys/kernel/fadump_release_mem to release the reserved
- memory back to general use, except the memory required for
- next firmware-assisted dump registration.
-
- e.g.
- # echo 1 > /sys/kernel/fadump_release_mem
-
-Please note that the firmware-assisted dump feature
-is only available on Power6 and above systems with recent
-firmware versions.
-
-Implementation details:
-----------------------
-
-During boot, a check is made to see if firmware supports
-this feature on that particular machine. If it does, then
-we check to see if an active dump is waiting for us. If yes
-then everything but boot memory size of RAM is reserved during
-early boot (See Fig. 2). This area is released once we finish
-collecting the dump from user land scripts (e.g. kdump scripts)
-that are run. If there is dump data, then the
-/sys/kernel/fadump_release_mem file is created, and the reserved
-memory is held.
-
-If there is no waiting dump data, then only the memory required
-to hold CPU state, HPTE region, boot memory dump and elfcore
-header, is usually reserved at an offset greater than boot memory
-size (see Fig. 1). This area is *not* released: this region will
-be kept permanently reserved, so that it can act as a receptacle
-for a copy of the boot memory content in addition to CPU state
-and HPTE region, in the case a crash does occur. Since this reserved
-memory area is used only after the system crash, there is no point in
-blocking this significant chunk of memory from production kernel.
-Hence, the implementation uses the Linux kernel's Contiguous Memory
-Allocator (CMA) for memory reservation if CMA is configured for kernel.
-With CMA reservation this memory will be available for applications to
-use it, while kernel is prevented from using it. With this fadump will
-still be able to capture all of the kernel memory and most of the user
-space memory except the user pages that were present in CMA region.
-
- o Memory Reservation during first kernel
-
- Low memory Top of memory
- 0 boot memory size |
- | | |<--Reserved dump area -->| |
- V V | Permanent Reservation | V
- +-----------+----------/ /---+---+----+-----------+----+------+
- | | |CPU|HPTE| DUMP |ELF | |
- +-----------+----------/ /---+---+----+-----------+----+------+
- | ^
- | |
- \ /
- -------------------------------------------
- Boot memory content gets transferred to
- reserved area by firmware at the time of
- crash
- Fig. 1
-
- o Memory Reservation during second kernel after crash
-
- Low memory Top of memory
- 0 boot memory size |
- | |<------------- Reserved dump area ----------- -->|
- V V V
- +-----------+----------/ /---+---+----+-----------+----+------+
- | | |CPU|HPTE| DUMP |ELF | |
- +-----------+----------/ /---+---+----+-----------+----+------+
- | |
- V V
- Used by second /proc/vmcore
- kernel to boot
- Fig. 2
-
-Currently the dump will be copied from /proc/vmcore to a
-a new file upon user intervention. The dump data available through
-/proc/vmcore will be in ELF format. Hence the existing kdump
-infrastructure (kdump scripts) to save the dump works fine with
-minor modifications.
-
-The tools to examine the dump will be same as the ones
-used for kdump.
-
-How to enable firmware-assisted dump (fadump):
--------------------------------------
-
-1. Set config option CONFIG_FA_DUMP=y and build kernel.
-2. Boot into linux kernel with 'fadump=on' kernel cmdline option.
- By default, fadump reserved memory will be initialized as CMA area.
- Alternatively, user can boot linux kernel with 'fadump=nocma' to
- prevent fadump to use CMA.
-3. Optionally, user can also set 'crashkernel=' kernel cmdline
- to specify size of the memory to reserve for boot memory dump
- preservation.
-
-NOTE: 1. 'fadump_reserve_mem=' parameter has been deprecated. Instead
- use 'crashkernel=' to specify size of the memory to reserve
- for boot memory dump preservation.
- 2. If firmware-assisted dump fails to reserve memory then it
- will fallback to existing kdump mechanism if 'crashkernel='
- option is set at kernel cmdline.
- 3. if user wants to capture all of user space memory and ok with
- reserved memory not available to production system, then
- 'fadump=nocma' kernel parameter can be used to fallback to
- old behaviour.
-
-Sysfs/debugfs files:
-------------
-
-Firmware-assisted dump feature uses sysfs file system to hold
-the control files and debugfs file to display memory reserved region.
-
-Here is the list of files under kernel sysfs:
-
- /sys/kernel/fadump_enabled
-
- This is used to display the fadump status.
- 0 = fadump is disabled
- 1 = fadump is enabled
-
- This interface can be used by kdump init scripts to identify if
- fadump is enabled in the kernel and act accordingly.
-
- /sys/kernel/fadump_registered
-
- This is used to display the fadump registration status as well
- as to control (start/stop) the fadump registration.
- 0 = fadump is not registered.
- 1 = fadump is registered and ready to handle system crash.
-
- To register fadump echo 1 > /sys/kernel/fadump_registered and
- echo 0 > /sys/kernel/fadump_registered for un-register and stop the
- fadump. Once the fadump is un-registered, the system crash will not
- be handled and vmcore will not be captured. This interface can be
- easily integrated with kdump service start/stop.
-
- /sys/kernel/fadump_release_mem
-
- This file is available only when fadump is active during
- second kernel. This is used to release the reserved memory
- region that are held for saving crash dump. To release the
- reserved memory echo 1 to it:
-
- echo 1 > /sys/kernel/fadump_release_mem
-
- After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
- file will change to reflect the new memory reservations.
-
- The existing userspace tools (kdump infrastructure) can be easily
- enhanced to use this interface to release the memory reserved for
- dump and continue without 2nd reboot.
-
-Here is the list of files under powerpc debugfs:
-(Assuming debugfs is mounted on /sys/kernel/debug directory.)
-
- /sys/kernel/debug/powerpc/fadump_region
-
- This file shows the reserved memory regions if fadump is
- enabled otherwise this file is empty. The output format
- is:
- <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
-
- e.g.
- Contents when fadump is registered during first kernel
-
- # cat /sys/kernel/debug/powerpc/fadump_region
- CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
- HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
- DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
-
- Contents when fadump is active during second kernel
-
- # cat /sys/kernel/debug/powerpc/fadump_region
- CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
- HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
- DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
- : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
-
-NOTE: Please refer to Documentation/filesystems/debugfs.txt on
- how to mount the debugfs filesystem.
-
-
-TODO:
------
- o Need to come up with the better approach to find out more
- accurate boot memory size that is required for a kernel to
- boot successfully when booted with restricted memory.
- o The fadump implementation introduces a fadump crash info structure
- in the scratch area before the ELF core header. The idea of introducing
- this structure is to pass some important crash info data to the second
- kernel which will help second kernel to populate ELF core header with
- correct data before it gets exported through /proc/vmcore. The current
- design implementation does not address a possibility of introducing
- additional fields (in future) to this structure without affecting
- compatibility. Need to come up with the better approach to address this.
- The possible approaches are:
- 1. Introduce version field for version tracking, bump up the version
- whenever a new field is added to the structure in future. The version
- field can be used to find out what fields are valid for the current
- version of the structure.
- 2. Reserve the area of predefined size (say PAGE_SIZE) for this
- structure and have unused area as reserved (initialized to zero)
- for future field additions.
- The advantage of approach 1 over 2 is we don't need to reserve extra space.
----
-Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
-This document is based on the original documentation written for phyp
-assisted dump by Linas Vepstas and Manish Ahuja.
diff --git a/Documentation/powerpc/hvcs.txt b/Documentation/powerpc/hvcs.rst
index a730ca5a07f8..6808acde672f 100644
--- a/Documentation/powerpc/hvcs.txt
+++ b/Documentation/powerpc/hvcs.rst
@@ -1,19 +1,22 @@
-===========================================================================
- HVCS
- IBM "Hypervisor Virtual Console Server" Installation Guide
- for Linux Kernel 2.6.4+
- Copyright (C) 2004 IBM Corporation
+===============================================================
+HVCS IBM "Hypervisor Virtual Console Server" Installation Guide
+===============================================================
-===========================================================================
-NOTE:Eight space tabs are the optimum editor setting for reading this file.
-===========================================================================
+for Linux Kernel 2.6.4+
- Author(s) : Ryan S. Arnold <rsa@us.ibm.com>
- Date Created: March, 02, 2004
- Last Changed: August, 24, 2004
+Copyright (C) 2004 IBM Corporation
----------------------------------------------------------------------------
-Table of contents:
+.. ===========================================================================
+.. NOTE:Eight space tabs are the optimum editor setting for reading this file.
+.. ===========================================================================
+
+
+Author(s): Ryan S. Arnold <rsa@us.ibm.com>
+
+Date Created: March, 02, 2004
+Last Changed: August, 24, 2004
+
+.. Table of contents:
1. Driver Introduction:
2. System Requirements
@@ -27,8 +30,8 @@ Table of contents:
8. Questions & Answers:
9. Reporting Bugs:
----------------------------------------------------------------------------
1. Driver Introduction:
+=======================
This is the device driver for the IBM Hypervisor Virtual Console Server,
"hvcs". The IBM hvcs provides a tty driver interface to allow Linux user
@@ -38,8 +41,8 @@ ppc64 system. Physical hardware consoles per partition are not practical
on this hardware so system consoles are accessed by this driver using
firmware interfaces to virtual terminal devices.
----------------------------------------------------------------------------
2. System Requirements:
+=======================
This device driver was written using 2.6.4 Linux kernel APIs and will only
build and run on kernels of this version or later.
@@ -52,8 +55,8 @@ Sysfs must be mounted on the system so that the user can determine which
major and minor numbers are associated with each vty-server. Directions
for sysfs mounting are outside the scope of this document.
----------------------------------------------------------------------------
3. Build Options:
+=================
The hvcs driver registers itself as a tty driver. The tty layer
dynamically allocates a block of major and minor numbers in a quantity
@@ -65,11 +68,11 @@ If the default number of device entries is adequate then this driver can be
built into the kernel. If not, the default can be over-ridden by inserting
the driver as a module with insmod parameters.
----------------------------------------------------------------------------
3.1 Built-in:
+-------------
The following menuconfig example demonstrates selecting to build this
-driver into the kernel.
+driver into the kernel::
Device Drivers --->
Character devices --->
@@ -77,11 +80,11 @@ driver into the kernel.
Begin the kernel make process.
----------------------------------------------------------------------------
3.2 Module:
+-----------
The following menuconfig example demonstrates selecting to build this
-driver as a kernel module.
+driver as a kernel module::
Device Drivers --->
Character devices --->
@@ -89,11 +92,11 @@ driver as a kernel module.
The make process will build the following kernel modules:
- hvcs.ko
- hvcserver.ko
+ - hvcs.ko
+ - hvcserver.ko
To insert the module with the default allocation execute the following
-commands in the order they appear:
+commands in the order they appear::
insmod hvcserver.ko
insmod hvcs.ko
@@ -103,7 +106,7 @@ be inserted first, otherwise the hvcs module will not find some of the
symbols it expects.
To override the default use an insmod parameter as follows (requesting 4
-tty devices as an example):
+tty devices as an example)::
insmod hvcs.ko hvcs_parm_num_devs=4
@@ -115,31 +118,31 @@ source file before building.
NOTE: The length of time it takes to insmod the driver seems to be related
to the number of tty interfaces the registering driver requests.
-In order to remove the driver module execute the following command:
+In order to remove the driver module execute the following command::
rmmod hvcs.ko
The recommended method for installing hvcs as a module is to use depmod to
build a current modules.dep file in /lib/modules/`uname -r` and then
-execute:
+execute::
-modprobe hvcs hvcs_parm_num_devs=4
+ modprobe hvcs hvcs_parm_num_devs=4
The modules.dep file indicates that hvcserver.ko needs to be inserted
before hvcs.ko and modprobe uses this file to smartly insert the modules in
the proper order.
The following modprobe command is used to remove hvcs and hvcserver in the
-proper order:
+proper order::
-modprobe -r hvcs
+ modprobe -r hvcs
----------------------------------------------------------------------------
4. Installation:
+================
The tty layer creates sysfs entries which contain the major and minor
numbers allocated for the hvcs driver. The following snippet of "tree"
-output of the sysfs directory shows where these numbers are presented:
+output of the sysfs directory shows where these numbers are presented::
sys/
|-- *other sysfs base dirs*
@@ -164,7 +167,7 @@ output of the sysfs directory shows where these numbers are presented:
|-- *other sysfs base dirs*
For the above examples the following output is a result of cat'ing the
-"dev" entry in the hvcs directory:
+"dev" entry in the hvcs directory::
Pow5:/sys/class/tty/hvcs0/ # cat dev
254:0
@@ -184,7 +187,7 @@ systems running hvcs will already have the device entries created or udev
will do it automatically.
Given the example output above, to manually create a /dev/hvcs* node entry
-mknod can be used as follows:
+mknod can be used as follows::
mknod /dev/hvcs0 c 254 0
mknod /dev/hvcs1 c 254 1
@@ -195,15 +198,15 @@ Using mknod to manually create the device entries makes these device nodes
persistent. Once created they will exist prior to the driver insmod.
Attempting to connect an application to /dev/hvcs* prior to insertion of
-the hvcs module will result in an error message similar to the following:
+the hvcs module will result in an error message similar to the following::
"/dev/hvcs*: No such device".
NOTE: Just because there is a device node present doesn't mean that there
is a vty-server device configured for that node.
----------------------------------------------------------------------------
5. Connection
+=============
Since this driver controls devices that provide a tty interface a user can
interact with the device node entries using any standard tty-interactive
@@ -249,7 +252,7 @@ vty-server adapter is associated with which /dev/hvcs* node a special sysfs
attribute has been added to each vty-server sysfs entry. This entry is
called "index" and showing it reveals an integer that refers to the
/dev/hvcs* entry to use to connect to that device. For instance cating the
-index attribute of vty-server adapter 30000004 shows the following.
+index attribute of vty-server adapter 30000004 shows the following::
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat index
2
@@ -262,8 +265,8 @@ system the /dev/hvcs* entry that interacts with a particular vty-server
adapter is not guaranteed to remain the same across system reboots. Look
in the Q & A section for more on this issue.
----------------------------------------------------------------------------
6. Disconnection
+================
As a security feature to prevent the delivery of stale data to an
unintended target the Power5 system firmware disables the fetching of data
@@ -305,7 +308,7 @@ connection between the vty-server and target vty ONLY if the vterm_state
previously read '1'. The write directive is ignored if the vterm_state
read '0' or if any value other than '0' was written to the vterm_state
attribute. The following example will show the method used for verifying
-the vty-server connection status and disconnecting a vty-server connection.
+the vty-server connection status and disconnecting a vty-server connection::
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat vterm_state
1
@@ -318,12 +321,12 @@ the vty-server connection status and disconnecting a vty-server connection.
All vty-server connections are automatically terminated when the device is
hotplug removed and when the module is removed.
----------------------------------------------------------------------------
7. Configuration
+================
Each vty-server has a sysfs entry in the /sys/devices/vio directory, which
is symlinked in several other sysfs tree directories, notably under the
-hvcs driver entry, which looks like the following example:
+hvcs driver entry, which looks like the following example::
Pow5:/sys/bus/vio/drivers/hvcs # ls
. .. 30000003 30000004 rescan
@@ -344,7 +347,7 @@ completed or was never executed.
Vty-server entries in this directory are a 32 bit partition unique unit
address that is created by firmware. An example vty-server sysfs entry
-looks like the following:
+looks like the following::
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # ls
. current_vty devspec name partner_vtys
@@ -352,21 +355,21 @@ looks like the following:
Each entry is provided, by default with a "name" attribute. Reading the
"name" attribute will reveal the device type as shown in the following
-example:
+example::
Pow5:/sys/bus/vio/drivers/hvcs/30000003 # cat name
vty-server
Each entry is also provided, by default, with a "devspec" attribute which
reveals the full device specification when read, as shown in the following
-example:
+example::
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat devspec
/vdevice/vty-server@30000004
Each vty-server sysfs dir is provided with two read-only attributes that
provide lists of easily parsed partner vty data: "partner_vtys" and
-"partner_clcs".
+"partner_clcs"::
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat partner_vtys
30000000
@@ -396,7 +399,7 @@ A vty-server can only be connected to a single vty at a time. The entry,
read.
The current_vty can be changed by writing a valid partner clc to the entry
-as in the following example:
+as in the following example::
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # echo U5112.428.10304
8A-V4-C0 > current_vty
@@ -408,9 +411,9 @@ currently open connection is freed.
Information on the "vterm_state" attribute was covered earlier on the
chapter entitled "disconnection".
----------------------------------------------------------------------------
8. Questions & Answers:
-===========================================================================
+=======================
+
Q: What are the security concerns involving hvcs?
A: There are three main security concerns:
@@ -429,6 +432,7 @@ A: There are three main security concerns:
partition) will experience the previously logged in session.
---------------------------------------------------------------------------
+
Q: How do I multiplex a console that I grab through hvcs so that other
people can see it:
@@ -440,6 +444,7 @@ term type "screen" to others. This means that curses based programs may
not display properly in screen sessions.
---------------------------------------------------------------------------
+
Q: Why are the colors all messed up?
Q: Why are the control characters acting strange or not working?
Q: Why is the console output all strange and unintelligible?
@@ -455,6 +460,7 @@ disconnect from the console. This will ensure that the next user gets
their own TERM type set when they login.
---------------------------------------------------------------------------
+
Q: When I try to CONNECT kermit to an hvcs device I get:
"Sorry, can't open connection: /dev/hvcs*"What is happening?
@@ -490,6 +496,7 @@ A: There is not a corresponding vty-server device that maps to an existing
/dev/hvcs* entry.
---------------------------------------------------------------------------
+
Q: When I try to CONNECT kermit to an hvcs device I get:
"Sorry, write access to UUCP lockfile directory denied."
@@ -497,6 +504,7 @@ A: The /dev/hvcs* entry you have specified doesn't exist where you said it
does? Maybe you haven't inserted the module (on systems with udev).
---------------------------------------------------------------------------
+
Q: If I already have one Linux partition installed can I use hvcs on said
partition to provide the console for the install of a second Linux
partition?
@@ -505,6 +513,7 @@ A: Yes granted that your are connected to the /dev/hvcs* device using
kermit or cu or some other program that doesn't provide terminal emulation.
---------------------------------------------------------------------------
+
Q: Can I connect to more than one partition's console at a time using this
driver?
@@ -512,6 +521,7 @@ A: Yes. Of course this means that there must be more than one vty-server
configured for this partition and each must point to a disconnected vty.
---------------------------------------------------------------------------
+
Q: Does the hvcs driver support dynamic (hotplug) addition of devices?
A: Yes, if you have dlpar and hotplug enabled for your system and it has
@@ -519,6 +529,7 @@ been built into the kernel the hvcs drivers is configured to dynamically
handle additions of new devices and removals of unused devices.
---------------------------------------------------------------------------
+
Q: For some reason /dev/hvcs* doesn't map to the same vty-server adapter
after a reboot. What happened?
@@ -533,6 +544,7 @@ on how to determine which vty-server goes with which /dev/hvcs* node.
Hint; look at the sysfs "index" attribute for the vty-server.
---------------------------------------------------------------------------
+
Q: Can I use /dev/hvcs* as a conduit to another partition and use a tty
device on that partition as the other end of the pipe?
@@ -554,7 +566,9 @@ read or write to /dev/hvcs*. Now you have a tty conduit between two
partitions.
---------------------------------------------------------------------------
+
9. Reporting Bugs:
+==================
The proper channel for reporting bugs is either through the Linux OS
distribution company that provided your OS or by posting issues to the
diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
new file mode 100644
index 000000000000..db7b6a880f52
--- /dev/null
+++ b/Documentation/powerpc/index.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+powerpc
+=======
+
+.. toctree::
+ :maxdepth: 1
+
+ bootwrapper
+ cpu_families
+ cpu_features
+ cxl
+ cxlflash
+ dawr-power9
+ dscr
+ eeh-pci-error-recovery
+ elfnote
+ firmware-assisted-dump
+ hvcs
+ isa-versions
+ mpc52xx
+ pci_iov_resource_on_powernv
+ pmu-ebb
+ ptrace
+ qe_firmware
+ syscall64-abi
+ transactional_memory
+ ultravisor
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/powerpc/isa-versions.rst b/Documentation/powerpc/isa-versions.rst
index 812e20cc898c..a363d8c1603c 100644
--- a/Documentation/powerpc/isa-versions.rst
+++ b/Documentation/powerpc/isa-versions.rst
@@ -1,11 +1,12 @@
+==========================
CPU to ISA Version Mapping
==========================
Mapping of some CPU versions to relevant ISA versions.
-========= ====================
+========= ====================================================================
CPU Architecture version
-========= ====================
+========= ====================================================================
Power9 Power ISA v3.0B
Power8 Power ISA v2.07
Power7 Power ISA v2.06
@@ -22,7 +23,7 @@ PPC970 - PowerPC User Instruction Set Architecture Book I v2.01
- PowerPC Virtual Environment Architecture Book II v2.01
- PowerPC Operating Environment Architecture Book III v2.01
- Plus Altivec/VMX ~= 2.03
-========= ====================
+========= ====================================================================
Key Features
@@ -58,9 +59,9 @@ Power5 No
PPC970 No
========== ====
-========== ====================
+========== ====================================
CPU Transactional Memory
-========== ====================
+========== ====================================
Power9 Yes (* see transactional_memory.txt)
Power8 Yes
Power7 No
@@ -71,4 +72,4 @@ Power5++ No
Power5+ No
Power5 No
PPC970 No
-========== ====================
+========== ====================================
diff --git a/Documentation/powerpc/mpc52xx.txt b/Documentation/powerpc/mpc52xx.rst
index 0d540a31ea1a..8676ac63e077 100644
--- a/Documentation/powerpc/mpc52xx.txt
+++ b/Documentation/powerpc/mpc52xx.rst
@@ -1,11 +1,13 @@
+=============================
Linux 2.6.x on MPC52xx family
------------------------------
+=============================
For the latest info, go to http://www.246tNt.com/mpc52xx/
To compile/use :
- - U-Boot:
+ - U-Boot::
+
# <edit Makefile to set ARCH=ppc & CROSS_COMPILE=... ( also EXTRAVERSION
if you wish to ).
# make lite5200_defconfig
@@ -16,7 +18,8 @@ To compile/use :
=> tftpboot 400000 pRamdisk
=> bootm 200000 400000
- - DBug:
+ - DBug::
+
# <edit Makefile to set ARCH=ppc & CROSS_COMPILE=... ( also EXTRAVERSION
if you wish to ).
# make lite5200_defconfig
@@ -28,7 +31,8 @@ To compile/use :
DBug> dn -i zImage.initrd.lite5200
-Some remarks :
+Some remarks:
+
- The port is named mpc52xxx, and config options are PPC_MPC52xx. The MGT5100
is not supported, and I'm not sure anyone is interesting in working on it
so. I didn't took 5xxx because there's apparently a lot of 5xxx that have
diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.rst
index b55c5cd83f8d..f5a5793e1613 100644
--- a/Documentation/powerpc/pci_iov_resource_on_powernv.txt
+++ b/Documentation/powerpc/pci_iov_resource_on_powernv.rst
@@ -1,6 +1,13 @@
+===================================================
+PCI Express I/O Virtualization Resource on Powerenv
+===================================================
+
Wei Yang <weiyang@linux.vnet.ibm.com>
+
Benjamin Herrenschmidt <benh@au1.ibm.com>
+
Bjorn Helgaas <bhelgaas@google.com>
+
26 Aug 2014
This document describes the requirement from hardware for PCI MMIO resource
@@ -10,6 +17,7 @@ Endpoints and the implementation on P8 (IODA2). The next two sections talks
about considerations on enabling SRIOV on IODA2.
1. Introduction to Partitionable Endpoints
+==========================================
A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
@@ -35,6 +43,7 @@ is a completely separate HW entity that replicates the entire logic, so has
its own set of PEs, etc.
2. Implementation of Partitionable Endpoints on P8 (IODA2)
+==========================================================
P8 supports up to 256 Partitionable Endpoints per PHB.
@@ -149,6 +158,7 @@ P8 supports up to 256 Partitionable Endpoints per PHB.
sense, but we haven't done it yet.
3. Considerations for SR-IOV on PowerKVM
+========================================
* SR-IOV Background
@@ -224,7 +234,7 @@ P8 supports up to 256 Partitionable Endpoints per PHB.
IODA supports 256 PEs, so segmented windows contain 256 segments, so if
total_VFs is less than 256, we have the situation in Figure 1.0, where
segments [total_VFs, 255] of the M64 window may map to some MMIO range on
- other devices:
+ other devices::
0 1 total_VFs - 1
+------+------+- -+------+------+
@@ -243,7 +253,7 @@ P8 supports up to 256 Partitionable Endpoints per PHB.
Figure 1.0 Direct map VF(n) BAR space
Our current solution is to allocate 256 segments even if the VF(n) BAR
- space doesn't need that much, as shown in Figure 1.1:
+ space doesn't need that much, as shown in Figure 1.1::
0 1 total_VFs - 1 255
+------+------+- -+------+------+- -+------+------+
@@ -269,6 +279,7 @@ P8 supports up to 256 Partitionable Endpoints per PHB.
responds to segments [total_VFs, 255].
4. Implications for the Generic PCI Code
+========================================
The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.
diff --git a/Documentation/powerpc/pmu-ebb.txt b/Documentation/powerpc/pmu-ebb.rst
index 73cd163dbfb8..4f474758eb55 100644
--- a/Documentation/powerpc/pmu-ebb.txt
+++ b/Documentation/powerpc/pmu-ebb.rst
@@ -1,3 +1,4 @@
+========================
PMU Event Based Branches
========================
diff --git a/Documentation/powerpc/ptrace.rst b/Documentation/powerpc/ptrace.rst
new file mode 100644
index 000000000000..864d4b6dddd1
--- /dev/null
+++ b/Documentation/powerpc/ptrace.rst
@@ -0,0 +1,156 @@
+======
+Ptrace
+======
+
+GDB intends to support the following hardware debug features of BookE
+processors:
+
+4 hardware breakpoints (IAC)
+2 hardware watchpoints (read, write and read-write) (DAC)
+2 value conditions for the hardware watchpoints (DVC)
+
+For that, we need to extend ptrace so that GDB can query and set these
+resources. Since we're extending, we're trying to create an interface
+that's extendable and that covers both BookE and server processors, so
+that GDB doesn't need to special-case each of them. We added the
+following 3 new ptrace requests.
+
+1. PTRACE_PPC_GETHWDEBUGINFO
+============================
+
+Query for GDB to discover the hardware debug features. The main info to
+be returned here is the minimum alignment for the hardware watchpoints.
+BookE processors don't have restrictions here, but server processors have
+an 8-byte alignment restriction for hardware watchpoints. We'd like to avoid
+adding special cases to GDB based on what it sees in AUXV.
+
+Since we're at it, we added other useful info that the kernel can return to
+GDB: this query will return the number of hardware breakpoints, hardware
+watchpoints and whether it supports a range of addresses and a condition.
+The query will fill the following structure provided by the requesting process::
+
+ struct ppc_debug_info {
+ unit32_t version;
+ unit32_t num_instruction_bps;
+ unit32_t num_data_bps;
+ unit32_t num_condition_regs;
+ unit32_t data_bp_alignment;
+ unit32_t sizeof_condition; /* size of the DVC register */
+ uint64_t features; /* bitmask of the individual flags */
+ };
+
+features will have bits indicating whether there is support for::
+
+ #define PPC_DEBUG_FEATURE_INSN_BP_RANGE 0x1
+ #define PPC_DEBUG_FEATURE_INSN_BP_MASK 0x2
+ #define PPC_DEBUG_FEATURE_DATA_BP_RANGE 0x4
+ #define PPC_DEBUG_FEATURE_DATA_BP_MASK 0x8
+ #define PPC_DEBUG_FEATURE_DATA_BP_DAWR 0x10
+
+2. PTRACE_SETHWDEBUG
+
+Sets a hardware breakpoint or watchpoint, according to the provided structure::
+
+ struct ppc_hw_breakpoint {
+ uint32_t version;
+ #define PPC_BREAKPOINT_TRIGGER_EXECUTE 0x1
+ #define PPC_BREAKPOINT_TRIGGER_READ 0x2
+ #define PPC_BREAKPOINT_TRIGGER_WRITE 0x4
+ uint32_t trigger_type; /* only some combinations allowed */
+ #define PPC_BREAKPOINT_MODE_EXACT 0x0
+ #define PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE 0x1
+ #define PPC_BREAKPOINT_MODE_RANGE_EXCLUSIVE 0x2
+ #define PPC_BREAKPOINT_MODE_MASK 0x3
+ uint32_t addr_mode; /* address match mode */
+
+ #define PPC_BREAKPOINT_CONDITION_MODE 0x3
+ #define PPC_BREAKPOINT_CONDITION_NONE 0x0
+ #define PPC_BREAKPOINT_CONDITION_AND 0x1
+ #define PPC_BREAKPOINT_CONDITION_EXACT 0x1 /* different name for the same thing as above */
+ #define PPC_BREAKPOINT_CONDITION_OR 0x2
+ #define PPC_BREAKPOINT_CONDITION_AND_OR 0x3
+ #define PPC_BREAKPOINT_CONDITION_BE_ALL 0x00ff0000 /* byte enable bits */
+ #define PPC_BREAKPOINT_CONDITION_BE(n) (1<<((n)+16))
+ uint32_t condition_mode; /* break/watchpoint condition flags */
+
+ uint64_t addr;
+ uint64_t addr2;
+ uint64_t condition_value;
+ };
+
+A request specifies one event, not necessarily just one register to be set.
+For instance, if the request is for a watchpoint with a condition, both the
+DAC and DVC registers will be set in the same request.
+
+With this GDB can ask for all kinds of hardware breakpoints and watchpoints
+that the BookE supports. COMEFROM breakpoints available in server processors
+are not contemplated, but that is out of the scope of this work.
+
+ptrace will return an integer (handle) uniquely identifying the breakpoint or
+watchpoint just created. This integer will be used in the PTRACE_DELHWDEBUG
+request to ask for its removal. Return -ENOSPC if the requested breakpoint
+can't be allocated on the registers.
+
+Some examples of using the structure to:
+
+- set a breakpoint in the first breakpoint register::
+
+ p.version = PPC_DEBUG_CURRENT_VERSION;
+ p.trigger_type = PPC_BREAKPOINT_TRIGGER_EXECUTE;
+ p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
+ p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
+ p.addr = (uint64_t) address;
+ p.addr2 = 0;
+ p.condition_value = 0;
+
+- set a watchpoint which triggers on reads in the second watchpoint register::
+
+ p.version = PPC_DEBUG_CURRENT_VERSION;
+ p.trigger_type = PPC_BREAKPOINT_TRIGGER_READ;
+ p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
+ p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
+ p.addr = (uint64_t) address;
+ p.addr2 = 0;
+ p.condition_value = 0;
+
+- set a watchpoint which triggers only with a specific value::
+
+ p.version = PPC_DEBUG_CURRENT_VERSION;
+ p.trigger_type = PPC_BREAKPOINT_TRIGGER_READ;
+ p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
+ p.condition_mode = PPC_BREAKPOINT_CONDITION_AND | PPC_BREAKPOINT_CONDITION_BE_ALL;
+ p.addr = (uint64_t) address;
+ p.addr2 = 0;
+ p.condition_value = (uint64_t) condition;
+
+- set a ranged hardware breakpoint::
+
+ p.version = PPC_DEBUG_CURRENT_VERSION;
+ p.trigger_type = PPC_BREAKPOINT_TRIGGER_EXECUTE;
+ p.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
+ p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
+ p.addr = (uint64_t) begin_range;
+ p.addr2 = (uint64_t) end_range;
+ p.condition_value = 0;
+
+- set a watchpoint in server processors (BookS)::
+
+ p.version = 1;
+ p.trigger_type = PPC_BREAKPOINT_TRIGGER_RW;
+ p.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
+ or
+ p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
+
+ p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
+ p.addr = (uint64_t) begin_range;
+ /* For PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE addr2 needs to be specified, where
+ * addr2 - addr <= 8 Bytes.
+ */
+ p.addr2 = (uint64_t) end_range;
+ p.condition_value = 0;
+
+3. PTRACE_DELHWDEBUG
+
+Takes an integer which identifies an existing breakpoint or watchpoint
+(i.e., the value returned from PTRACE_SETHWDEBUG), and deletes the
+corresponding breakpoint or watchpoint..
diff --git a/Documentation/powerpc/ptrace.txt b/Documentation/powerpc/ptrace.txt
deleted file mode 100644
index 99c5ce88d0fe..000000000000
--- a/Documentation/powerpc/ptrace.txt
+++ /dev/null
@@ -1,151 +0,0 @@
-GDB intends to support the following hardware debug features of BookE
-processors:
-
-4 hardware breakpoints (IAC)
-2 hardware watchpoints (read, write and read-write) (DAC)
-2 value conditions for the hardware watchpoints (DVC)
-
-For that, we need to extend ptrace so that GDB can query and set these
-resources. Since we're extending, we're trying to create an interface
-that's extendable and that covers both BookE and server processors, so
-that GDB doesn't need to special-case each of them. We added the
-following 3 new ptrace requests.
-
-1. PTRACE_PPC_GETHWDEBUGINFO
-
-Query for GDB to discover the hardware debug features. The main info to
-be returned here is the minimum alignment for the hardware watchpoints.
-BookE processors don't have restrictions here, but server processors have
-an 8-byte alignment restriction for hardware watchpoints. We'd like to avoid
-adding special cases to GDB based on what it sees in AUXV.
-
-Since we're at it, we added other useful info that the kernel can return to
-GDB: this query will return the number of hardware breakpoints, hardware
-watchpoints and whether it supports a range of addresses and a condition.
-The query will fill the following structure provided by the requesting process:
-
-struct ppc_debug_info {
- unit32_t version;
- unit32_t num_instruction_bps;
- unit32_t num_data_bps;
- unit32_t num_condition_regs;
- unit32_t data_bp_alignment;
- unit32_t sizeof_condition; /* size of the DVC register */
- uint64_t features; /* bitmask of the individual flags */
-};
-
-features will have bits indicating whether there is support for:
-
-#define PPC_DEBUG_FEATURE_INSN_BP_RANGE 0x1
-#define PPC_DEBUG_FEATURE_INSN_BP_MASK 0x2
-#define PPC_DEBUG_FEATURE_DATA_BP_RANGE 0x4
-#define PPC_DEBUG_FEATURE_DATA_BP_MASK 0x8
-#define PPC_DEBUG_FEATURE_DATA_BP_DAWR 0x10
-
-2. PTRACE_SETHWDEBUG
-
-Sets a hardware breakpoint or watchpoint, according to the provided structure:
-
-struct ppc_hw_breakpoint {
- uint32_t version;
-#define PPC_BREAKPOINT_TRIGGER_EXECUTE 0x1
-#define PPC_BREAKPOINT_TRIGGER_READ 0x2
-#define PPC_BREAKPOINT_TRIGGER_WRITE 0x4
- uint32_t trigger_type; /* only some combinations allowed */
-#define PPC_BREAKPOINT_MODE_EXACT 0x0
-#define PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE 0x1
-#define PPC_BREAKPOINT_MODE_RANGE_EXCLUSIVE 0x2
-#define PPC_BREAKPOINT_MODE_MASK 0x3
- uint32_t addr_mode; /* address match mode */
-
-#define PPC_BREAKPOINT_CONDITION_MODE 0x3
-#define PPC_BREAKPOINT_CONDITION_NONE 0x0
-#define PPC_BREAKPOINT_CONDITION_AND 0x1
-#define PPC_BREAKPOINT_CONDITION_EXACT 0x1 /* different name for the same thing as above */
-#define PPC_BREAKPOINT_CONDITION_OR 0x2
-#define PPC_BREAKPOINT_CONDITION_AND_OR 0x3
-#define PPC_BREAKPOINT_CONDITION_BE_ALL 0x00ff0000 /* byte enable bits */
-#define PPC_BREAKPOINT_CONDITION_BE(n) (1<<((n)+16))
- uint32_t condition_mode; /* break/watchpoint condition flags */
-
- uint64_t addr;
- uint64_t addr2;
- uint64_t condition_value;
-};
-
-A request specifies one event, not necessarily just one register to be set.
-For instance, if the request is for a watchpoint with a condition, both the
-DAC and DVC registers will be set in the same request.
-
-With this GDB can ask for all kinds of hardware breakpoints and watchpoints
-that the BookE supports. COMEFROM breakpoints available in server processors
-are not contemplated, but that is out of the scope of this work.
-
-ptrace will return an integer (handle) uniquely identifying the breakpoint or
-watchpoint just created. This integer will be used in the PTRACE_DELHWDEBUG
-request to ask for its removal. Return -ENOSPC if the requested breakpoint
-can't be allocated on the registers.
-
-Some examples of using the structure to:
-
-- set a breakpoint in the first breakpoint register
-
- p.version = PPC_DEBUG_CURRENT_VERSION;
- p.trigger_type = PPC_BREAKPOINT_TRIGGER_EXECUTE;
- p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
- p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
- p.addr = (uint64_t) address;
- p.addr2 = 0;
- p.condition_value = 0;
-
-- set a watchpoint which triggers on reads in the second watchpoint register
-
- p.version = PPC_DEBUG_CURRENT_VERSION;
- p.trigger_type = PPC_BREAKPOINT_TRIGGER_READ;
- p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
- p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
- p.addr = (uint64_t) address;
- p.addr2 = 0;
- p.condition_value = 0;
-
-- set a watchpoint which triggers only with a specific value
-
- p.version = PPC_DEBUG_CURRENT_VERSION;
- p.trigger_type = PPC_BREAKPOINT_TRIGGER_READ;
- p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
- p.condition_mode = PPC_BREAKPOINT_CONDITION_AND | PPC_BREAKPOINT_CONDITION_BE_ALL;
- p.addr = (uint64_t) address;
- p.addr2 = 0;
- p.condition_value = (uint64_t) condition;
-
-- set a ranged hardware breakpoint
-
- p.version = PPC_DEBUG_CURRENT_VERSION;
- p.trigger_type = PPC_BREAKPOINT_TRIGGER_EXECUTE;
- p.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
- p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
- p.addr = (uint64_t) begin_range;
- p.addr2 = (uint64_t) end_range;
- p.condition_value = 0;
-
-- set a watchpoint in server processors (BookS)
-
- p.version = 1;
- p.trigger_type = PPC_BREAKPOINT_TRIGGER_RW;
- p.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
- or
- p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
-
- p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
- p.addr = (uint64_t) begin_range;
- /* For PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE addr2 needs to be specified, where
- * addr2 - addr <= 8 Bytes.
- */
- p.addr2 = (uint64_t) end_range;
- p.condition_value = 0;
-
-3. PTRACE_DELHWDEBUG
-
-Takes an integer which identifies an existing breakpoint or watchpoint
-(i.e., the value returned from PTRACE_SETHWDEBUG), and deletes the
-corresponding breakpoint or watchpoint..
diff --git a/Documentation/powerpc/qe_firmware.txt b/Documentation/powerpc/qe_firmware.rst
index e7ac24aec4ff..42f5103140c9 100644
--- a/Documentation/powerpc/qe_firmware.txt
+++ b/Documentation/powerpc/qe_firmware.rst
@@ -1,23 +1,23 @@
- Freescale QUICC Engine Firmware Uploading
- -----------------------------------------
+=========================================
+Freescale QUICC Engine Firmware Uploading
+=========================================
(c) 2007 Timur Tabi <timur at freescale.com>,
Freescale Semiconductor
-Table of Contents
-=================
+.. Table of Contents
- I - Software License for Firmware
+ I - Software License for Firmware
- II - Microcode Availability
+ II - Microcode Availability
- III - Description and Terminology
+ III - Description and Terminology
- IV - Microcode Programming Details
+ IV - Microcode Programming Details
- V - Firmware Structure Layout
+ V - Firmware Structure Layout
- VI - Sample Code for Creating Firmware Files
+ VI - Sample Code for Creating Firmware Files
Revision Information
====================
@@ -39,7 +39,7 @@ http://opensource.freescale.com. For other firmware files, please contact
your Freescale representative or your operating system vendor.
III - Description and Terminology
-================================
+=================================
In this document, the term 'microcode' refers to the sequence of 32-bit
integers that compose the actual QE microcode.
@@ -89,7 +89,7 @@ being fixed in the RAM package utilizing they should be activated. This data
structure signals the microcode which of these virtual traps is active.
This structure contains 6 words that the application should copy to some
-specific been defined. This table describes the structure.
+specific been defined. This table describes the structure::
---------------------------------------------------------------
| Offset in | | Destination Offset | Size of |
@@ -119,7 +119,7 @@ Extended Modes
This is a double word bit array (64 bits) that defines special functionality
which has an impact on the software drivers. Each bit has its own impact
and has special instructions for the s/w associated with it. This structure is
-described in this table:
+described in this table::
-----------------------------------------------------------------------
| Bit # | Name | Description |
@@ -220,7 +220,8 @@ The 'model' field is a 16-bit number that matches the actual SOC. The
'major' and 'minor' fields are the major and minor revision numbers,
respectively, of the SOC.
-For example, to match the 8323, revision 1.0:
+For example, to match the 8323, revision 1.0::
+
soc.model = 8323
soc.major = 1
soc.minor = 0
@@ -273,10 +274,10 @@ library and available to any driver that calles qe_get_firmware_info().
'reserved'.
After the last microcode is a 32-bit CRC. It can be calculated using
-this algorithm:
+this algorithm::
-u32 crc32(const u8 *p, unsigned int len)
-{
+ u32 crc32(const u8 *p, unsigned int len)
+ {
unsigned int i;
u32 crc = 0;
@@ -286,7 +287,7 @@ u32 crc32(const u8 *p, unsigned int len)
crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
}
return crc;
-}
+ }
VI - Sample Code for Creating Firmware Files
============================================
diff --git a/Documentation/powerpc/syscall64-abi.txt b/Documentation/powerpc/syscall64-abi.rst
index fa716a0d88bd..e49f69f941b9 100644
--- a/Documentation/powerpc/syscall64-abi.txt
+++ b/Documentation/powerpc/syscall64-abi.rst
@@ -5,12 +5,12 @@ Power Architecture 64-bit Linux system call ABI
syscall
=======
-syscall calling sequence[*] matches the Power Architecture 64-bit ELF ABI
+syscall calling sequence\ [1]_ matches the Power Architecture 64-bit ELF ABI
specification C function calling sequence, including register preservation
rules, with the following differences.
-[*] Some syscalls (typically low-level management functions) may have
- different calling sequences (e.g., rt_sigreturn).
+.. [1] Some syscalls (typically low-level management functions) may have
+ different calling sequences (e.g., rt_sigreturn).
Parameters and return value
---------------------------
@@ -33,12 +33,14 @@ Register preservation rules
Register preservation rules match the ELF ABI calling sequence with the
following differences:
-r0: Volatile. (System call number.)
-r3: Volatile. (Parameter 1, and return value.)
-r4-r8: Volatile. (Parameters 2-6.)
-cr0: Volatile (cr0.SO is the return error condition)
-cr1, cr5-7: Nonvolatile.
-lr: Nonvolatile.
+=========== ============= ========================================
+r0 Volatile (System call number.)
+r3 Volatile (Parameter 1, and return value.)
+r4-r8 Volatile (Parameters 2-6.)
+cr0 Volatile (cr0.SO is the return error condition)
+cr1, cr5-7 Nonvolatile
+lr Nonvolatile
+=========== ============= ========================================
All floating point and vector data registers as well as control and status
registers are nonvolatile.
@@ -90,9 +92,12 @@ The vsyscall may or may not use the caller's stack frame save areas.
Register preservation rules
---------------------------
-r0: Volatile.
-cr1, cr5-7: Volatile.
-lr: Volatile.
+
+=========== ========
+r0 Volatile
+cr1, cr5-7 Volatile
+lr Volatile
+=========== ========
Invocation
----------
diff --git a/Documentation/powerpc/transactional_memory.txt b/Documentation/powerpc/transactional_memory.rst
index 52c023e14f26..09955103acb4 100644
--- a/Documentation/powerpc/transactional_memory.txt
+++ b/Documentation/powerpc/transactional_memory.rst
@@ -1,3 +1,4 @@
+============================
Transactional Memory support
============================
@@ -17,29 +18,29 @@ instructions are presented to delimit transactions; transactions are
guaranteed to either complete atomically or roll back and undo any partial
changes.
-A simple transaction looks like this:
+A simple transaction looks like this::
-begin_move_money:
- tbegin
- beq abort_handler
+ begin_move_money:
+ tbegin
+ beq abort_handler
- ld r4, SAVINGS_ACCT(r3)
- ld r5, CURRENT_ACCT(r3)
- subi r5, r5, 1
- addi r4, r4, 1
- std r4, SAVINGS_ACCT(r3)
- std r5, CURRENT_ACCT(r3)
+ ld r4, SAVINGS_ACCT(r3)
+ ld r5, CURRENT_ACCT(r3)
+ subi r5, r5, 1
+ addi r4, r4, 1
+ std r4, SAVINGS_ACCT(r3)
+ std r5, CURRENT_ACCT(r3)
- tend
+ tend
- b continue
+ b continue
-abort_handler:
- ... test for odd failures ...
+ abort_handler:
+ ... test for odd failures ...
- /* Retry the transaction if it failed because it conflicted with
- * someone else: */
- b begin_move_money
+ /* Retry the transaction if it failed because it conflicted with
+ * someone else: */
+ b begin_move_money
The 'tbegin' instruction denotes the start point, and 'tend' the end point.
@@ -123,7 +124,7 @@ Transaction-aware signal handlers can read the transactional register state
from the second ucontext. This will be necessary for crash handlers to
determine, for example, the address of the instruction causing the SIGSEGV.
-Example signal handler:
+Example signal handler::
void crash_handler(int sig, siginfo_t *si, void *uc)
{
@@ -133,9 +134,9 @@ Example signal handler:
if (ucp_link) {
u64 msr = ucp->uc_mcontext.regs->msr;
/* May have transactional ucontext! */
-#ifndef __powerpc64__
+ #ifndef __powerpc64__
msr |= ((u64)transactional_ucp->uc_mcontext.regs->msr) << 32;
-#endif
+ #endif
if (MSR_TM_ACTIVE(msr)) {
/* Yes, we crashed during a transaction. Oops. */
fprintf(stderr, "Transaction to be restarted at 0x%llx, but "
@@ -176,6 +177,7 @@ Failure cause codes used by kernel
These are defined in <asm/reg.h>, and distinguish different reasons why the
kernel aborted a transaction:
+ ====================== ================================
TM_CAUSE_RESCHED Thread was rescheduled.
TM_CAUSE_TLBI Software TLB invalid.
TM_CAUSE_FAC_UNAV FP/VEC/VSX unavailable trap.
@@ -184,6 +186,7 @@ kernel aborted a transaction:
TM_CAUSE_MISC Currently unused.
TM_CAUSE_ALIGNMENT Alignment fault.
TM_CAUSE_EMULATE Emulation that touched memory.
+ ====================== ================================
These can be checked by the user program's abort handler as TEXASR[0:7]. If
bit 7 is set, it indicates that the error is consider persistent. For example
@@ -203,7 +206,7 @@ POWER9
======
TM on POWER9 has issues with storing the complete register state. This
-is described in this commit:
+is described in this commit::
commit 4bb3c7a0208fc13ca70598efd109901a7cd45ae7
Author: Paul Mackerras <paulus@ozlabs.org>
diff --git a/Documentation/powerpc/ultravisor.rst b/Documentation/powerpc/ultravisor.rst
new file mode 100644
index 000000000000..730854f73830
--- /dev/null
+++ b/Documentation/powerpc/ultravisor.rst
@@ -0,0 +1,1054 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _ultravisor:
+
+============================
+Protected Execution Facility
+============================
+
+.. contents::
+ :depth: 3
+
+Protected Execution Facility
+############################
+
+ Protected Execution Facility (PEF) is an architectural change for
+ POWER 9 that enables Secure Virtual Machines (SVMs). DD2.3 chips
+ (PVR=0x004e1203) or greater will be PEF-capable. A new ISA release
+ will include the PEF RFC02487 changes.
+
+ When enabled, PEF adds a new higher privileged mode, called Ultravisor
+ mode, to POWER architecture. Along with the new mode there is new
+ firmware called the Protected Execution Ultravisor (or Ultravisor
+ for short). Ultravisor mode is the highest privileged mode in POWER
+ architecture.
+
+ +------------------+
+ | Privilege States |
+ +==================+
+ | Problem |
+ +------------------+
+ | Supervisor |
+ +------------------+
+ | Hypervisor |
+ +------------------+
+ | Ultravisor |
+ +------------------+
+
+ PEF protects SVMs from the hypervisor, privileged users, and other
+ VMs in the system. SVMs are protected while at rest and can only be
+ executed by an authorized machine. All virtual machines utilize
+ hypervisor services. The Ultravisor filters calls between the SVMs
+ and the hypervisor to assure that information does not accidentally
+ leak. All hypercalls except H_RANDOM are reflected to the hypervisor.
+ H_RANDOM is not reflected to prevent the hypervisor from influencing
+ random values in the SVM.
+
+ To support this there is a refactoring of the ownership of resources
+ in the CPU. Some of the resources which were previously hypervisor
+ privileged are now ultravisor privileged.
+
+Hardware
+========
+
+ The hardware changes include the following:
+
+ * There is a new bit in the MSR that determines whether the current
+ process is running in secure mode, MSR(S) bit 41. MSR(S)=1, process
+ is in secure mode, MSR(s)=0 process is in normal mode.
+
+ * The MSR(S) bit can only be set by the Ultravisor.
+
+ * HRFID cannot be used to set the MSR(S) bit. If the hypervisor needs
+ to return to a SVM it must use an ultracall. It can determine if
+ the VM it is returning to is secure.
+
+ * There is a new Ultravisor privileged register, SMFCTRL, which has an
+ enable/disable bit SMFCTRL(E).
+
+ * The privilege of a process is now determined by three MSR bits,
+ MSR(S, HV, PR). In each of the tables below the modes are listed
+ from least privilege to highest privilege. The higher privilege
+ modes can access all the resources of the lower privilege modes.
+
+ **Secure Mode MSR Settings**
+
+ +---+---+---+---------------+
+ | S | HV| PR|Privilege |
+ +===+===+===+===============+
+ | 1 | 0 | 1 | Problem |
+ +---+---+---+---------------+
+ | 1 | 0 | 0 | Privileged(OS)|
+ +---+---+---+---------------+
+ | 1 | 1 | 0 | Ultravisor |
+ +---+---+---+---------------+
+ | 1 | 1 | 1 | Reserved |
+ +---+---+---+---------------+
+
+ **Normal Mode MSR Settings**
+
+ +---+---+---+---------------+
+ | S | HV| PR|Privilege |
+ +===+===+===+===============+
+ | 0 | 0 | 1 | Problem |
+ +---+---+---+---------------+
+ | 0 | 0 | 0 | Privileged(OS)|
+ +---+---+---+---------------+
+ | 0 | 1 | 0 | Hypervisor |
+ +---+---+---+---------------+
+ | 0 | 1 | 1 | Problem (Host)|
+ +---+---+---+---------------+
+
+ * Memory is partitioned into secure and normal memory. Only processes
+ that are running in secure mode can access secure memory.
+
+ * The hardware does not allow anything that is not running secure to
+ access secure memory. This means that the Hypervisor cannot access
+ the memory of the SVM without using an ultracall (asking the
+ Ultravisor). The Ultravisor will only allow the hypervisor to see
+ the SVM memory encrypted.
+
+ * I/O systems are not allowed to directly address secure memory. This
+ limits the SVMs to virtual I/O only.
+
+ * The architecture allows the SVM to share pages of memory with the
+ hypervisor that are not protected with encryption. However, this
+ sharing must be initiated by the SVM.
+
+ * When a process is running in secure mode all hypercalls
+ (syscall lev=1) go to the Ultravisor.
+
+ * When a process is in secure mode all interrupts go to the
+ Ultravisor.
+
+ * The following resources have become Ultravisor privileged and
+ require an Ultravisor interface to manipulate:
+
+ * Processor configurations registers (SCOMs).
+
+ * Stop state information.
+
+ * The debug registers CIABR, DAWR, and DAWRX when SMFCTRL(D) is set.
+ If SMFCTRL(D) is not set they do not work in secure mode. When set,
+ reading and writing requires an Ultravisor call, otherwise that
+ will cause a Hypervisor Emulation Assistance interrupt.
+
+ * PTCR and partition table entries (partition table is in secure
+ memory). An attempt to write to PTCR will cause a Hypervisor
+ Emulation Assitance interrupt.
+
+ * LDBAR (LD Base Address Register) and IMC (In-Memory Collection)
+ non-architected registers. An attempt to write to them will cause a
+ Hypervisor Emulation Assistance interrupt.
+
+ * Paging for an SVM, sharing of memory with Hypervisor for an SVM.
+ (Including Virtual Processor Area (VPA) and virtual I/O).
+
+
+Software/Microcode
+==================
+
+ The software changes include:
+
+ * SVMs are created from normal VM using (open source) tooling supplied
+ by IBM.
+
+ * All SVMs start as normal VMs and utilize an ultracall, UV_ESM
+ (Enter Secure Mode), to make the transition.
+
+ * When the UV_ESM ultracall is made the Ultravisor copies the VM into
+ secure memory, decrypts the verification information, and checks the
+ integrity of the SVM. If the integrity check passes the Ultravisor
+ passes control in secure mode.
+
+ * The verification information includes the pass phrase for the
+ encrypted disk associated with the SVM. This pass phrase is given
+ to the SVM when requested.
+
+ * The Ultravisor is not involved in protecting the encrypted disk of
+ the SVM while at rest.
+
+ * For external interrupts the Ultravisor saves the state of the SVM,
+ and reflects the interrupt to the hypervisor for processing.
+ For hypercalls, the Ultravisor inserts neutral state into all
+ registers not needed for the hypercall then reflects the call to
+ the hypervisor for processing. The H_RANDOM hypercall is performed
+ by the Ultravisor and not reflected.
+
+ * For virtual I/O to work bounce buffering must be done.
+
+ * The Ultravisor uses AES (IAPM) for protection of SVM memory. IAPM
+ is a mode of AES that provides integrity and secrecy concurrently.
+
+ * The movement of data between normal and secure pages is coordinated
+ with the Ultravisor by a new HMM plug-in in the Hypervisor.
+
+ The Ultravisor offers new services to the hypervisor and SVMs. These
+ are accessed through ultracalls.
+
+Terminology
+===========
+
+ * Hypercalls: special system calls used to request services from
+ Hypervisor.
+
+ * Normal memory: Memory that is accessible to Hypervisor.
+
+ * Normal page: Page backed by normal memory and available to
+ Hypervisor.
+
+ * Shared page: A page backed by normal memory and available to both
+ the Hypervisor/QEMU and the SVM (i.e page has mappings in SVM and
+ Hypervisor/QEMU).
+
+ * Secure memory: Memory that is accessible only to Ultravisor and
+ SVMs.
+
+ * Secure page: Page backed by secure memory and only available to
+ Ultravisor and SVM.
+
+ * SVM: Secure Virtual Machine.
+
+ * Ultracalls: special system calls used to request services from
+ Ultravisor.
+
+
+Ultravisor calls API
+####################
+
+ This section describes Ultravisor calls (ultracalls) needed to
+ support Secure Virtual Machines (SVM)s and Paravirtualized KVM. The
+ ultracalls allow the SVMs and Hypervisor to request services from the
+ Ultravisor such as accessing a register or memory region that can only
+ be accessed when running in Ultravisor-privileged mode.
+
+ The specific service needed from an ultracall is specified in register
+ R3 (the first parameter to the ultracall). Other parameters to the
+ ultracall, if any, are specified in registers R4 through R12.
+
+ Return value of all ultracalls is in register R3. Other output values
+ from the ultracall, if any, are returned in registers R4 through R12.
+ The only exception to this register usage is the ``UV_RETURN``
+ ultracall described below.
+
+ Each ultracall returns specific error codes, applicable in the context
+ of the ultracall. However, like with the PowerPC Architecture Platform
+ Reference (PAPR), if no specific error code is defined for a
+ particular situation, then the ultracall will fallback to an erroneous
+ parameter-position based code. i.e U_PARAMETER, U_P2, U_P3 etc
+ depending on the ultracall parameter that may have caused the error.
+
+ Some ultracalls involve transferring a page of data between Ultravisor
+ and Hypervisor. Secure pages that are transferred from secure memory
+ to normal memory may be encrypted using dynamically generated keys.
+ When the secure pages are transferred back to secure memory, they may
+ be decrypted using the same dynamically generated keys. Generation and
+ management of these keys will be covered in a separate document.
+
+ For now this only covers ultracalls currently implemented and being
+ used by Hypervisor and SVMs but others can be added here when it
+ makes sense.
+
+ The full specification for all hypercalls/ultracalls will eventually
+ be made available in the public/OpenPower version of the PAPR
+ specification.
+
+ .. note::
+
+ If PEF is not enabled, the ultracalls will be redirected to the
+ Hypervisor which must handle/fail the calls.
+
+Ultracalls used by Hypervisor
+=============================
+
+ This section describes the virtual memory management ultracalls used
+ by the Hypervisor to manage SVMs.
+
+UV_PAGE_OUT
+-----------
+
+ Encrypt and move the contents of a page from secure memory to normal
+ memory.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_PAGE_OUT,
+ uint16_t lpid, /* LPAR ID */
+ uint64_t dest_ra, /* real address of destination page */
+ uint64_t src_gpa, /* source guest-physical-address */
+ uint8_t flags, /* flags */
+ uint64_t order) /* page size order */
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * U_SUCCESS on success.
+ * U_PARAMETER if ``lpid`` is invalid.
+ * U_P2 if ``dest_ra`` is invalid.
+ * U_P3 if the ``src_gpa`` address is invalid.
+ * U_P4 if any bit in the ``flags`` is unrecognized
+ * U_P5 if the ``order`` parameter is unsupported.
+ * U_FUNCTION if functionality is not supported.
+ * U_BUSY if page cannot be currently paged-out.
+
+Description
+~~~~~~~~~~~
+
+ Encrypt the contents of a secure-page and make it available to
+ Hypervisor in a normal page.
+
+ By default, the source page is unmapped from the SVM's partition-
+ scoped page table. But the Hypervisor can provide a hint to the
+ Ultravisor to retain the page mapping by setting the ``UV_SNAPSHOT``
+ flag in ``flags`` parameter.
+
+ If the source page is already a shared page the call returns
+ U_SUCCESS, without doing anything.
+
+Use cases
+~~~~~~~~~
+
+ #. QEMU attempts to access an address belonging to the SVM but the
+ page frame for that address is not mapped into QEMU's address
+ space. In this case, the Hypervisor will allocate a page frame,
+ map it into QEMU's address space and issue the ``UV_PAGE_OUT``
+ call to retrieve the encrypted contents of the page.
+
+ #. When Ultravisor runs low on secure memory and it needs to page-out
+ an LRU page. In this case, Ultravisor will issue the
+ ``H_SVM_PAGE_OUT`` hypercall to the Hypervisor. The Hypervisor will
+ then allocate a normal page and issue the ``UV_PAGE_OUT`` ultracall
+ and the Ultravisor will encrypt and move the contents of the secure
+ page into the normal page.
+
+ #. When Hypervisor accesses SVM data, the Hypervisor requests the
+ Ultravisor to transfer the corresponding page into a insecure page,
+ which the Hypervisor can access. The data in the normal page will
+ be encrypted though.
+
+UV_PAGE_IN
+----------
+
+ Move the contents of a page from normal memory to secure memory.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_PAGE_IN,
+ uint16_t lpid, /* the LPAR ID */
+ uint64_t src_ra, /* source real address of page */
+ uint64_t dest_gpa, /* destination guest physical address */
+ uint64_t flags, /* flags */
+ uint64_t order) /* page size order */
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * U_SUCCESS on success.
+ * U_BUSY if page cannot be currently paged-in.
+ * U_FUNCTION if functionality is not supported
+ * U_PARAMETER if ``lpid`` is invalid.
+ * U_P2 if ``src_ra`` is invalid.
+ * U_P3 if the ``dest_gpa`` address is invalid.
+ * U_P4 if any bit in the ``flags`` is unrecognized
+ * U_P5 if the ``order`` parameter is unsupported.
+
+Description
+~~~~~~~~~~~
+
+ Move the contents of the page identified by ``src_ra`` from normal
+ memory to secure memory and map it to the guest physical address
+ ``dest_gpa``.
+
+ If `dest_gpa` refers to a shared address, map the page into the
+ partition-scoped page-table of the SVM. If `dest_gpa` is not shared,
+ copy the contents of the page into the corresponding secure page.
+ Depending on the context, decrypt the page before being copied.
+
+ The caller provides the attributes of the page through the ``flags``
+ parameter. Valid values for ``flags`` are:
+
+ * CACHE_INHIBITED
+ * CACHE_ENABLED
+ * WRITE_PROTECTION
+
+ The Hypervisor must pin the page in memory before making
+ ``UV_PAGE_IN`` ultracall.
+
+Use cases
+~~~~~~~~~
+
+ #. When a normal VM switches to secure mode, all its pages residing
+ in normal memory, are moved into secure memory.
+
+ #. When an SVM requests to share a page with Hypervisor the Hypervisor
+ allocates a page and informs the Ultravisor.
+
+ #. When an SVM accesses a secure page that has been paged-out,
+ Ultravisor invokes the Hypervisor to locate the page. After
+ locating the page, the Hypervisor uses UV_PAGE_IN to make the
+ page available to Ultravisor.
+
+UV_PAGE_INVAL
+-------------
+
+ Invalidate the Ultravisor mapping of a page.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_PAGE_INVAL,
+ uint16_t lpid, /* the LPAR ID */
+ uint64_t guest_pa, /* destination guest-physical-address */
+ uint64_t order) /* page size order */
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * U_SUCCESS on success.
+ * U_PARAMETER if ``lpid`` is invalid.
+ * U_P2 if ``guest_pa`` is invalid (or corresponds to a secure
+ page mapping).
+ * U_P3 if the ``order`` is invalid.
+ * U_FUNCTION if functionality is not supported.
+ * U_BUSY if page cannot be currently invalidated.
+
+Description
+~~~~~~~~~~~
+
+ This ultracall informs Ultravisor that the page mapping in Hypervisor
+ corresponding to the given guest physical address has been invalidated
+ and that the Ultravisor should not access the page. If the specified
+ ``guest_pa`` corresponds to a secure page, Ultravisor will ignore the
+ attempt to invalidate the page and return U_P2.
+
+Use cases
+~~~~~~~~~
+
+ #. When a shared page is unmapped from the QEMU's page table, possibly
+ because it is paged-out to disk, Ultravisor needs to know that the
+ page should not be accessed from its side too.
+
+
+UV_WRITE_PATE
+-------------
+
+ Validate and write the partition table entry (PATE) for a given
+ partition.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_WRITE_PATE,
+ uint32_t lpid, /* the LPAR ID */
+ uint64_t dw0 /* the first double word to write */
+ uint64_t dw1) /* the second double word to write */
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * U_SUCCESS on success.
+ * U_BUSY if PATE cannot be currently written to.
+ * U_FUNCTION if functionality is not supported.
+ * U_PARAMETER if ``lpid`` is invalid.
+ * U_P2 if ``dw0`` is invalid.
+ * U_P3 if the ``dw1`` address is invalid.
+ * U_PERMISSION if the Hypervisor is attempting to change the PATE
+ of a secure virtual machine or if called from a
+ context other than Hypervisor.
+
+Description
+~~~~~~~~~~~
+
+ Validate and write a LPID and its partition-table-entry for the given
+ LPID. If the LPID is already allocated and initialized, this call
+ results in changing the partition table entry.
+
+Use cases
+~~~~~~~~~
+
+ #. The Partition table resides in Secure memory and its entries,
+ called PATE (Partition Table Entries), point to the partition-
+ scoped page tables for the Hypervisor as well as each of the
+ virtual machines (both secure and normal). The Hypervisor
+ operates in partition 0 and its partition-scoped page tables
+ reside in normal memory.
+
+ #. This ultracall allows the Hypervisor to register the partition-
+ scoped and process-scoped page table entries for the Hypervisor
+ and other partitions (virtual machines) with the Ultravisor.
+
+ #. If the value of the PATE for an existing partition (VM) changes,
+ the TLB cache for the partition is flushed.
+
+ #. The Hypervisor is responsible for allocating LPID. The LPID and
+ its PATE entry are registered together. The Hypervisor manages
+ the PATE entries for a normal VM and can change the PATE entry
+ anytime. Ultravisor manages the PATE entries for an SVM and
+ Hypervisor is not allowed to modify them.
+
+UV_RETURN
+---------
+
+ Return control from the Hypervisor back to the Ultravisor after
+ processing an hypercall or interrupt that was forwarded (aka
+ *reflected*) to the Hypervisor.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_RETURN)
+
+Return values
+~~~~~~~~~~~~~
+
+ This call never returns to Hypervisor on success. It returns
+ U_INVALID if ultracall is not made from a Hypervisor context.
+
+Description
+~~~~~~~~~~~
+
+ When an SVM makes an hypercall or incurs some other exception, the
+ Ultravisor usually forwards (aka *reflects*) the exceptions to the
+ Hypervisor. After processing the exception, Hypervisor uses the
+ ``UV_RETURN`` ultracall to return control back to the SVM.
+
+ The expected register state on entry to this ultracall is:
+
+ * Non-volatile registers are restored to their original values.
+ * If returning from an hypercall, register R0 contains the return
+ value (**unlike other ultracalls**) and, registers R4 through R12
+ contain any output values of the hypercall.
+ * R3 contains the ultracall number, i.e UV_RETURN.
+ * If returning with a synthesized interrupt, R2 contains the
+ synthesized interrupt number.
+
+Use cases
+~~~~~~~~~
+
+ #. Ultravisor relies on the Hypervisor to provide several services to
+ the SVM such as processing hypercall and other exceptions. After
+ processing the exception, Hypervisor uses UV_RETURN to return
+ control back to the Ultravisor.
+
+ #. Hypervisor has to use this ultracall to return control to the SVM.
+
+
+UV_REGISTER_MEM_SLOT
+--------------------
+
+ Register an SVM address-range with specified properties.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_REGISTER_MEM_SLOT,
+ uint64_t lpid, /* LPAR ID of the SVM */
+ uint64_t start_gpa, /* start guest physical address */
+ uint64_t size, /* size of address range in bytes */
+ uint64_t flags /* reserved for future expansion */
+ uint16_t slotid) /* slot identifier */
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * U_SUCCESS on success.
+ * U_PARAMETER if ``lpid`` is invalid.
+ * U_P2 if ``start_gpa`` is invalid.
+ * U_P3 if ``size`` is invalid.
+ * U_P4 if any bit in the ``flags`` is unrecognized.
+ * U_P5 if the ``slotid`` parameter is unsupported.
+ * U_PERMISSION if called from context other than Hypervisor.
+ * U_FUNCTION if functionality is not supported.
+
+
+Description
+~~~~~~~~~~~
+
+ Register a memory range for an SVM. The memory range starts at the
+ guest physical address ``start_gpa`` and is ``size`` bytes long.
+
+Use cases
+~~~~~~~~~
+
+
+ #. When a virtual machine goes secure, all the memory slots managed by
+ the Hypervisor move into secure memory. The Hypervisor iterates
+ through each of memory slots, and registers the slot with
+ Ultravisor. Hypervisor may discard some slots such as those used
+ for firmware (SLOF).
+
+ #. When new memory is hot-plugged, a new memory slot gets registered.
+
+
+UV_UNREGISTER_MEM_SLOT
+----------------------
+
+ Unregister an SVM address-range that was previously registered using
+ UV_REGISTER_MEM_SLOT.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_UNREGISTER_MEM_SLOT,
+ uint64_t lpid, /* LPAR ID of the SVM */
+ uint64_t slotid) /* reservation slotid */
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * U_SUCCESS on success.
+ * U_FUNCTION if functionality is not supported.
+ * U_PARAMETER if ``lpid`` is invalid.
+ * U_P2 if ``slotid`` is invalid.
+ * U_PERMISSION if called from context other than Hypervisor.
+
+Description
+~~~~~~~~~~~
+
+ Release the memory slot identified by ``slotid`` and free any
+ resources allocated towards the reservation.
+
+Use cases
+~~~~~~~~~
+
+ #. Memory hot-remove.
+
+
+UV_SVM_TERMINATE
+----------------
+
+ Terminate an SVM and release its resources.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_SVM_TERMINATE,
+ uint64_t lpid, /* LPAR ID of the SVM */)
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * U_SUCCESS on success.
+ * U_FUNCTION if functionality is not supported.
+ * U_PARAMETER if ``lpid`` is invalid.
+ * U_INVALID if VM is not secure.
+ * U_PERMISSION if not called from a Hypervisor context.
+
+Description
+~~~~~~~~~~~
+
+ Terminate an SVM and release all its resources.
+
+Use cases
+~~~~~~~~~
+
+ #. Called by Hypervisor when terminating an SVM.
+
+
+Ultracalls used by SVM
+======================
+
+UV_SHARE_PAGE
+-------------
+
+ Share a set of guest physical pages with the Hypervisor.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_SHARE_PAGE,
+ uint64_t gfn, /* guest page frame number */
+ uint64_t num) /* number of pages of size PAGE_SIZE */
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * U_SUCCESS on success.
+ * U_FUNCTION if functionality is not supported.
+ * U_INVALID if the VM is not secure.
+ * U_PARAMETER if ``gfn`` is invalid.
+ * U_P2 if ``num`` is invalid.
+
+Description
+~~~~~~~~~~~
+
+ Share the ``num`` pages starting at guest physical frame number ``gfn``
+ with the Hypervisor. Assume page size is PAGE_SIZE bytes. Zero the
+ pages before returning.
+
+ If the address is already backed by a secure page, unmap the page and
+ back it with an insecure page, with the help of the Hypervisor. If it
+ is not backed by any page yet, mark the PTE as insecure and back it
+ with an insecure page when the address is accessed. If it is already
+ backed by an insecure page, zero the page and return.
+
+Use cases
+~~~~~~~~~
+
+ #. The Hypervisor cannot access the SVM pages since they are backed by
+ secure pages. Hence an SVM must explicitly request Ultravisor for
+ pages it can share with Hypervisor.
+
+ #. Shared pages are needed to support virtio and Virtual Processor Area
+ (VPA) in SVMs.
+
+
+UV_UNSHARE_PAGE
+---------------
+
+ Restore a shared SVM page to its initial state.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_UNSHARE_PAGE,
+ uint64_t gfn, /* guest page frame number */
+ uint73 num) /* number of pages of size PAGE_SIZE*/
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * U_SUCCESS on success.
+ * U_FUNCTION if functionality is not supported.
+ * U_INVALID if VM is not secure.
+ * U_PARAMETER if ``gfn`` is invalid.
+ * U_P2 if ``num`` is invalid.
+
+Description
+~~~~~~~~~~~
+
+ Stop sharing ``num`` pages starting at ``gfn`` with the Hypervisor.
+ Assume that the page size is PAGE_SIZE. Zero the pages before
+ returning.
+
+ If the address is already backed by an insecure page, unmap the page
+ and back it with a secure page. Inform the Hypervisor to release
+ reference to its shared page. If the address is not backed by a page
+ yet, mark the PTE as secure and back it with a secure page when that
+ address is accessed. If it is already backed by an secure page zero
+ the page and return.
+
+Use cases
+~~~~~~~~~
+
+ #. The SVM may decide to unshare a page from the Hypervisor.
+
+
+UV_UNSHARE_ALL_PAGES
+--------------------
+
+ Unshare all pages the SVM has shared with Hypervisor.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_UNSHARE_ALL_PAGES)
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * U_SUCCESS on success.
+ * U_FUNCTION if functionality is not supported.
+ * U_INVAL if VM is not secure.
+
+Description
+~~~~~~~~~~~
+
+ Unshare all shared pages from the Hypervisor. All unshared pages are
+ zeroed on return. Only pages explicitly shared by the SVM with the
+ Hypervisor (using UV_SHARE_PAGE ultracall) are unshared. Ultravisor
+ may internally share some pages with the Hypervisor without explicit
+ request from the SVM. These pages will not be unshared by this
+ ultracall.
+
+Use cases
+~~~~~~~~~
+
+ #. This call is needed when ``kexec`` is used to boot a different
+ kernel. It may also be needed during SVM reset.
+
+UV_ESM
+------
+
+ Secure the virtual machine (*enter secure mode*).
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t ultracall(const uint64_t UV_ESM,
+ uint64_t esm_blob_addr, /* location of the ESM blob */
+ unint64_t fdt) /* Flattened device tree */
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * U_SUCCESS on success (including if VM is already secure).
+ * U_FUNCTION if functionality is not supported.
+ * U_INVALID if VM is not secure.
+ * U_PARAMETER if ``esm_blob_addr`` is invalid.
+ * U_P2 if ``fdt`` is invalid.
+ * U_PERMISSION if any integrity checks fail.
+ * U_RETRY insufficient memory to create SVM.
+ * U_NO_KEY symmetric key unavailable.
+
+Description
+~~~~~~~~~~~
+
+ Secure the virtual machine. On successful completion, return
+ control to the virtual machine at the address specified in the
+ ESM blob.
+
+Use cases
+~~~~~~~~~
+
+ #. A normal virtual machine can choose to switch to a secure mode.
+
+Hypervisor Calls API
+####################
+
+ This document describes the Hypervisor calls (hypercalls) that are
+ needed to support the Ultravisor. Hypercalls are services provided by
+ the Hypervisor to virtual machines and Ultravisor.
+
+ Register usage for these hypercalls is identical to that of the other
+ hypercalls defined in the Power Architecture Platform Reference (PAPR)
+ document. i.e on input, register R3 identifies the specific service
+ that is being requested and registers R4 through R11 contain
+ additional parameters to the hypercall, if any. On output, register
+ R3 contains the return value and registers R4 through R9 contain any
+ other output values from the hypercall.
+
+ This document only covers hypercalls currently implemented/planned
+ for Ultravisor usage but others can be added here when it makes sense.
+
+ The full specification for all hypercalls/ultracalls will eventually
+ be made available in the public/OpenPower version of the PAPR
+ specification.
+
+Hypervisor calls to support Ultravisor
+======================================
+
+ Following are the set of hypercalls needed to support Ultravisor.
+
+H_SVM_INIT_START
+----------------
+
+ Begin the process of converting a normal virtual machine into an SVM.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t hypercall(const uint64_t H_SVM_INIT_START)
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * H_SUCCESS on success.
+
+Description
+~~~~~~~~~~~
+
+ Initiate the process of securing a virtual machine. This involves
+ coordinating with the Ultravisor, using ultracalls, to allocate
+ resources in the Ultravisor for the new SVM, transferring the VM's
+ pages from normal to secure memory etc. When the process is
+ completed, Ultravisor issues the H_SVM_INIT_DONE hypercall.
+
+Use cases
+~~~~~~~~~
+
+ #. Ultravisor uses this hypercall to inform Hypervisor that a VM
+ has initiated the process of switching to secure mode.
+
+
+H_SVM_INIT_DONE
+---------------
+
+ Complete the process of securing an SVM.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t hypercall(const uint64_t H_SVM_INIT_DONE)
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * H_SUCCESS on success.
+ * H_UNSUPPORTED if called from the wrong context (e.g.
+ from an SVM or before an H_SVM_INIT_START
+ hypercall).
+
+Description
+~~~~~~~~~~~
+
+ Complete the process of securing a virtual machine. This call must
+ be made after a prior call to ``H_SVM_INIT_START`` hypercall.
+
+Use cases
+~~~~~~~~~
+
+ On successfully securing a virtual machine, the Ultravisor informs
+ Hypervisor about it. Hypervisor can use this call to finish setting
+ up its internal state for this virtual machine.
+
+
+H_SVM_PAGE_IN
+-------------
+
+ Move the contents of a page from normal memory to secure memory.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t hypercall(const uint64_t H_SVM_PAGE_IN,
+ uint64_t guest_pa, /* guest-physical-address */
+ uint64_t flags, /* flags */
+ uint64_t order) /* page size order */
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * H_SUCCESS on success.
+ * H_PARAMETER if ``guest_pa`` is invalid.
+ * H_P2 if ``flags`` is invalid.
+ * H_P3 if ``order`` of page is invalid.
+
+Description
+~~~~~~~~~~~
+
+ Retrieve the content of the page, belonging to the VM at the specified
+ guest physical address.
+
+ Only valid value(s) in ``flags`` are:
+
+ * H_PAGE_IN_SHARED which indicates that the page is to be shared
+ with the Ultravisor.
+
+ * H_PAGE_IN_NONSHARED indicates that the UV is not anymore
+ interested in the page. Applicable if the page is a shared page.
+
+ The ``order`` parameter must correspond to the configured page size.
+
+Use cases
+~~~~~~~~~
+
+ #. When a normal VM becomes a secure VM (using the UV_ESM ultracall),
+ the Ultravisor uses this hypercall to move contents of each page of
+ the VM from normal memory to secure memory.
+
+ #. Ultravisor uses this hypercall to ask Hypervisor to provide a page
+ in normal memory that can be shared between the SVM and Hypervisor.
+
+ #. Ultravisor uses this hypercall to page-in a paged-out page. This
+ can happen when the SVM touches a paged-out page.
+
+ #. If SVM wants to disable sharing of pages with Hypervisor, it can
+ inform Ultravisor to do so. Ultravisor will then use this hypercall
+ and inform Hypervisor that it has released access to the normal
+ page.
+
+H_SVM_PAGE_OUT
+---------------
+
+ Move the contents of the page to normal memory.
+
+Syntax
+~~~~~~
+
+.. code-block:: c
+
+ uint64_t hypercall(const uint64_t H_SVM_PAGE_OUT,
+ uint64_t guest_pa, /* guest-physical-address */
+ uint64_t flags, /* flags (currently none) */
+ uint64_t order) /* page size order */
+
+Return values
+~~~~~~~~~~~~~
+
+ One of the following values:
+
+ * H_SUCCESS on success.
+ * H_PARAMETER if ``guest_pa`` is invalid.
+ * H_P2 if ``flags`` is invalid.
+ * H_P3 if ``order`` is invalid.
+
+Description
+~~~~~~~~~~~
+
+ Move the contents of the page identified by ``guest_pa`` to normal
+ memory.
+
+ Currently ``flags`` is unused and must be set to 0. The ``order``
+ parameter must correspond to the configured page size.
+
+Use cases
+~~~~~~~~~
+
+ #. If Ultravisor is running low on secure pages, it can move the
+ contents of some secure pages, into normal pages using this
+ hypercall. The content will be encrypted.
+
+References
+##########
+
+- `Supporting Protected Computing on IBM Power Architecture <https://developer.ibm.com/articles/l-support-protected-computing/>`_
diff --git a/Documentation/powerpc/vcpudispatch_stats.txt b/Documentation/powerpc/vcpudispatch_stats.txt
new file mode 100644
index 000000000000..e21476bfd78c
--- /dev/null
+++ b/Documentation/powerpc/vcpudispatch_stats.txt
@@ -0,0 +1,68 @@
+VCPU Dispatch Statistics:
+=========================
+
+For Shared Processor LPARs, the POWER Hypervisor maintains a relatively
+static mapping of the LPAR processors (vcpus) to physical processor
+chips (representing the "home" node) and tries to always dispatch vcpus
+on their associated physical processor chip. However, under certain
+scenarios, vcpus may be dispatched on a different processor chip (away
+from its home node).
+
+/proc/powerpc/vcpudispatch_stats can be used to obtain statistics
+related to the vcpu dispatch behavior. Writing '1' to this file enables
+collecting the statistics, while writing '0' disables the statistics.
+By default, the DTLB log for each vcpu is processed 50 times a second so
+as not to miss any entries. This processing frequency can be changed
+through /proc/powerpc/vcpudispatch_stats_freq.
+
+The statistics themselves are available by reading the procfs file
+/proc/powerpc/vcpudispatch_stats. Each line in the output corresponds to
+a vcpu as represented by the first field, followed by 8 numbers.
+
+The first number corresponds to:
+1. total vcpu dispatches since the beginning of statistics collection
+
+The next 4 numbers represent vcpu dispatch dispersions:
+2. number of times this vcpu was dispatched on the same processor as last
+ time
+3. number of times this vcpu was dispatched on a different processor core
+ as last time, but within the same chip
+4. number of times this vcpu was dispatched on a different chip
+5. number of times this vcpu was dispatches on a different socket/drawer
+(next numa boundary)
+
+The final 3 numbers represent statistics in relation to the home node of
+the vcpu:
+6. number of times this vcpu was dispatched in its home node (chip)
+7. number of times this vcpu was dispatched in a different node
+8. number of times this vcpu was dispatched in a node further away (numa
+distance)
+
+An example output:
+ $ sudo cat /proc/powerpc/vcpudispatch_stats
+ cpu0 6839 4126 2683 30 0 6821 18 0
+ cpu1 2515 1274 1229 12 0 2509 6 0
+ cpu2 2317 1198 1109 10 0 2312 5 0
+ cpu3 2259 1165 1088 6 0 2256 3 0
+ cpu4 2205 1143 1056 6 0 2202 3 0
+ cpu5 2165 1121 1038 6 0 2162 3 0
+ cpu6 2183 1127 1050 6 0 2180 3 0
+ cpu7 2193 1133 1052 8 0 2187 6 0
+ cpu8 2165 1115 1032 18 0 2156 9 0
+ cpu9 2301 1252 1033 16 0 2293 8 0
+ cpu10 2197 1138 1041 18 0 2187 10 0
+ cpu11 2273 1185 1062 26 0 2260 13 0
+ cpu12 2186 1125 1043 18 0 2177 9 0
+ cpu13 2161 1115 1030 16 0 2153 8 0
+ cpu14 2206 1153 1033 20 0 2196 10 0
+ cpu15 2163 1115 1032 16 0 2155 8 0
+
+In the output above, for vcpu0, there have been 6839 dispatches since
+statistics were enabled. 4126 of those dispatches were on the same
+physical cpu as the last time. 2683 were on a different core, but within
+the same chip, while 30 dispatches were on a different chip compared to
+its last dispatch.
+
+Also, out of the total of 6839 dispatches, we see that there have been
+6821 dispatches on the vcpu's home node, while 18 dispatches were
+outside its home node, on a neighbouring chip.