Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r-- Documentation/admin-guide/cgroup-v1/memory.rst | 1
-rw-r--r-- Documentation/admin-guide/cgroup-v2.rst | 38
-rw-r--r-- Documentation/admin-guide/kernel-parameters.txt | 24
-rw-r--r-- Documentation/admin-guide/media/mgb4.rst | 374
-rw-r--r-- Documentation/admin-guide/media/pci-cardlist.rst | 1
-rw-r--r-- Documentation/admin-guide/media/v4l-drivers.rst | 1
-rw-r--r-- Documentation/admin-guide/media/visl.rst | 6
-rw-r--r-- Documentation/admin-guide/mm/damon/usage.rst | 124
-rw-r--r-- Documentation/admin-guide/mm/ksm.rst | 11
-rw-r--r-- Documentation/admin-guide/mm/pagemap.rst | 89
-rw-r--r-- Documentation/admin-guide/mm/userfaultfd.rst | 35
-rw-r--r-- Documentation/admin-guide/module-signing.rst | 17
12 files changed, 665 insertions, 56 deletions
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index ff456871bf4b..ca7d9402f6be 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -551,6 +551,7 @@ memory.stat file includes following statistics:
event happens each time a page is unaccounted from the
cgroup.
swap # of bytes of swap usage
+ swapcached # of bytes of swap cached in memory
dirty # of bytes that are waiting to get written back to the disk.
writeback # of bytes of file/anon cache that are queued for syncing to
disk.
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f081459a5be..3f85254f3cef 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -210,6 +210,35 @@ cgroup v2 currently supports the following mount options.
relying on the original semantics (e.g. specifying bogusly
high 'bypass' protection values at higher tree levels).
+ memory_hugetlb_accounting
+ Count HugeTLB memory usage towards the cgroup's overall
+ memory usage for the memory controller (for the purpose of
+ statistics reporting and memory protection). This is a new
+ behavior that could regress existing setups, so it must be
+ explicitly opted in with this mount option.
+
+ A few caveats to keep in mind:
+
+ * There is no HugeTLB pool management involved in the memory
+ controller. The pre-allocated pool does not belong to anyone.
+ Specifically, when a new HugeTLB folio is allocated to
+ the pool, it is not accounted for from the perspective of the
+ memory controller. It is only charged to a cgroup when it is
+ actually used (e.g. at page fault time). Host memory
+ overcommit management has to consider this when configuring
+ hard limits. In general, HugeTLB pool management should be
+ done via other mechanisms (such as the HugeTLB controller).
+ * Failure to charge a HugeTLB folio to the memory controller
+ results in SIGBUS. This could happen even if the HugeTLB pool
+ still has pages available (but the cgroup limit is hit and
+ reclaim attempt fails).
+ * Charging HugeTLB memory towards the memory controller affects
+ memory protection and reclaim dynamics. Any userspace tuning
+ (e.g. of the low and min limits) needs to take this into account.
+ * HugeTLB pages utilized while this option is not selected
+ will not be tracked by the memory controller (even if cgroup
+ v2 is remounted later on).
+
Organizing Processes and Threads
--------------------------------
@@ -1539,6 +1568,15 @@ PAGE_SIZE multiple when read back.
collapsing an existing range of pages. This counter is not
present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
+ thp_swpout (npn)
+ Number of transparent hugepages which were swapped out in one piece
+ without splitting.
+
+ thp_swpout_fallback (npn)
+ Number of transparent hugepages which were split before swapout.
+ Usually because the kernel failed to allocate contiguous swap space
+ for the huge page.
+
memory.numa_stat
A read-only nested-keyed file which exists on non-root cgroups.
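The two counters added above can be combined into a rough health indicator for THP swapout. The following is a minimal Python sketch, assuming the counters appear in the cgroup's flat-keyed ``memory.stat`` file as ``key value`` lines; the helper names ``parse_flat_keyed`` and ``thp_swpout_fallback_ratio`` and the sample values are hypothetical, chosen only for illustration.

```python
def parse_flat_keyed(text):
    """Parse a flat-keyed cgroup file ("key value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def thp_swpout_fallback_ratio(stats):
    """Fraction of THP swapouts that had to split the huge page first.

    Returns None when no THP swapout has been recorded yet.
    """
    whole = stats.get("thp_swpout", 0)
    split = stats.get("thp_swpout_fallback", 0)
    total = whole + split
    return split / total if total else None


# Made-up sample contents; on a real system the text would be read from
# the memory.stat file of the cgroup of interest.
sample = "anon 4096\nthp_swpout 30\nthp_swpout_fallback 10\n"
ratio = thp_swpout_fallback_ratio(parse_flat_keyed(sample))
```

A ratio that stays high suggests that contiguous swap space is frequently unavailable for whole huge pages.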
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 07625c60aa8a..65731b060e3f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1335,6 +1335,7 @@
earlyprintk=dbgp[debugController#]
earlyprintk=pciserial[,force],bus:device.function[,baudrate]
earlyprintk=xdbc[xhciController#]
+ earlyprintk=bios
earlyprintk is useful when the kernel crashes before
the normal console is initialized. It is not enabled by
@@ -1365,6 +1366,8 @@
The sclp output can only be used on s390.
+ The bios output can only be used on SuperH.
+
The optional "force" to "pciserial" enables use of a
PCI device even when its classcode is not of the
UART class.
@@ -2224,7 +2227,7 @@
forcing Dual Address Cycle for PCI cards supporting
greater than 32-bit addressing.
- iommu.strict= [ARM64, X86] Configure TLB invalidation behaviour
+ iommu.strict= [ARM64, X86, S390] Configure TLB invalidation behaviour
Format: { "0" | "1" }
0 - Lazy mode.
Request that DMA unmap operations use deferred
@@ -3330,6 +3333,11 @@
mga= [HW,DRM]
+ microcode.force_minrev= [X86]
+ Format: <bool>
+ Enable or disable the microcode minimal revision
+ enforcement for the runtime microcode loader.
+
min_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory below this
physical address is ignored.
@@ -3588,6 +3596,13 @@
[NFS] set the TCP port on which the NFSv4 callback
channel should listen.
+ nfs.delay_retrans=
+ [NFS] specifies the number of times the NFSv4 client
+ retries the request before returning an EAGAIN error,
+ after a reply of NFS4ERR_DELAY from the server.
+ Only applies if the softerr mount option is enabled,
+ and the specified value is >= 0.
+
nfs.enable_ino64=
[NFS] enable 64-bit inode numbers.
If zero, the NFS client will fake up a 32-bit inode
@@ -5679,9 +5694,10 @@
s390_iommu= [HW,S390]
Set s390 IOTLB flushing mode
strict
- With strict flushing every unmap operation will result in
- an IOTLB flush. Default is lazy flushing before reuse,
- which is faster.
+ With strict flushing every unmap operation will result
+ in an IOTLB flush. Default is lazy flushing before
+ reuse, which is faster. Deprecated, equivalent to
+ iommu.strict=1.
s390_iommu_aperture= [KNL,S390]
Specifies the size of the per device DMA address space
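Since ``s390_iommu=strict`` is now documented as deprecated and equivalent to ``iommu.strict=1``, an existing command line can be migrated mechanically. The following is a small Python sketch of that substitution; the function name ``modernize_cmdline`` is hypothetical, and the whitespace-based tokenizing is a simplification that does not handle quoted parameter values containing spaces.

```python
def modernize_cmdline(cmdline):
    """Replace the deprecated s390_iommu=strict with iommu.strict=1.

    Simplification: parameters are split on whitespace, so quoted values
    that contain spaces are not supported by this sketch.
    """
    tokens = cmdline.split()
    return " ".join(
        "iommu.strict=1" if token == "s390_iommu=strict" else token
        for token in tokens
    )


# Illustrative command line only, not taken from a real system.
updated = modernize_cmdline("root=/dev/dasda1 s390_iommu=strict quiet")
```

Other parameters are passed through unchanged, so the helper is safe to run on command lines that never used the deprecated option.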
diff --git a/Documentation/admin-guide/media/mgb4.rst b/Documentation/admin-guide/media/mgb4.rst
new file mode 100644
index 000000000000..2977f74d7e26
--- /dev/null
+++ b/Documentation/admin-guide/media/mgb4.rst
@@ -0,0 +1,374 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+mgb4 sysfs interface
+====================
+
+The mgb4 driver provides a sysfs interface that is used to configure video
+stream related parameters (some of them must be set properly before the v4l2
+device can be opened) and obtain the video device/stream status.
+
+There are two types of parameters - global / PCI card related, found under
+``/sys/class/video4linux/videoX/device`` and module specific found under
+``/sys/class/video4linux/videoX``.
+
+
+Global (PCI card) parameters
+============================
+
+**module_type** (R):
+ Module type.
+
+ | 0 - No module present
+ | 1 - FPDL3
+ | 2 - GMSL
+
+**module_version** (R):
+ Module version number. Zero in case of a missing module.
+
+**fw_type** (R):
+ Firmware type.
+
+ | 1 - FPDL3
+ | 2 - GMSL
+
+**fw_version** (R):
+ Firmware version number.
+
+**serial_number** (R):
+ Card serial number. The format is::
+
+ PRODUCT-REVISION-SERIES-SERIAL
+
+ where each component is an 8-bit number.
+
+
+Common FPDL3/GMSL input parameters
+==================================
+
+**input_id** (R):
+ Input number ID, zero based.
+
+**oldi_lane_width** (RW):
+ Number of deserializer output lanes.
+
+ | 0 - single
+ | 1 - dual (default)
+
+**color_mapping** (RW):
+ Mapping of the incoming bits in the signal to the colour bits of the pixels.
+
+ | 0 - OLDI/JEIDA
+ | 1 - SPWG/VESA (default)
+
+**link_status** (R):
+ Video link status. If the link is locked, chips are properly connected and
+ communicating at the same speed and protocol. The link can be locked without
+ an active video stream.
+
+ A value of 0 is equivalent to the V4L2_IN_ST_NO_SYNC flag of the V4L2
+ VIDIOC_ENUMINPUT status bits.
+
+ | 0 - unlocked
+ | 1 - locked
+
+**stream_status** (R):
+ Video stream status. A stream is detected if the link is locked, the input
+ pixel clock is running and the DE signal is moving.
+
+ A value of 0 is equivalent to the V4L2_IN_ST_NO_SIGNAL flag of the V4L2
+ VIDIOC_ENUMINPUT status bits.
+
+ | 0 - not detected
+ | 1 - detected
+
+**video_width** (R):
+ Video stream width. This is the actual width as detected by the HW.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in the width
+ field of the v4l2_bt_timings struct.
+
+**video_height** (R):
+ Video stream height. This is the actual height as detected by the HW.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in the height
+ field of the v4l2_bt_timings struct.
+
+**vsync_status** (R):
+ The type of VSYNC pulses as detected by the video format detector.
+
+ The value is equivalent to the flags returned by VIDIOC_QUERY_DV_TIMINGS in
+ the polarities field of the v4l2_bt_timings struct.
+
+ | 0 - active low
+ | 1 - active high
+ | 2 - not available
+
+**hsync_status** (R):
+ The type of HSYNC pulses as detected by the video format detector.
+
+ The value is equivalent to the flags returned by VIDIOC_QUERY_DV_TIMINGS in
+ the polarities field of the v4l2_bt_timings struct.
+
+ | 0 - active low
+ | 1 - active high
+ | 2 - not available
+
+**vsync_gap_length** (RW):
+ If the incoming video signal does not contain synchronization VSYNC and
+ HSYNC pulses, these must be generated internally in the FPGA to achieve
+ the correct frame ordering. This value indicates how many "empty" pixels
+ (pixels with deasserted Data Enable signal) are necessary to generate the
+ internal VSYNC pulse.
+
+**hsync_gap_length** (RW):
+ If the incoming video signal does not contain synchronization VSYNC and
+ HSYNC pulses, these must be generated internally in the FPGA to achieve
+ the correct frame ordering. This value indicates how many "empty" pixels
+ (pixels with deasserted Data Enable signal) are necessary to generate the
+ internal HSYNC pulse. The value must be greater than 1 and smaller than
+ vsync_gap_length.
+
+**pclk_frequency** (R):
+ Input pixel clock frequency in kHz.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the pixelclock field of the v4l2_bt_timings struct.
+
+ *Note: The frequency_range parameter must be set properly first to get
+ a valid frequency here.*
+
+**hsync_width** (R):
+ Width of the HSYNC signal in PCLK clock ticks.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the hsync field of the v4l2_bt_timings struct.
+
+**vsync_width** (R):
+ Width of the VSYNC signal in PCLK clock ticks.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the vsync field of the v4l2_bt_timings struct.
+
+**hback_porch** (R):
+ Number of PCLK pulses between deassertion of the HSYNC signal and the first
+ valid pixel in the video line (marked by DE=1).
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the hbackporch field of the v4l2_bt_timings struct.
+
+**hfront_porch** (R):
+ Number of PCLK pulses between the end of the last valid pixel in the video
+ line (marked by DE=1) and assertion of the HSYNC signal.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the hfrontporch field of the v4l2_bt_timings struct.
+
+**vback_porch** (R):
+ Number of video lines between deassertion of the VSYNC signal and the video
+ line with the first valid pixel (marked by DE=1).
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the vbackporch field of the v4l2_bt_timings struct.
+
+**vfront_porch** (R):
+ Number of video lines between the end of the last valid pixel line (marked
+ by DE=1) and assertion of the VSYNC signal.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the vfrontporch field of the v4l2_bt_timings struct.
+
+**frequency_range** (RW)
+ PLL frequency range of the OLDI input clock generator. The PLL frequency is
+ derived from the Pixel Clock Frequency (PCLK) and is equal to PCLK if
+ oldi_lane_width is set to "single" and PCLK/2 if oldi_lane_width is set to
+ "dual".
+
+ | 0 - PLL < 50MHz (default)
+ | 1 - PLL >= 50MHz
+
+ *Note: This parameter can not be changed while the input v4l2 device is
+ open.*
+
+
+Common FPDL3/GMSL output parameters
+===================================
+
+**output_id** (R):
+ Output number ID, zero based.
+
+**video_source** (RW):
+ Output video source. If set to 0 or 1, the source is the corresponding card
+ input and the v4l2 output devices are disabled. If set to 2 or 3, the source
+ is the corresponding v4l2 video output device. The default is
+ the corresponding v4l2 output, i.e. 2 for OUT1 and 3 for OUT2.
+
+ | 0 - input 0
+ | 1 - input 1
+ | 2 - v4l2 output 0
+ | 3 - v4l2 output 1
+
+ *Note: This parameter can not be changed while ANY of the input/output v4l2
+ devices is open.*
+
+**display_width** (RW):
+ Display width. There is no autodetection of the connected display, so the
+ proper value must be set before the start of streaming. The default width
+ is 1280.
+
+ *Note: This parameter can not be changed while the output v4l2 device is
+ open.*
+
+**display_height** (RW):
+ Display height. There is no autodetection of the connected display, so the
+ proper value must be set before the start of streaming. The default height
+ is 640.
+
+ *Note: This parameter can not be changed while the output v4l2 device is
+ open.*
+
+**frame_rate** (RW):
+ Output video frame rate in frames per second. The default frame rate is
+ 60Hz.
+
+**hsync_polarity** (RW):
+ HSYNC signal polarity.
+
+ | 0 - active low (default)
+ | 1 - active high
+
+**vsync_polarity** (RW):
+ VSYNC signal polarity.
+
+ | 0 - active low (default)
+ | 1 - active high
+
+**de_polarity** (RW):
+ DE signal polarity.
+
+ | 0 - active low
+ | 1 - active high (default)
+
+**pclk_frequency** (RW):
+ Output pixel clock frequency. Allowed values are between 25000 and 190000 kHz,
+ and there is a non-linear stepping between two consecutive allowed
+ frequencies. The driver finds the nearest allowed frequency to the given
+ value and sets it. When reading this property, you get the exact
+ frequency set by the driver. The default frequency is 70000kHz.
+
+ *Note: This parameter can not be changed while the output v4l2 device is
+ open.*
+
+**hsync_width** (RW):
+ Width of the HSYNC signal in pixels. The default value is 16.
+
+**vsync_width** (RW):
+ Width of the VSYNC signal in video lines. The default value is 2.
+
+**hback_porch** (RW):
+ Number of PCLK pulses between deassertion of the HSYNC signal and the first
+ valid pixel in the video line (marked by DE=1). The default value is 32.
+
+**hfront_porch** (RW):
+ Number of PCLK pulses between the end of the last valid pixel in the video
+ line (marked by DE=1) and assertion of the HSYNC signal. The default value
+ is 32.
+
+**vback_porch** (RW):
+ Number of video lines between deassertion of the VSYNC signal and the video
+ line with the first valid pixel (marked by DE=1). The default value is 2.
+
+**vfront_porch** (RW):
+ Number of video lines between the end of the last valid pixel line (marked
+ by DE=1) and assertion of the VSYNC signal. The default value is 2.
+
+
+FPDL3 specific input parameters
+===============================
+
+**fpdl3_input_width** (RW):
+ Number of deserializer input lines.
+
+ | 0 - auto (default)
+ | 1 - single
+ | 2 - dual
+
+FPDL3 specific output parameters
+================================
+
+**fpdl3_output_width** (RW):
+ Number of serializer output lines.
+
+ | 0 - auto (default)
+ | 1 - single
+ | 2 - dual
+
+GMSL specific input parameters
+==============================
+
+**gmsl_mode** (RW):
+ GMSL speed mode.
+
+ | 0 - 12Gb/s (default)
+ | 1 - 6Gb/s
+ | 2 - 3Gb/s
+ | 3 - 1.5Gb/s
+
+**gmsl_stream_id** (RW):
+ The GMSL multi-stream contains up to four video streams. This parameter
+ selects which stream is captured by the video input. The value is the
+ zero-based index of the stream. The default stream id is 0.
+
+ *Note: This parameter can not be changed while the input v4l2 device is
+ open.*
+
+**gmsl_fec** (RW):
+ GMSL Forward Error Correction (FEC).
+
+ | 0 - disabled
+ | 1 - enabled (default)
+
+
+====================
+mgb4 mtd partitions
+====================
+
+The mgb4 driver creates a MTD device with two partitions:
+ - mgb4-fw.X - FPGA firmware.
+ - mgb4-data.X - Factory settings, e.g. card serial number.
+
+The *mgb4-fw* partition is writable and is used for FW updates, *mgb4-data* is
+read-only. The *X* attached to the partition name represents the card number.
+Depending on the CONFIG_MTD_PARTITIONED_MASTER kernel configuration, you may
+also have a third partition named *mgb4-flash* available in the system. This
+partition represents the card's whole, unpartitioned FLASH memory and should
+not be modified.
+
+====================
+mgb4 iio (triggers)
+====================
+
+The mgb4 driver creates an Industrial I/O (IIO) device that provides trigger and
+signal level status capability. The following scan elements are available:
+
+**activity**:
+ The trigger levels and pending status.
+
+ | bit 1 - trigger 1 pending
+ | bit 2 - trigger 2 pending
+ | bit 5 - trigger 1 level
+ | bit 6 - trigger 2 level
+
+**timestamp**:
+ The trigger event timestamp.
+
+The iio device can operate either in "raw" mode where you can fetch the signal
+levels (activity bits 5 and 6) using sysfs access or in triggered buffer mode.
+In the triggered buffer mode you can follow the signal level changes (activity
+bits 1 and 2) using the iio device in /dev. If you enable the timestamps, you
+will also get the exact trigger event time that can be matched to a video frame
+(every mgb4 video frame has a timestamp with the same clock source).
+
+*Note: although the activity sample always contains all the status bits, it makes
+no sense to get the pending bits in raw mode or the level bits in the triggered
+buffer mode - the values do not represent valid data in such case.*
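Two of the rules described in the new mgb4 document lend themselves to a short illustration: choosing ``frequency_range`` from the pixel clock and ``oldi_lane_width``, and decoding the IIO ``activity`` sample. The following is a Python sketch under stated assumptions. The function names are hypothetical and not part of the driver, and the bit numbering is assumed to be zero-based (bit ``n`` corresponds to mask ``1 << n``), which the document does not state explicitly.

```python
def pll_frequency_khz(pclk_khz, oldi_lane_width):
    """PLL frequency: PCLK for single lane (0), PCLK/2 for dual lane (1)."""
    if oldi_lane_width == 0:
        return pclk_khz
    if oldi_lane_width == 1:
        return pclk_khz / 2
    raise ValueError("oldi_lane_width must be 0 (single) or 1 (dual)")


def frequency_range(pclk_khz, oldi_lane_width):
    """Value for the frequency_range parameter: 0 if PLL < 50 MHz, else 1."""
    return 0 if pll_frequency_khz(pclk_khz, oldi_lane_width) < 50_000 else 1


def decode_activity(sample):
    """Split an activity sample into its documented status bits.

    Assumption: "bit n" in the document means mask (1 << n).
    """
    return {
        "trigger1_pending": bool(sample & (1 << 1)),
        "trigger2_pending": bool(sample & (1 << 2)),
        "trigger1_level": bool(sample & (1 << 5)),
        "trigger2_level": bool(sample & (1 << 6)),
    }
```

For example, a 70000 kHz pixel clock in dual lane mode gives a 35 MHz PLL frequency and therefore range 0, while the same clock in single lane mode needs range 1. As the note above says, only the level bits are meaningful in raw mode and only the pending bits in triggered buffer mode.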
diff --git a/Documentation/admin-guide/media/pci-cardlist.rst b/Documentation/admin-guide/media/pci-cardlist.rst
index 42528795d4da..7d8e3c8987db 100644
--- a/Documentation/admin-guide/media/pci-cardlist.rst
+++ b/Documentation/admin-guide/media/pci-cardlist.rst
@@ -77,6 +77,7 @@ ipu3-cio2 Intel ipu3-cio2 driver
ivtv Conexant cx23416/cx23415 MPEG encoder/decoder
ivtvfb Conexant cx23415 framebuffer
mantis MANTIS based cards
+mgb4 Digiteq Automotive MGB4 frame grabber
mxb Siemens-Nixdorf 'Multimedia eXtension Board'
netup-unidvb NetUP Universal DVB card
ngene Micronas nGene
diff --git a/Documentation/admin-guide/media/v4l-drivers.rst b/Documentation/admin-guide/media/v4l-drivers.rst
index 1c41f87c3917..61283d67ceef 100644
--- a/Documentation/admin-guide/media/v4l-drivers.rst
+++ b/Documentation/admin-guide/media/v4l-drivers.rst
@@ -17,6 +17,7 @@ Video4Linux (V4L) driver-specific documentation
imx7
ipu3
ivtv
+ mgb4
omap3isp
omap4_camera
philips
diff --git a/Documentation/admin-guide/media/visl.rst b/Documentation/admin-guide/media/visl.rst
index 7d2dc78341c9..4328c6c72d30 100644
--- a/Documentation/admin-guide/media/visl.rst
+++ b/Documentation/admin-guide/media/visl.rst
@@ -78,7 +78,7 @@ The trace events are defined on a per-codec basis, e.g.:
.. code-block:: bash
- $ ls /sys/kernel/debug/tracing/events/ | grep visl
+ $ ls /sys/kernel/tracing/events/ | grep visl
visl_fwht_controls
visl_h264_controls
visl_hevc_controls
@@ -90,13 +90,13 @@ For example, in order to dump HEVC SPS data:
.. code-block:: bash
- $ echo 1 > /sys/kernel/debug/tracing/events/visl_hevc_controls/v4l2_ctrl_hevc_sps/enable
+ $ echo 1 > /sys/kernel/tracing/events/visl_hevc_controls/v4l2_ctrl_hevc_sps/enable
The SPS data will be dumped to the trace buffer, i.e.:
.. code-block:: bash
- $ cat /sys/kernel/debug/tracing/trace
+ $ cat /sys/kernel/tracing/trace
video_parameter_set_id 0
seq_parameter_set_id 0
pic_width_in_luma_samples 1920
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 8da1b7281827..da94feb97ed1 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -20,18 +20,18 @@ DAMON provides below interfaces for different users.
you can write and use your personalized DAMON sysfs wrapper programs that
reads/writes the sysfs files instead of you. The `DAMON user space tool
<https://github.com/awslabs/damo>`_ is one example of such programs.
-- *debugfs interface. (DEPRECATED!)*
- :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
- <sysfs_interface>`. This is deprecated, so users should move to the
- :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
- move, please report your usecase to damon@lists.linux.dev and
- linux-mm@kvack.org.
- *Kernel Space Programming Interface.*
:doc:`This </mm/damon/api>` is for kernel space programmers. Using this,
users can utilize every feature of DAMON most flexibly and efficiently by
writing kernel space DAMON application programs for you. You can even extend
DAMON for various address spaces. For detail, please refer to the interface
:doc:`document </mm/damon/api>`.
+- *debugfs interface. (DEPRECATED!)*
+ :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
+ <sysfs_interface>`. This is deprecated, so users should move to the
+ :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
+ move, please report your usecase to damon@lists.linux.dev and
+ linux-mm@kvack.org.
.. _sysfs_interface:
@@ -76,7 +76,7 @@ comma (","). ::
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
│ │ │ │ │ schemes/nr_schemes
- │ │ │ │ │ │ 0/action
+ │ │ │ │ │ │ 0/action,apply_interval_us
│ │ │ │ │ │ │ access_pattern/
│ │ │ │ │ │ │ │ sz/min,max
│ │ │ │ │ │ │ │ nr_accesses/min,max
@@ -105,14 +105,12 @@ having the root permission could use this directory.
kdamonds/
---------
-The monitoring-related information including request specifications and results
-are called DAMON context. DAMON executes each context with a kernel thread
-called kdamond, and multiple kdamonds could run in parallel.
-
Under the ``admin`` directory, one directory, ``kdamonds``, which has files for
-controlling the kdamonds exist. In the beginning, this directory has only one
-file, ``nr_kdamonds``. Writing a number (``N``) to the file creates the number
-of child directories named ``0`` to ``N-1``. Each directory represents each
+controlling the kdamonds (refer to
+:ref:`design <damon_design_execution_model_and_data_structures>` for more
+details) exists. In the beginning, this directory has only one file,
+``nr_kdamonds``. Writing a number (``N``) to the file creates the number of
+child directories named ``0`` to ``N-1``. Each directory represents each
kdamond.
kdamonds/<N>/
@@ -150,9 +148,10 @@ kdamonds/<N>/contexts/
In the beginning, this directory has only one file, ``nr_contexts``. Writing a
number (``N``) to the file creates the number of child directories named as
-``0`` to ``N-1``. Each directory represents each monitoring context. At the
-moment, only one context per kdamond is supported, so only ``0`` or ``1`` can
-be written to the file.
+``0`` to ``N-1``. Each directory represents each monitoring context (refer to
+:ref:`design <damon_design_execution_model_and_data_structures>` for more
+details). At the moment, only one context per kdamond is supported, so only
+``0`` or ``1`` can be written to the file.
.. _sysfs_contexts:
@@ -270,8 +269,8 @@ schemes/<N>/
------------
In each scheme directory, five directories (``access_pattern``, ``quotas``,
-``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and one file
-(``action``) exist.
+``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files
+(``action`` and ``apply_interval_us``) exist.
The ``action`` file is for setting and getting the scheme's :ref:`action
<damon_design_damos_action>`. The keywords that can be written to and read
@@ -297,6 +296,9 @@ Note that support of each action depends on the running DAMON operations set
- ``stat``: Do nothing but count the statistics.
Supported by all operations sets.
+The ``apply_interval_us`` file is for setting and getting the scheme's
+:ref:`apply_interval <damon_design_damos>` in microseconds.
+
schemes/<N>/access_pattern/
---------------------------
@@ -392,7 +394,7 @@ pages of all memory cgroups except ``/having_care_already``.::
echo N > 1/matching
Note that ``anon`` and ``memcg`` filters are currently supported only when
-``paddr`` `implementation <sysfs_contexts>` is being used.
+``paddr`` :ref:`implementation <sysfs_contexts>` is being used.
Also, memory regions that are filtered out by ``addr`` or ``target`` filters
are not counted as the scheme has tried to those, while regions that filtered
@@ -430,9 +432,9 @@ that reading it returns the total size of the scheme tried regions, and creates
directories named integer starting from ``0`` under this directory. Each
directory contains files exposing detailed information about each of the memory
region that the corresponding scheme's ``action`` has tried to be applied under
-this directory, during next :ref:`aggregation interval
-<sysfs_monitoring_attrs>`. The information includes address range,
-``nr_accesses``, and ``age`` of the region.
+this directory, during next :ref:`apply interval <damon_design_damos>` of the
+corresponding scheme. The information includes address range, ``nr_accesses``,
+and ``age`` of the region.
Writing ``update_schemes_tried_bytes`` to the relevant ``kdamonds/<N>/state``
file will only update the ``total_bytes`` file, and will not create the
@@ -495,6 +497,62 @@ Please note that it's highly recommended to use user space tools like `damo
<https://github.com/awslabs/damo>`_ rather than manually reading and writing
the files as above. Above is only for an example.
+.. _tracepoint:
+
+Tracepoints for Monitoring Results
+==================================
+
+Users can get the monitoring results via the :ref:`tried_regions
+<sysfs_schemes_tried_regions>`. The interface is useful for getting a
+snapshot, but it could be inefficient for fully recording all the monitoring
+results. For that purpose, two tracepoints, namely ``damon:damon_aggregated``
+and ``damon:damos_before_apply``, are provided. ``damon:damon_aggregated``
+provides the whole monitoring results, while ``damon:damos_before_apply``
+provides the monitoring results for regions that each DAMON-based Operation
+Scheme (:ref:`DAMOS <damon_design_damos>`) is about to be applied to. Hence,
+``damon:damos_before_apply`` is more useful for recording the internal behavior
+of DAMOS, or for efficient, query-like recording of monitoring results that
+match the DAMOS target access
+:ref:`pattern <damon_design_damos_access_pattern>`.
+
+While the monitoring is turned on, you could record the tracepoint events and
+show results using tracepoint supporting tools like ``perf``. For example::
+
+ # echo on > monitor_on
+ # perf record -e damon:damon_aggregated &
+ # sleep 5
+ # kill 9 $(pidof perf)
+ # echo off > monitor_on
+ # perf script
+ kdamond.0 46568 [027] 79357.842179: damon:damon_aggregated: target_id=0 nr_regions=11 122509119488-135708762112: 0 864
+ [...]
+
+Each line of the perf script output represents one monitoring region. The
+first five fields are the same as in other tracepoint outputs. The sixth field
+(``target_id=X``) shows the id of the monitoring target of the region. The
+seventh field (``nr_regions=X``) shows the total number of monitoring regions
+for the target. The eighth field (``X-Y:``) shows the start (``X``) and end
+(``Y``) addresses of the region in bytes. The ninth field (``X``) shows the
+``nr_accesses`` of the region (refer to
+:ref:`design <damon_design_region_based_sampling>` for more details of the
+counter). Finally the tenth field (``X``) shows the ``age`` of the region
+(refer to :ref:`design <damon_design_age_tracking>` for more details of the
+counter).
+
+If the event was ``damon:damos_before_apply``, the ``perf script`` output would
+be somewhat like below::
+
+ kdamond.0 47293 [000] 80801.060214: damon:damos_before_apply: ctx_idx=0 scheme_idx=0 target_idx=0 nr_regions=11 121932607488-135128711168: 0 136
+ [...]
+
+Each line of the output represents one monitoring region that a DAMON-based
+Operation Scheme was about to be applied to at the traced time. The first five
+fields are as usual. In addition to the output of the ``damon_aggregated``
+tracepoint, it shows the index of the DAMON context (``ctx_idx=X``) of the
+scheme in the list of the contexts of the context's kdamond, and the index of
+the scheme (``scheme_idx=X``) in the list of the schemes of the context.
+
+
.. _debugfs_interface:
debugfs Interface (DEPRECATED!)
@@ -790,23 +848,3 @@ directory by putting the name of the context to the ``rm_contexts`` file. ::
Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on`` files are in the
root directory only.
-
-
-.. _tracepoint:
-
-Tracepoint for Monitoring Results
-=================================
-
-Users can get the monitoring results via the :ref:`tried_regions
-<sysfs_schemes_tried_regions>` or a tracepoint, ``damon:damon_aggregated``.
-While the tried regions directory is useful for getting a snapshot, the
-tracepoint is useful for getting a full record of the results. While the
-monitoring is turned on, you could record the tracepoint events and show
-results using tracepoint supporting tools like ``perf``. For example::
-
- # echo on > monitor_on
- # perf record -e damon:damon_aggregated &
- # sleep 5
- # kill 9 $(pidof perf)
- # echo off > monitor_on
- # perf script
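The field layout described for the ``damon:damon_aggregated`` tracepoint can be turned into a small parser for ``perf script`` output. The following is a Python sketch that assumes the line format matches the sample shown in the new documentation text; the function name ``parse_damon_aggregated`` is hypothetical, and ``damon:damos_before_apply`` lines, which carry extra index fields, are not handled.

```python
import re

# Fields six to ten of a damon:damon_aggregated line, as described above:
# target id, number of regions, start-end addresses, nr_accesses, age.
_AGGREGATED = re.compile(
    r"damon:damon_aggregated:\s+target_id=(?P<target_id>\d+)\s+"
    r"nr_regions=(?P<nr_regions>\d+)\s+"
    r"(?P<start>\d+)-(?P<end>\d+):\s+"
    r"(?P<nr_accesses>\d+)\s+(?P<age>\d+)"
)


def parse_damon_aggregated(line):
    """Return the region fields of one perf script line, or None."""
    match = _AGGREGATED.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


# Sample line copied from the documentation text above.
sample = (
    "kdamond.0 46568 [027] 79357.842179: damon:damon_aggregated: "
    "target_id=0 nr_regions=11 122509119488-135708762112: 0 864"
)
region = parse_damon_aggregated(sample)
region_size = region["end"] - region["start"]
```

Lines that do not match, such as the ``[...]`` elision or events from other tracepoints, yield ``None`` and can simply be skipped by the caller.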
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 776f244bdae4..e59231ac6bb7 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -155,6 +155,15 @@ stable_node_chains_prune_millisecs
scan. It's a noop if not a single KSM page hit the
``max_page_sharing`` yet.
+smart_scan
+ Historically KSM checked every candidate page for each scan. It did
+ not take into account historic information. When smart scan is
+ enabled, pages that have previously not been de-duplicated get
+ skipped. How often these pages are skipped depends on how often
+ de-duplication has already been tried and failed. By default this
+ optimization is enabled. The ``pages_skipped`` metric shows how
+ effective the setting is.
+
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
general_profit
@@ -169,6 +178,8 @@ pages_unshared
how many pages unique but repeatedly checked for merging
pages_volatile
how many pages changing too fast to be placed in a tree
+pages_skipped
+ how many pages did the "smart" page scanning algorithm skip
full_scans
how many times all mergeable areas have been scanned
stable_node_chains
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index c8f380271cad..fe17cf210426 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -227,3 +227,92 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
always 12 at most architectures). Since Linux 3.11 their meaning changes
after first clear of soft-dirty bits. Since Linux 4.2 they are used for
flags unconditionally.
+
+Pagemap Scan IOCTL
+==================
+
+The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get or optionally
+clear the info about page table entries. The following operations are supported
+in this IOCTL:
+
+- Scan the address range and get the memory ranges matching the provided criteria.
+ This is performed when the output buffer is specified.
+- Write-protect the pages. The ``PM_SCAN_WP_MATCHING`` is used to write-protect
+ the pages of interest. The ``PM_SCAN_CHECK_WPASYNC`` aborts the operation if
+ non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING`` can be
+ used with or without ``PM_SCAN_CHECK_WPASYNC``.
+- Both of those operations can be combined into one atomic operation where we can
+ get and write protect the pages as well.
+
+The following page flags are currently supported:
+
+- ``PAGE_IS_WPALLOWED`` - Page has async-write-protection enabled
+- ``PAGE_IS_WRITTEN`` - Page has been written to since it was write protected
+- ``PAGE_IS_FILE`` - Page is file backed
+- ``PAGE_IS_PRESENT`` - Page is present in the memory
+- ``PAGE_IS_SWAPPED`` - Page is swapped
+- ``PAGE_IS_PFNZERO`` - Page has zero PFN
+- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
+
+The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
+
+ 1. The size of the ``struct pm_scan_arg`` must be specified in the ``size``
+ field. This field will be helpful in recognizing the structure if extensions
+ are done later.
+ 2. The flags can be specified in the ``flags`` field. ``PM_SCAN_WP_MATCHING``
+ and ``PM_SCAN_CHECK_WPASYNC`` are the only flags supported at this time. The
+ get operation is performed only if an output buffer is provided.
+ 3. The range is specified through ``start`` and ``end``.
+ 4. The walk can abort before visiting the complete range, for example when
+ the user buffer becomes full. The walk ending address is reported in ``walk_end``.
+ 5. The output buffer of ``struct page_region`` array and size is specified in
+ ``vec`` and ``vec_len``.
+ 6. The optional maximum number of requested pages is specified in ``max_pages``.
+ 7. The masks are specified in ``category_mask``, ``category_anyof_mask``,
+ ``category_inverted`` and ``return_mask``.
+
+Find pages which have been written and WP them as well::
+
+ struct pm_scan_arg arg = {
+ .size = sizeof(arg),
+ .flags = PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC,
+ ..
+ .category_mask = PAGE_IS_WRITTEN,
+ .return_mask = PAGE_IS_WRITTEN,
+ };
+
+Find pages which have been written, are file backed, not swapped and either
+present or huge::
+
+ struct pm_scan_arg arg = {
+ .size = sizeof(arg),
+ .flags = 0,
+ ..
+ .category_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE | PAGE_IS_SWAPPED,
+ .category_inverted = PAGE_IS_SWAPPED,
+ .category_anyof_mask = PAGE_IS_PRESENT | PAGE_IS_HUGE,
+ .return_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED |
+ PAGE_IS_PRESENT | PAGE_IS_HUGE,
+ };
+
+The ``PAGE_IS_WRITTEN`` flag can be considered a better-performing alternative
+to the soft-dirty flag. It is not affected by VMA merging in the kernel, so the
+user can find the truly soft-dirty pages in the case of normal pages. (There may
+still be extra dirty pages reported for THP or Hugetlb pages.)
+
+The ``PAGE_IS_WRITTEN`` category is used with uffd write protect-enabled ranges to
+implement memory dirty tracking in userspace:
+
+ 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
+ 2. The ``UFFD_FEATURE_WP_UNPOPULATED`` and ``UFFD_FEATURE_WP_ASYNC`` features
+ are set by ``UFFDIO_API`` IOCTL.
+ 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
+ through ``UFFDIO_REGISTER`` IOCTL.
+ 4. Then any part of the registered memory, or the whole memory region, must
+ be write protected using either the ``PAGEMAP_SCAN`` IOCTL with the
+ ``PM_SCAN_WP_MATCHING`` flag or the ``UFFDIO_WRITEPROTECT`` IOCTL. Both perform
+ the same operation, but the former performs better.
+ 5. Now the ``PAGEMAP_SCAN`` IOCTL can be used to find pages which have been
+ written to since they were last write protected and, optionally, to write
+ protect those pages again.
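The two ``struct pm_scan_arg`` examples in the pagemap.rst hunk above suggest how the masks combine: every bit in ``category_mask`` must match, bits in ``category_inverted`` flip the sense of a test, and at least one bit of ``category_anyof_mask`` must match when that mask is non-zero. The sketch below models that reading in plain C. It is an assumption drawn from the examples rather than a copy of the kernel's logic, and the ``EX_*`` bit values and ``ex_*`` names are hypothetical stand-ins, not the uapi definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative bit values only. Real programs should use the PAGE_IS_*
 * definitions from the kernel uapi headers instead.
 */
#define EX_WRITTEN	(1ULL << 0)
#define EX_PRESENT	(1ULL << 1)
#define EX_SWAPPED	(1ULL << 2)
#define EX_HUGE		(1ULL << 3)

/* Hypothetical container mirroring the three matching masks. */
struct ex_masks {
	uint64_t category_inverted;
	uint64_t category_mask;
	uint64_t category_anyof_mask;
};

/* Return true if a page with the given categories satisfies the masks. */
static bool ex_page_matches(uint64_t categories, const struct ex_masks *m)
{
	/* Flip inverted bits so "not swapped" is tested as a set bit. */
	categories ^= m->category_inverted;

	/* Every bit in category_mask must be set after inversion. */
	if ((categories & m->category_mask) != m->category_mask)
		return false;

	/* If an any-of mask is given, at least one of its bits must be set. */
	if (m->category_anyof_mask &&
	    !(categories & m->category_anyof_mask))
		return false;

	return true;
}
```

With ``category_mask = EX_WRITTEN | EX_SWAPPED``, ``category_inverted = EX_SWAPPED`` and ``category_anyof_mask = EX_PRESENT | EX_HUGE``, a page that is written and present matches, while a page that is written, present and swapped does not.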
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 4349a8c2b978..203e26da5f92 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -244,6 +244,41 @@ write-protected (so future writes will also result in a WP fault). These ioctls
support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
respectively) to configure the mapping this way.
+If the userfaultfd context has ``UFFD_FEATURE_WP_ASYNC`` feature bit set,
+any vma registered with write-protection will work in async mode rather
+than the default sync mode.
+
+In async mode, no message is generated when a write operation happens;
+instead, the write-protection is resolved automatically by the kernel.
+It can be seen as a more accurate version of soft-dirty tracking, and it
+differs in a few ways:
+
+ - The dirty result will not be affected by vma changes (e.g. vma
+ merging) because the dirty state is tracked only by the pte.
+
+ - It supports range operations by default, so one can enable tracking on
+ any range of memory as long as it is page aligned.
+
+ - Dirty information will not get lost if the pte was zapped due to
+ various reasons (e.g. during split of a shmem transparent huge page).
+
+ - Because its meaning is inverted relative to soft-dirty (page clean when
+ uffd-wp bit set; dirty when uffd-wp bit cleared), it has different
+ semantics on some memory operations. For example, ``MADV_DONTNEED`` on
+ anonymous memory (or ``MADV_REMOVE`` on a file mapping) will be treated as
+ dirtying of memory, because the uffd-wp bit is dropped during the procedure.
+
+The user app can collect the "written/dirty" status by looking up the
+uffd-wp bit for the pages of interest in /proc/pagemap.
+
+A page is not tracked by uffd-wp async mode until it is explicitly
+write-protected by ``ioctl(UFFDIO_WRITEPROTECT)`` with the mode flag
+``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page fault
+that was tracked by async mode userfaultfd-wp is invalid.
+
+When userfaultfd-wp async mode is used alone, it can be applied to all
+kinds of memory.
+
Memory Poisoning Emulation
---------------------------
diff --git a/Documentation/admin-guide/module-signing.rst b/Documentation/admin-guide/module-signing.rst
index 2898b2703297..a8667a777490 100644
--- a/Documentation/admin-guide/module-signing.rst
+++ b/Documentation/admin-guide/module-signing.rst
@@ -28,10 +28,10 @@ trusted userspace bits.
This facility uses X.509 ITU-T standard certificates to encode the public keys
involved. The signatures are not themselves encoded in any industrial standard
-type. The facility currently only supports the RSA public key encryption
-standard (though it is pluggable and permits others to be used). The possible
-hash algorithms that can be used are SHA-1, SHA-224, SHA-256, SHA-384, and
-SHA-512 (the algorithm is selected by data in the signature).
+type. The built-in facility currently only supports the RSA and NIST P-384 ECDSA
+public key signing standards (though it is pluggable and permits others to be
+used). The possible hash algorithms that can be used are SHA-2 and SHA-3 of
+sizes 256, 384, and 512 (the algorithm is selected by data in the signature).
==========================
@@ -81,11 +81,12 @@ This has a number of options available:
sign the modules with:
=============================== ==========================================
- ``CONFIG_MODULE_SIG_SHA1`` :menuselection:`Sign modules with SHA-1`
- ``CONFIG_MODULE_SIG_SHA224`` :menuselection:`Sign modules with SHA-224`
``CONFIG_MODULE_SIG_SHA256`` :menuselection:`Sign modules with SHA-256`
``CONFIG_MODULE_SIG_SHA384`` :menuselection:`Sign modules with SHA-384`
``CONFIG_MODULE_SIG_SHA512`` :menuselection:`Sign modules with SHA-512`
+ ``CONFIG_MODULE_SIG_SHA3_256`` :menuselection:`Sign modules with SHA3-256`
+ ``CONFIG_MODULE_SIG_SHA3_384`` :menuselection:`Sign modules with SHA3-384`
+ ``CONFIG_MODULE_SIG_SHA3_512`` :menuselection:`Sign modules with SHA3-512`
=============================== ==========================================
The algorithm selected here will also be built into the kernel (rather
@@ -145,6 +146,10 @@ into vmlinux) using parameters in the::
file (which is also generated if it does not already exist).
+One can select between RSA (``MODULE_SIG_KEY_TYPE_RSA``) and ECDSA
+(``MODULE_SIG_KEY_TYPE_ECDSA``) to generate either an RSA 4k or a NIST
+P-384 keypair.
+
It is strongly recommended that you provide your own x509.genkey file.
Most notably, in the x509.genkey file, the req_distinguished_name section