summaryrefslogtreecommitdiff
path: root/Documentation/vm
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/vm')
-rw-r--r--Documentation/vm/damon/api.rst20
-rw-r--r--Documentation/vm/damon/design.rst166
-rw-r--r--Documentation/vm/damon/faq.rst51
-rw-r--r--Documentation/vm/damon/index.rst30
-rw-r--r--Documentation/vm/hmm.rst19
-rw-r--r--Documentation/vm/hwpoison.rst1
-rw-r--r--Documentation/vm/index.rst1
-rw-r--r--Documentation/vm/memory-model.rst45
-rw-r--r--Documentation/vm/slub.rst10
-rw-r--r--Documentation/vm/unevictable-lru.rst33
-rw-r--r--Documentation/vm/zswap.rst4
11 files changed, 308 insertions, 72 deletions
diff --git a/Documentation/vm/damon/api.rst b/Documentation/vm/damon/api.rst
new file mode 100644
index 000000000000..08f34df45523
--- /dev/null
+++ b/Documentation/vm/damon/api.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+API Reference
+=============
+
+Kernel space programs can use every feature of DAMON using below APIs. All you
+need to do is including ``damon.h``, which is located in ``include/linux/`` of
+the source tree.
+
+Structures
+==========
+
+.. kernel-doc:: include/linux/damon.h
+
+
+Functions
+=========
+
+.. kernel-doc:: mm/damon/core.c
diff --git a/Documentation/vm/damon/design.rst b/Documentation/vm/damon/design.rst
new file mode 100644
index 000000000000..b05159c295f4
--- /dev/null
+++ b/Documentation/vm/damon/design.rst
@@ -0,0 +1,166 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+Design
+======
+
+Configurable Layers
+===================
+
+DAMON provides data access monitoring functionality while making the accuracy
+and the overhead controllable. The fundamental access monitorings require
+primitives that dependent on and optimized for the target address space. On
+the other hand, the accuracy and overhead tradeoff mechanism, which is the core
+of DAMON, is in the pure logic space. DAMON separates the two parts in
+different layers and defines its interface to allow various low level
+primitives implementations configurable with the core logic.
+
+Due to this separated design and the configurable interface, users can extend
+DAMON for any address space by configuring the core logics with appropriate low
+level primitive implementations. If appropriate one is not provided, users can
+implement the primitives on their own.
+
+For example, physical memory, virtual memory, swap space, those for specific
+processes, NUMA nodes, files, and backing memory devices would be supportable.
+Also, if some architectures or devices support special optimized access check
+primitives, those will be easily configurable.
+
+
+Reference Implementations of Address Space Specific Primitives
+==============================================================
+
+The low level primitives for the fundamental access monitoring are defined in
+two parts:
+
+1. Identification of the monitoring target address range for the address space.
+2. Access check of specific address range in the target space.
+
+DAMON currently provides the implementation of the primitives for only the
+virtual address spaces. Below two subsections describe how it works.
+
+
+VMA-based Target Address Range Construction
+-------------------------------------------
+
+Only small parts in the super-huge virtual address space of the processes are
+mapped to the physical memory and accessed. Thus, tracking the unmapped
+address regions is just wasteful. However, because DAMON can deal with some
+level of noise using the adaptive regions adjustment mechanism, tracking every
+mapping is not strictly required but could even incur a high overhead in some
+cases. That said, too huge unmapped areas inside the monitoring target should
+be removed to not take the time for the adaptive mechanism.
+
+For the reason, this implementation converts the complex mappings to three
+distinct regions that cover every mapped area of the address space. The two
+gaps between the three regions are the two biggest unmapped areas in the given
+address space. The two biggest unmapped areas would be the gap between the
+heap and the uppermost mmap()-ed region, and the gap between the lowermost
+mmap()-ed region and the stack in most of the cases. Because these gaps are
+exceptionally huge in usual address spaces, excluding these will be sufficient
+to make a reasonable trade-off. Below shows this in detail::
+
+ <heap>
+ <BIG UNMAPPED REGION 1>
+ <uppermost mmap()-ed region>
+ (small mmap()-ed regions and munmap()-ed regions)
+ <lowermost mmap()-ed region>
+ <BIG UNMAPPED REGION 2>
+ <stack>
+
+
+PTE Accessed-bit Based Access Check
+-----------------------------------
+
+The implementation for the virtual address space uses PTE Accessed-bit for
+basic access checks. It finds the relevant PTE Accessed bit from the address
+by walking the page table for the target task of the address. In this way, the
+implementation finds and clears the bit for next sampling target address and
+checks whether the bit set again after one sampling period. This could disturb
+other kernel subsystems using the Accessed bits, namely Idle page tracking and
+the reclaim logic. To avoid such disturbances, DAMON makes it mutually
+exclusive with Idle page tracking and uses ``PG_idle`` and ``PG_young`` page
+flags to solve the conflict with the reclaim logic, as Idle page tracking does.
+
+
+Address Space Independent Core Mechanisms
+=========================================
+
+Below four sections describe each of the DAMON core mechanisms and the five
+monitoring attributes, ``sampling interval``, ``aggregation interval``,
+``regions update interval``, ``minimum number of regions``, and ``maximum
+number of regions``.
+
+
+Access Frequency Monitoring
+---------------------------
+
+The output of DAMON says what pages are how frequently accessed for a given
+duration. The resolution of the access frequency is controlled by setting
+``sampling interval`` and ``aggregation interval``. In detail, DAMON checks
+access to each page per ``sampling interval`` and aggregates the results. In
+other words, counts the number of the accesses to each page. After each
+``aggregation interval`` passes, DAMON calls callback functions that previously
+registered by users so that users can read the aggregated results and then
+clears the results. This can be described in below simple pseudo-code::
+
+ while monitoring_on:
+ for page in monitoring_target:
+ if accessed(page):
+ nr_accesses[page] += 1
+ if time() % aggregation_interval == 0:
+ for callback in user_registered_callbacks:
+ callback(monitoring_target, nr_accesses)
+ for page in monitoring_target:
+ nr_accesses[page] = 0
+ sleep(sampling interval)
+
+The monitoring overhead of this mechanism will arbitrarily increase as the
+size of the target workload grows.
+
+
+Region Based Sampling
+---------------------
+
+To avoid the unbounded increase of the overhead, DAMON groups adjacent pages
+that assumed to have the same access frequencies into a region. As long as the
+assumption (pages in a region have the same access frequencies) is kept, only
+one page in the region is required to be checked. Thus, for each ``sampling
+interval``, DAMON randomly picks one page in each region, waits for one
+``sampling interval``, checks whether the page is accessed meanwhile, and
+increases the access frequency of the region if so. Therefore, the monitoring
+overhead is controllable by setting the number of regions. DAMON allows users
+to set the minimum and the maximum number of regions for the trade-off.
+
+This scheme, however, cannot preserve the quality of the output if the
+assumption is not guaranteed.
+
+
+Adaptive Regions Adjustment
+---------------------------
+
+Even somehow the initial monitoring target regions are well constructed to
+fulfill the assumption (pages in same region have similar access frequencies),
+the data access pattern can be dynamically changed. This will result in low
+monitoring quality. To keep the assumption as much as possible, DAMON
+adaptively merges and splits each region based on their access frequency.
+
+For each ``aggregation interval``, it compares the access frequencies of
+adjacent regions and merges those if the frequency difference is small. Then,
+after it reports and clears the aggregated access frequency of each region, it
+splits each region into two or three regions if the total number of regions
+will not exceed the user-specified maximum number of regions after the split.
+
+In this way, DAMON provides its best-effort quality and minimal overhead while
+keeping the bounds users set for their trade-off.
+
+
+Dynamic Target Space Updates Handling
+-------------------------------------
+
+The monitoring target address range could dynamically changed. For example,
+virtual memory could be dynamically mapped and unmapped. Physical memory could
+be hot-plugged.
+
+As the changes could be quite frequent in some cases, DAMON checks the dynamic
+memory mapping changes and applies it to the abstracted target area only for
+each of a user-specified time interval (``regions update interval``).
diff --git a/Documentation/vm/damon/faq.rst b/Documentation/vm/damon/faq.rst
new file mode 100644
index 000000000000..cb3d8b585a8b
--- /dev/null
+++ b/Documentation/vm/damon/faq.rst
@@ -0,0 +1,51 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Frequently Asked Questions
+==========================
+
+Why a new subsystem, instead of extending perf or other user space tools?
+=========================================================================
+
+First, because it needs to be lightweight as much as possible so that it can be
+used online, any unnecessary overhead such as kernel - user space context
+switching cost should be avoided. Second, DAMON aims to be used by other
+programs including the kernel. Therefore, having a dependency on specific
+tools like perf is not desirable. These are the two biggest reasons why DAMON
+is implemented in the kernel space.
+
+
+Can 'idle pages tracking' or 'perf mem' substitute DAMON?
+=========================================================
+
+Idle page tracking is a low level primitive for access check of the physical
+address space. 'perf mem' is similar, though it can use sampling to minimize
+the overhead. On the other hand, DAMON is a higher-level framework for the
+monitoring of various address spaces. It is focused on memory management
+optimization and provides sophisticated accuracy/overhead handling mechanisms.
+Therefore, 'idle pages tracking' and 'perf mem' could provide a subset of
+DAMON's output, but cannot substitute DAMON.
+
+
+Does DAMON support virtual memory only?
+=======================================
+
+No. The core of the DAMON is address space independent. The address space
+specific low level primitive parts including monitoring target regions
+constructions and actual access checks can be implemented and configured on the
+DAMON core by the users. In this way, DAMON users can monitor any address
+space with any access check technique.
+
+Nonetheless, DAMON provides vma tracking and PTE Accessed bit check based
+implementations of the address space dependent functions for the virtual memory
+by default, for a reference and convenient use. In near future, we will
+provide those for physical memory address space.
+
+
+Can I simply monitor page granularity?
+======================================
+
+Yes. You can do so by setting the ``min_nr_regions`` attribute higher than the
+working set size divided by the page size. Because the monitoring target
+regions size is forced to be ``>=page size``, the region split will make no
+effect.
diff --git a/Documentation/vm/damon/index.rst b/Documentation/vm/damon/index.rst
new file mode 100644
index 000000000000..a2858baf3bf1
--- /dev/null
+++ b/Documentation/vm/damon/index.rst
@@ -0,0 +1,30 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+DAMON: Data Access MONitor
+==========================
+
+DAMON is a data access monitoring framework subsystem for the Linux kernel.
+The core mechanisms of DAMON (refer to :doc:`design` for the detail) make it
+
+ - *accurate* (the monitoring output is useful enough for DRAM level memory
+ management; It might not appropriate for CPU Cache levels, though),
+ - *light-weight* (the monitoring overhead is low enough to be applied online),
+ and
+ - *scalable* (the upper-bound of the overhead is in constant range regardless
+ of the size of target workloads).
+
+Using this framework, therefore, the kernel's memory management mechanisms can
+make advanced decisions. Experimental memory management optimization works
+that incurring high data accesses monitoring overhead could implemented again.
+In user space, meanwhile, users who have some special workloads can write
+personalized applications for better understanding and optimizations of their
+workloads and systems.
+
+.. toctree::
+ :maxdepth: 2
+
+ faq
+ design
+ api
+ plans
diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and shared common code:
walks to fill in the ``args->src`` array with PFNs to be migrated.
The ``invalidate_range_start()`` callback is passed a
``struct mmu_notifier_range`` with the ``event`` field set to
- ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+ ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
allows the device driver to skip the invalidation callback and only
invalidate device private MMU mappings that are actually migrating.
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:
The lock can now be released.
+Exclusive access memory
+=======================
+
+Some devices have features such as atomic PTE bits that can be used to implement
+atomic access to system memory. To support atomic operations to a shared virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resovled by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
Memory cgroup (memcg) and rss accounting
========================================
diff --git a/Documentation/vm/hwpoison.rst b/Documentation/vm/hwpoison.rst
index a5c884293dac..89b5f7a52077 100644
--- a/Documentation/vm/hwpoison.rst
+++ b/Documentation/vm/hwpoison.rst
@@ -180,7 +180,6 @@ Limitations
===========
- Not all page types are supported and never will. Most kernel internal
objects cannot be recovered, only LRU pages for now.
-- Right now hugepage support is missing.
---
Andi Kleen, Oct 2009
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..b51f0d8992f8 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -32,6 +32,7 @@ descriptions of data structures and algorithms.
arch_pgtable_helpers
balance
cleancache
+ damon/index
free_page_reporting
frontswap
highmem
diff --git a/Documentation/vm/memory-model.rst b/Documentation/vm/memory-model.rst
index ce398a7dc6cd..30e8fbed6914 100644
--- a/Documentation/vm/memory-model.rst
+++ b/Documentation/vm/memory-model.rst
@@ -14,15 +14,11 @@ for the CPU. Then there could be several contiguous ranges at
completely distinct addresses. And, don't forget about NUMA, where
different memory banks are attached to different CPUs.
-Linux abstracts this diversity using one of the three memory models:
-FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what
+Linux abstracts this diversity using one of the two memory models:
+FLATMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.
-.. note::
- At time of this writing, DISCONTIGMEM is considered deprecated,
- although it is still in use by several architectures.
-
All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.
@@ -63,43 +59,6 @@ straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at address different from 0.
-DISCONTIGMEM
-============
-
-The DISCONTIGMEM model treats the physical memory as a collection of
-`nodes` similarly to how Linux NUMA support does. For each node Linux
-constructs an independent memory management subsystem represented by
-`struct pglist_data` (or `pg_data_t` for short). Among other
-things, `pg_data_t` holds the `node_mem_map` array that maps
-physical pages belonging to that node. The `node_start_pfn` field of
-`pg_data_t` is the number of the first page frame belonging to that
-node.
-
-The architecture setup code should call :c:func:`free_area_init_node` for
-each node in the system to initialize the `pg_data_t` object and its
-`node_mem_map`.
-
-Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` -
-every physical page frame in a node has a `struct page` entry in the
-`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
-`flags` field of the `struct page` encodes the node number of the
-node hosting that page.
-
-The conversion between a PFN and the `struct page` in the
-DISCONTIGMEM model became slightly more complex as it has to determine
-which node hosts the physical page and which `pg_data_t` object
-holds the `struct page`.
-
-Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
-to convert PFN to the node number. The opposite conversion helper
-:c:func:`page_to_nid` is generic as it uses the node number encoded in
-page->flags.
-
-Once the node number is known, the PFN can be used to index
-appropriate `node_mem_map` array to access the `struct page` and
-the offset of the `struct page` from the `node_mem_map` plus
-`node_start_pfn` is the PFN of that page.
-
SPARSEMEM
=========
diff --git a/Documentation/vm/slub.rst b/Documentation/vm/slub.rst
index 03f294a638bd..d3028554b1e9 100644
--- a/Documentation/vm/slub.rst
+++ b/Documentation/vm/slub.rst
@@ -181,7 +181,7 @@ SLUB Debug output
Here is a sample of slub debug output::
====================================================================
- BUG kmalloc-8: Redzone overwritten
+ BUG kmalloc-8: Right Redzone overwritten
--------------------------------------------------------------------
INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc
@@ -189,10 +189,10 @@ Here is a sample of slub debug output::
INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58
INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554
- Bytes b4 0xc90f6d10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
- Object 0xc90f6d20: 31 30 31 39 2e 30 30 35 1019.005
- Redzone 0xc90f6d28: 00 cc cc cc .
- Padding 0xc90f6d50: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
+ Bytes b4 (0xc90f6d10): 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
+ Object (0xc90f6d20): 31 30 31 39 2e 30 30 35 1019.005
+ Redzone (0xc90f6d28): 00 cc cc cc .
+ Padding (0xc90f6d50): 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
[<c010523d>] dump_trace+0x63/0x1eb
[<c01053df>] show_trace_log_lvl+0x1a/0x2f
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 0e1490524f53..eae3af17f2d9 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone statistics for the number of
mlocked pages. Note, however, that at this point we haven't checked whether
the page is mapped by other VM_LOCKED VMAs.
-We can't call try_to_munlock(), the function that walks the reverse map to
+We can't call page_mlock(), the function that walks the reverse map to
check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
-try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
+page_mlock() is a variant of try_to_unmap() and thus requires that the page
not be on an LRU list [more on these below]. However, the call to
-isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So,
+isolate_lru_page() could fail, in which case we can't call page_mlock(). So,
we go ahead and clear PG_mlocked up front, as this might be the only chance we
-have. If we can successfully isolate the page, we go ahead and
-try_to_munlock(), which will restore the PG_mlocked flag and update the zone
+have. If we can successfully isolate the page, we go ahead and call
+page_mlock(), which will restore the PG_mlocked flag and update the zone
page statistics if it finds another VMA holding the page mlocked. If we fail
to isolate the page, we'll have left a potentially mlocked page on the LRU.
This is fine, because we'll catch it later if and if vmscan tries to reclaim
@@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
holepunching, and truncation of file pages and their anonymous COWed pages.
-try_to_munlock() Reverse Map Scan
+page_mlock() Reverse Map Scan
---------------------------------
-.. warning::
- [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
- page_referenced() reverse map walker.
-
When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
Handling <munlock_munlockall_handling>` above] tries to munlock a
page, it needs to determine whether or not the page is mapped by any
VM_LOCKED VMA without actually attempting to unmap all PTEs from the
page. For this purpose, the unevictable/mlock infrastructure
-introduced a variant of try_to_unmap() called try_to_munlock().
+introduced a variant of try_to_unmap() called page_mlock().
-try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file and KSM pages with a flag argument specifying unlock versus unmap
-processing. Again, these functions walk the respective reverse maps looking
-for VM_LOCKED VMAs. When such a VMA is found, as in the try_to_unmap() case,
-the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK. This
-undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
+page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When
+such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the
+pre-clearing of the page's PG_mlocked done by munlock_vma_page.
-Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
+Note that page_mlock()'s reverse map walk must visit every VMA in a page's
reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
However, the scan can terminate when it encounters a VM_LOCKED VMA.
-Although try_to_munlock() might be called a great many times when munlocking a
+Although page_mlock() might be called a great many times when munlocking a
large region or tearing down a large address space that has been mlocked via
mlockall(), overall this is a fairly rare event.
@@ -602,7 +595,7 @@ inactive lists to the appropriate node's unevictable list.
shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
after shrink_active_list() had moved them to the inactive list, or pages mapped
into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
-recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter,
+recheck via page_mlock(). shrink_inactive_list() won't notice the latter,
but will pass on to shrink_page_list().
shrink_page_list() again culls obviously unevictable pages that it could
diff --git a/Documentation/vm/zswap.rst b/Documentation/vm/zswap.rst
index d8d9fa4a1f0d..8edb8d578caf 100644
--- a/Documentation/vm/zswap.rst
+++ b/Documentation/vm/zswap.rst
@@ -10,7 +10,7 @@ Overview
Zswap is a lightweight compressed cache for swap pages. It takes pages that are
in the process of being swapped out and attempts to compress them into a
dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles
-for potentially reduced swap I/O.  This trade-off can also result in a
+for potentially reduced swap I/O. This trade-off can also result in a
significant performance improvement if reads from the compressed cache are
faster than reads from a swap device.
@@ -26,7 +26,7 @@ faster than reads from a swap device.
performance impact of swapping.
* Overcommitted guests that share a common I/O resource can
dramatically reduce their swap I/O pressure, avoiding heavy handed I/O
- throttling by the hypervisor. This allows more work to get done with less
+ throttling by the hypervisor. This allows more work to get done with less
impact to the guest workload and guests sharing the I/O subsystem
* Users with SSDs as swap devices can extend the life of the device by
drastically reducing life-shortening writes.