summaryrefslogtreecommitdiff
path: root/Documentation/mm/unevictable-lru.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/mm/unevictable-lru.rst')
-rw-r--r--Documentation/mm/unevictable-lru.rst151
1 files changed, 78 insertions, 73 deletions
diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst
index 4a0e158aa9ce..53e59433497a 100644
--- a/Documentation/mm/unevictable-lru.rst
+++ b/Documentation/mm/unevictable-lru.rst
@@ -12,7 +12,7 @@ Introduction
This document describes the Linux memory manager's "Unevictable LRU"
infrastructure and the use of this to manage several types of "unevictable"
-pages.
+folios.
The document attempts to provide the overall rationale behind this mechanism
and the rationale for some of the design decisions that drove the
@@ -27,8 +27,8 @@ The Unevictable LRU
===================
The Unevictable LRU facility adds an additional LRU list to track unevictable
-pages and to hide these pages from vmscan. This mechanism is based on a patch
-by Larry Woodman of Red Hat to address several scalability problems with page
+folios and to hide these folios from vmscan. This mechanism is based on a patch
+by Larry Woodman of Red Hat to address several scalability problems with folio
reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems.
@@ -52,40 +52,41 @@ The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.
-The Unevictable LRU Page List
------------------------------
+The Unevictable LRU Folio List
+------------------------------
-The Unevictable LRU page list is a lie. It was never an LRU-ordered list, but a
-companion to the LRU-ordered anonymous and file, active and inactive page lists;
-and now it is not even a page list. But following familiar convention, here in
-this document and in the source, we often imagine it as a fifth LRU page list.
+The Unevictable LRU folio list is a lie. It was never an LRU-ordered
+list, but a companion to the LRU-ordered anonymous and file, active and
+inactive folio lists; and now it is not even a folio list. But following
+familiar convention, here in this document and in the source, we often
+imagine it as a fifth LRU folio list.
The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
-called the "unevictable" list and an associated page flag, PG_unevictable, to
-indicate that the page is being managed on the unevictable list.
+called the "unevictable" list and an associated folio flag, PG_unevictable, to
+indicate that the folio is being managed on the unevictable list.
The PG_unevictable flag is analogous to, and mutually exclusive with, the
-PG_active flag in that it indicates on which LRU list a page resides when
+PG_active flag in that it indicates on which LRU list a folio resides when
PG_lru is set.
-The Unevictable LRU infrastructure maintains unevictable pages as if they were
+The Unevictable LRU infrastructure maintains unevictable folios as if they were
on an additional LRU list for a few reasons:
- (1) We get to "treat unevictable pages just like we treat other pages in the
+ (1) We get to "treat unevictable folios just like we treat other folios in the
system - which means we get to use the same code to manipulate them, the
same code to isolate them (for migrate, etc.), the same code to keep track
of the statistics, etc..." [Rik van Riel]
- (2) We want to be able to migrate unevictable pages between nodes for memory
+ (2) We want to be able to migrate unevictable folios between nodes for memory
defragmentation, workload management and memory hotplug. The Linux kernel
- can only migrate pages that it can successfully isolate from the LRU
+ can only migrate folios that it can successfully isolate from the LRU
lists (or "Movable" pages: outside of consideration here). If we were to
- maintain pages elsewhere than on an LRU-like list, where they can be
- detected by isolate_lru_page(), we would prevent their migration.
+ maintain folios elsewhere than on an LRU-like list, where they can be
+ detected by folio_isolate_lru(), we would prevent their migration.
-The unevictable list does not differentiate between file-backed and anonymous,
-swap-backed pages. This differentiation is only important while the pages are,
-in fact, evictable.
+The unevictable list does not differentiate between file-backed and
+anonymous, swap-backed folios. This differentiation is only important
+while the folios are, in fact, evictable.
The unevictable list benefits from the "arrayification" of the per-node LRU
lists and statistics originally proposed and posted by Christoph Lameter.
@@ -158,7 +159,7 @@ These are currently used in three places in the kernel:
Detecting Unevictable Pages
---------------------------
-The function page_evictable() in mm/internal.h determines whether a page is
+The function folio_evictable() in mm/internal.h determines whether a folio is
evictable or not using the query function outlined above [see section
:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
to check the AS_UNEVICTABLE flag.
@@ -167,7 +168,7 @@ For address spaces that are so marked after being populated (as SHM regions
might be), the lock action (e.g. SHM_LOCK) can be lazy, and need not populate
the page tables for the region as does, for example, mlock(), nor need it make
any special effort to push any pages in the SHM_LOCK'd area to the unevictable
-list. Instead, vmscan will do this if and when it encounters the pages during
+list. Instead, vmscan will do this if and when it encounters the folios during
a reclamation scan.
On an unlock action (such as SHM_UNLOCK), the unlocker (e.g. shmctl()) must scan
@@ -176,41 +177,43 @@ condition is keeping them unevictable. If an unevictable region is destroyed,
the pages are also "rescued" from the unevictable list in the process of
freeing them.
-page_evictable() also checks for mlocked pages by testing an additional page
-flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
-faulted into a VM_LOCKED VMA, or found in a VMA being VM_LOCKED.
+folio_evictable() also checks for mlocked folios by calling
+folio_test_mlocked(), which is set when a folio is faulted into a
+VM_LOCKED VMA, or found in a VMA being VM_LOCKED.
-Vmscan's Handling of Unevictable Pages
---------------------------------------
+Vmscan's Handling of Unevictable Folios
+---------------------------------------
-If unevictable pages are culled in the fault path, or moved to the unevictable
-list at mlock() or mmap() time, vmscan will not encounter the pages until they
+If unevictable folios are culled in the fault path, or moved to the unevictable
+list at mlock() or mmap() time, vmscan will not encounter the folios until they
have become evictable again (via munlock() for example) and have been "rescued"
from the unevictable list. However, there may be situations where we decide,
-for the sake of expediency, to leave an unevictable page on one of the regular
+for the sake of expediency, to leave an unevictable folio on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
-pages in all of the shrink_{active|inactive|page}_list() functions and will
-"cull" such pages that it encounters: that is, it diverts those pages to the
+folios in all of the shrink_{active|inactive|page}_list() functions and will
+"cull" such folios that it encounters: that is, it diverts those folios to the
unevictable list for the memory cgroup and node being scanned.
-There may be situations where a page is mapped into a VM_LOCKED VMA, but the
-page is not marked as PG_mlocked. Such pages will make it all the way to
-shrink_active_list() or shrink_page_list() where they will be detected when
-vmscan walks the reverse map in folio_referenced() or try_to_unmap(). The page
-is culled to the unevictable list when it is released by the shrinker.
+There may be situations where a folio is mapped into a VM_LOCKED VMA,
+but the folio does not have the mlocked flag set. Such folios will make
+it all the way to shrink_active_list() or shrink_page_list() where they
+will be detected when vmscan walks the reverse map in folio_referenced()
+or try_to_unmap(). The folio is culled to the unevictable list when it
+is released by the shrinker.
-To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
-using putback_lru_page() - the inverse operation to isolate_lru_page() - after
-dropping the page lock. Because the condition which makes the page unevictable
-may change once the page is unlocked, __pagevec_lru_add_fn() will recheck the
-unevictable state of a page before placing it on the unevictable list.
+To "cull" an unevictable folio, vmscan simply puts the folio back on
+the LRU list using folio_putback_lru() - the inverse operation to
+folio_isolate_lru() - after dropping the folio lock. Because the
+condition which makes the folio unevictable may change once the folio
+is unlocked, __pagevec_lru_add_fn() will recheck the unevictable state
+of a folio before placing it on the unevictable list.
MLOCKED Pages
=============
-The unevictable page list is also useful for mlock(), in addition to ramfs and
+The unevictable folio list is also useful for mlock(), in addition to ramfs and
SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
NOMMU situations, all mappings are effectively mlocked.
@@ -295,7 +298,7 @@ treated as a no-op and mlock_fixup() simply returns.
If the VMA passes some filtering as described in "Filtering Special VMAs"
below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
off a subset of the VMA if the range does not cover the entire VMA. Any pages
-already present in the VMA are then marked as mlocked by mlock_page() via
+already present in the VMA are then marked as mlocked by mlock_folio() via
mlock_pte_range() via walk_page_range() via mlock_vma_pages_range().
Before returning from the system call, do_mlock() or mlockall() will call
@@ -308,22 +311,22 @@ do end up getting faulted into this VM_LOCKED VMA, they will be handled in the
fault path - which is also how mlock2()'s MLOCK_ONFAULT areas are handled.
For each PTE (or PMD) being faulted into a VMA, the page add rmap function
-calls mlock_vma_page(), which calls mlock_page() when the VMA is VM_LOCKED
+calls mlock_vma_folio(), which calls mlock_folio() when the VMA is VM_LOCKED
(unless it is a PTE mapping of a part of a transparent huge page). Or when
-it is a newly allocated anonymous page, lru_cache_add_inactive_or_unevictable()
-calls mlock_new_page() instead: similar to mlock_page(), but can make better
+it is a newly allocated anonymous page, folio_add_lru_vma() calls
+mlock_new_folio() instead: similar to mlock_folio(), but can make better
judgments, since this page is held exclusively and known not to be on LRU yet.
-mlock_page() sets PageMlocked immediately, then places the page on the CPU's
-mlock pagevec, to batch up the rest of the work to be done under lru_lock by
-__mlock_page(). __mlock_page() sets PageUnevictable, initializes mlock_count
+mlock_folio() sets PG_mlocked immediately, then places the page on the CPU's
+mlock folio batch, to batch up the rest of the work to be done under lru_lock by
+__mlock_folio(). __mlock_folio() sets PG_unevictable, initializes mlock_count
and moves the page to unevictable state ("the unevictable LRU", but with
-mlock_count in place of LRU threading). Or if the page was already PageLRU
-and PageUnevictable and PageMlocked, it simply increments the mlock_count.
+mlock_count in place of LRU threading). Or if the page was already PG_lru
+and PG_unevictable and PG_mlocked, it simply increments the mlock_count.
But in practice that may not work ideally: the page may not yet be on an LRU, or
it may have been temporarily isolated from LRU. In such cases the mlock_count
-field cannot be touched, but will be set to 0 later when __pagevec_lru_add_fn()
+field cannot be touched, but will be set to 0 later when __munlock_folio()
returns the page to "LRU". Races prohibit mlock_count from being set to 1 then:
rather than risk stranding a page indefinitely as unevictable, always err with
mlock_count on the low side, so that when munlocked the page will be rescued to
@@ -370,20 +373,21 @@ Because of the VMA filtering discussed above, VM_LOCKED will not be set in
any "special" VMAs. So, those VMAs will be ignored for munlock.
If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
-specified range. All pages in the VMA are then munlocked by munlock_page() via
+specified range. All pages in the VMA are then munlocked by munlock_folio() via
mlock_pte_range() via walk_page_range() via mlock_vma_pages_range() - the same
function used when mlocking a VMA range, with new flags for the VMA indicating
that it is munlock() being performed.
-munlock_page() uses the mlock pagevec to batch up work to be done under
-lru_lock by __munlock_page(). __munlock_page() decrements the page's
-mlock_count, and when that reaches 0 it clears PageMlocked and clears
-PageUnevictable, moving the page from unevictable state to inactive LRU.
+munlock_folio() uses the mlock pagevec to batch up work to be done
+under lru_lock by __munlock_folio(). __munlock_folio() decrements the
+folio's mlock_count, and when that reaches 0 it clears the mlocked flag
+and clears the unevictable flag, moving the folio from unevictable state
+to the inactive LRU.
-But in practice that may not work ideally: the page may not yet have reached
+But in practice that may not work ideally: the folio may not yet have reached
"the unevictable LRU", or it may have been temporarily isolated from it. In
those cases its mlock_count field is unusable and must be assumed to be 0: so
-that the page will be rescued to an evictable LRU, then perhaps be mlocked
+that the folio will be rescued to an evictable LRU, then perhaps be mlocked
again later if vmscan finds it in a VM_LOCKED VMA.
@@ -410,7 +414,7 @@ However, since mlock_vma_pages_range() starts by setting VM_LOCKED on a VMA,
before mlocking any pages already present, if one of those pages were migrated
before mlock_pte_range() reached it, it would get counted twice in mlock_count.
To prevent that, mlock_vma_pages_range() temporarily marks the VMA as VM_IO,
-so that mlock_vma_page() will skip it.
+so that mlock_vma_folio() will skip it.
To complete page migration, we place the old and new pages back onto the LRU
afterwards. The "unneeded" page - old page on success, new page on failure -
@@ -483,18 +487,19 @@ Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing.
For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
-munlock_vma_page(), which calls munlock_page() when the VMA is VM_LOCKED
+munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page).
-munlock_page() uses the mlock pagevec to batch up work to be done under
-lru_lock by __munlock_page(). __munlock_page() decrements the page's
-mlock_count, and when that reaches 0 it clears PageMlocked and clears
-PageUnevictable, moving the page from unevictable state to inactive LRU.
+munlock_folio() uses the mlock pagevec to batch up work to be done
+under lru_lock by __munlock_folio(). __munlock_folio() decrements the
+folio's mlock_count, and when that reaches 0 it clears the mlocked flag
+and clears the unevictable flag, moving the folio from unevictable state
+to the inactive LRU.
-But in practice that may not work ideally: the page may not yet have reached
+But in practice that may not work ideally: the folio may not yet have reached
"the unevictable LRU", or it may have been temporarily isolated from it. In
those cases its mlock_count field is unusable and must be assumed to be 0: so
-that the page will be rescued to an evictable LRU, then perhaps be mlocked
+that the folio will be rescued to an evictable LRU, then perhaps be mlocked
again later if vmscan finds it in a VM_LOCKED VMA.
@@ -507,7 +512,7 @@ which had been Copied-On-Write from the file pages now being truncated.
Mlocked pages can be munlocked and deleted in this way: like with munmap(),
for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
-munlock_vma_page(), which calls munlock_page() when the VMA is VM_LOCKED
+munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page).
However, if there is a racing munlock(), since mlock_vma_pages_range() starts
@@ -515,7 +520,7 @@ munlocking by clearing VM_LOCKED from a VMA, before munlocking all the pages
present, if one of those pages were unmapped by truncation or hole punch before
mlock_pte_range() reached it, it would not be recognized as mlocked by this VMA,
and would not be counted out of mlock_count. In this rare case, a page may
-still appear as PageMlocked after it has been fully unmapped: and it is left to
+still appear as PG_mlocked after it has been fully unmapped: and it is left to
release_pages() (or __page_cache_release()) to clear it and update statistics
before freeing (this event is counted in /proc/vmstat unevictable_pgs_cleared,
which is usually 0).
@@ -527,7 +532,7 @@ Page Reclaim in shrink_*_list()
vmscan's shrink_active_list() culls any obviously unevictable pages -
i.e. !page_evictable(page) pages - diverting those to the unevictable list.
However, shrink_active_list() only sees unevictable pages that made it onto the
-active/inactive LRU lists. Note that these pages do not have PageUnevictable
+active/inactive LRU lists. Note that these pages do not have PG_unevictable
set - otherwise they would be on the unevictable list and shrink_active_list()
would never see them.
@@ -549,6 +554,6 @@ and node unevictable list.
rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or
shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(),
-check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_page()
+check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_folio()
to correct them. Such pages are culled to the unevictable list when released
by the shrinker.