From f7df2b1cf03af680354bd4300f48f7ea11316ce8 Mon Sep 17 00:00:00 2001 From: Zhenliang Wei Date: Fri, 5 Nov 2021 13:42:55 -0700 Subject: tools/vm/page_owner_sort.c: count and sort by mem When viewing page owner information, we may be more concerned about the total memory rather than the times of stack appears. Therefore, the following adjustments are made: 1. Added the statistics on the total number of pages. 2. Added the optional parameter "-m" to configure the program to sort by memory (total pages). The general output of page_owner is as follows: Page allocated via order XXX, ... PFN XXX ... // Detailed stack Page allocated via order XXX, ... PFN XXX ... // Detailed stack The original page_owner_sort ignores PFN rows, puts the remaining rows in buf, counts the times of buf, and finally sorts them according to the times. General output: XXX times: Page allocated via order XXX, ... // Detailed stack Now, we use regexp to extract the page order value from the buf, and count the total pages for the buf. General output: XXX times, XXX pages: Page allocated via order XXX, ... // Detailed stack By default, it is still sorted by the times of buf; If you want to sort by the pages nums of buf, use the new -m parameter. Link: https://lkml.kernel.org/r/1631678242-41033-1-git-send-email-weizhenliang@huawei.com Signed-off-by: Zhenliang Wei Cc: Tang Bin Cc: Zhang Shengju Cc: Zhenliang Wei Cc: Xiaoming Ni Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/page_owner.rst | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) (limited to 'Documentation/vm') diff --git a/Documentation/vm/page_owner.rst b/Documentation/vm/page_owner.rst index 2175465c9bf2..9837fc8147dd 100644 --- a/Documentation/vm/page_owner.rst +++ b/Documentation/vm/page_owner.rst @@ -85,5 +85,26 @@ Usage cat /sys/kernel/debug/page_owner > page_owner_full.txt ./page_owner_sort page_owner_full.txt sorted_page_owner.txt + The general output of ``page_owner_full.txt`` is as follows: + + Page allocated via order XXX, ... + PFN XXX ... + // Detailed stack + + Page allocated via order XXX, ... + PFN XXX ... + // Detailed stack + + The ``page_owner_sort`` tool ignores ``PFN`` rows, puts the remaining rows + in buf, uses regexp to extract the page order value, counts the times + and pages of buf, and finally sorts them according to the times. + See the result about who allocated each page - in the ``sorted_page_owner.txt``. + in the ``sorted_page_owner.txt``. General output: + + XXX times, XXX pages: + Page allocated via order XXX, ... + // Detailed stack + + By default, ``page_owner_sort`` is sorted according to the times of buf. + If you want to sort by the pages nums of buf, use the ``-m`` parameter. -- cgit v1.2.3 From ad782c48df326eb13cf5dec7aab571b44be3e415 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 5 Nov 2021 13:45:55 -0700 Subject: Documentation/vm: move user guides to admin-guide/mm/ Most memory management user guide documents are in 'admin-guide/mm/', but two of those are in 'vm/'. This moves the two docs into 'admin-guide/mm' for easier documents finding. 
Link: https://lkml.kernel.org/r/20210917123958.3819-2-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/mm/index.rst | 2 + Documentation/admin-guide/mm/swap_numa.rst | 80 +++++++++++++++ Documentation/admin-guide/mm/zswap.rst | 152 +++++++++++++++++++++++++++++ Documentation/vm/index.rst | 26 +---- Documentation/vm/swap_numa.rst | 80 --------------- Documentation/vm/zswap.rst | 152 ----------------------------- 6 files changed, 239 insertions(+), 253 deletions(-) create mode 100644 Documentation/admin-guide/mm/swap_numa.rst create mode 100644 Documentation/admin-guide/mm/zswap.rst delete mode 100644 Documentation/vm/swap_numa.rst delete mode 100644 Documentation/vm/zswap.rst (limited to 'Documentation/vm') diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index cbd19d5e625f..c21b5823f126 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -37,5 +37,7 @@ the Linux memory management. numaperf pagemap soft-dirty + swap_numa transhuge userfaultfd + zswap diff --git a/Documentation/admin-guide/mm/swap_numa.rst b/Documentation/admin-guide/mm/swap_numa.rst new file mode 100644 index 000000000000..e0466f2db8fa --- /dev/null +++ b/Documentation/admin-guide/mm/swap_numa.rst @@ -0,0 +1,80 @@ +.. _swap_numa: + +=========================================== +Automatically bind swap device to numa node +=========================================== + +If the system has more than one swap device and swap device has the node +information, we can make use of this information to decide which swap +device to use in get_swap_pages() to get better performance. + + +How to use this feature +======================= + +Swap device has priority and that decides the order of it to be used. To make +use of automatically binding, there is no need to manipulate priority settings +for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and +swapB, with swapA attached to node 0 and swapB attached to node 1, are going +to be swapped on. Simply swapping them on by doing:: + + # swapon /dev/swapA + # swapon /dev/swapB + +Then node 0 will use the two swap devices in the order of swapA then swapB and +node 1 will use the two swap devices in the order of swapB then swapA. Note +that the order of them being swapped on doesn't matter. + +A more complex example on a 4 node machine. Assume 6 swap devices are going to +be swapped on: swapA and swapB are attached to node 0, swapC is attached to +node 1, swapD and swapE are attached to node 2 and swapF is attached to node3. +The way to swap them on is the same as above:: + + # swapon /dev/swapA + # swapon /dev/swapB + # swapon /dev/swapC + # swapon /dev/swapD + # swapon /dev/swapE + # swapon /dev/swapF + +Then node 0 will use them in the order of:: + + swapA/swapB -> swapC -> swapD -> swapE -> swapF + +swapA and swapB will be used in a round robin mode before any other swap device. + +node 1 will use them in the order of:: + + swapC -> swapA -> swapB -> swapD -> swapE -> swapF + +node 2 will use them in the order of:: + + swapD/swapE -> swapA -> swapB -> swapC -> swapF + +Similaly, swapD and swapE will be used in a round robin mode before any +other swap devices. 
+ +node 3 will use them in the order of:: + + swapF -> swapA -> swapB -> swapC -> swapD -> swapE + + +Implementation details +====================== + +The current code uses a priority based list, swap_avail_list, to decide +which swap device to use and if multiple swap devices share the same +priority, they are used round robin. This change here replaces the single +global swap_avail_list with a per-numa-node list, i.e. for each numa node, +it sees its own priority based list of available swap devices. Swap +device's priority can be promoted on its matching node's swap_avail_list. + +The current swap device's priority is set as: user can set a >=0 value, +or the system will pick one starting from -1 then downwards. The priority +value in the swap_avail_list is the negated value of the swap device's +due to plist being sorted from low to high. The new policy doesn't change +the semantics for priority >=0 cases, the previous starting from -1 then +downwards now becomes starting from -2 then downwards and -1 is reserved +as the promoted value. So if multiple swap devices are attached to the same +node, they will all be promoted to priority -1 on that node's plist and will +be used round robin before any other swap devices. diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst new file mode 100644 index 000000000000..8edb8d578caf --- /dev/null +++ b/Documentation/admin-guide/mm/zswap.rst @@ -0,0 +1,152 @@ +.. _zswap: + +===== +zswap +===== + +Overview +======== + +Zswap is a lightweight compressed cache for swap pages. It takes pages that are +in the process of being swapped out and attempts to compress them into a +dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles +for potentially reduced swap I/O. This trade-off can also result in a +significant performance improvement if reads from the compressed cache are +faster than reads from a swap device. + +.. note:: + Zswap is a new feature as of v3.11 and interacts heavily with memory + reclaim. This interaction has not been fully explored on the large set of + potential configurations and workloads that exist. For this reason, zswap + is a work in progress and should be considered experimental. + + Some potential benefits: + +* Desktop/laptop users with limited RAM capacities can mitigate the + performance impact of swapping. +* Overcommitted guests that share a common I/O resource can + dramatically reduce their swap I/O pressure, avoiding heavy handed I/O + throttling by the hypervisor. This allows more work to get done with less + impact to the guest workload and guests sharing the I/O subsystem +* Users with SSDs as swap devices can extend the life of the device by + drastically reducing life-shortening writes. + +Zswap evicts pages from compressed cache on an LRU basis to the backing swap +device when the compressed pool reaches its size limit. This requirement had +been identified in prior community discussions. + +Whether Zswap is enabled at the boot time depends on whether +the ``CONFIG_ZSWAP_DEFAULT_ON`` Kconfig option is enabled or not. +This setting can then be overridden by providing the kernel command line +``zswap.enabled=`` option, for example ``zswap.enabled=0``. +Zswap can also be enabled and disabled at runtime using the sysfs interface. 
+An example command to enable zswap at runtime, assuming sysfs is mounted +at ``/sys``, is:: + + echo 1 > /sys/module/zswap/parameters/enabled + +When zswap is disabled at runtime it will stop storing pages that are +being swapped out. However, it will _not_ immediately write out or fault +back into memory all of the pages stored in the compressed pool. The +pages stored in zswap will remain in the compressed pool until they are +either invalidated or faulted back into memory. In order to force all +pages out of the compressed pool, a swapoff on the swap device(s) will +fault back into memory all swapped out pages, including those in the +compressed pool. + +Design +====== + +Zswap receives pages for compression through the Frontswap API and is able to +evict pages from its own compressed pool on an LRU basis and write them back to +the backing swap device in the case that the compressed pool is full. + +Zswap makes use of zpool for the managing the compressed memory pool. Each +allocation in zpool is not directly accessible by address. Rather, a handle is +returned by the allocation routine and that handle must be mapped before being +accessed. The compressed memory pool grows on demand and shrinks as compressed +pages are freed. The pool is not preallocated. By default, a zpool +of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created, +but it can be overridden at boot time by setting the ``zpool`` attribute, +e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs +``zpool`` attribute, e.g.:: + + echo zbud > /sys/module/zswap/parameters/zpool + +The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which +means the compression ratio will always be 2:1 or worse (because of half-full +zbud pages). The zsmalloc type zpool has a more complex compressed page +storage method, and it can achieve greater storage densities. However, +zsmalloc does not implement compressed page eviction, so once zswap fills it +cannot evict the oldest page, it can only reject new pages. + +When a swap page is passed from frontswap to zswap, zswap maintains a mapping +of the swap entry, a combination of the swap type and swap offset, to the zpool +handle that references that compressed swap page. This mapping is achieved +with a red-black tree per swap type. The swap offset is the search key for the +tree nodes. + +During a page fault on a PTE that is a swap entry, frontswap calls the zswap +load function to decompress the page into the page allocated by the page fault +handler. + +Once there are no PTEs referencing a swap page stored in zswap (i.e. the count +in the swap_map goes to 0) the swap code calls the zswap invalidate function, +via frontswap, to free the compressed entry. + +Zswap seeks to be simple in its policies. Sysfs attributes allow for one user +controlled policy: + +* max_pool_percent - The maximum percentage of memory that the compressed + pool can occupy. + +The default compressor is selected in ``CONFIG_ZSWAP_COMPRESSOR_DEFAULT`` +Kconfig option, but it can be overridden at boot time by setting the +``compressor`` attribute, e.g. ``zswap.compressor=lzo``. +It can also be changed at runtime using the sysfs "compressor" +attribute, e.g.:: + + echo lzo > /sys/module/zswap/parameters/compressor + +When the zpool and/or compressor parameter is changed at runtime, any existing +compressed pages are not modified; they are left in their own zpool. 
When a +request is made for a page in an old zpool, it is uncompressed using its +original compressor. Once all pages are removed from an old zpool, the zpool +and its compressor are freed. + +Some of the pages in zswap are same-value filled pages (i.e. contents of the +page have same value or repetitive pattern). These pages include zero-filled +pages and they are handled differently. During store operation, a page is +checked if it is a same-value filled page before compressing it. If true, the +compressed length of the page is set to zero and the pattern or same-filled +value is stored. + +Same-value filled pages identification feature is enabled by default and can be +disabled at boot time by setting the ``same_filled_pages_enabled`` attribute +to 0, e.g. ``zswap.same_filled_pages_enabled=0``. It can also be enabled and +disabled at runtime using the sysfs ``same_filled_pages_enabled`` +attribute, e.g.:: + + echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled + +When zswap same-filled page identification is disabled at runtime, it will stop +checking for the same-value filled pages during store operation. However, the +existing pages which are marked as same-value filled pages remain stored +unchanged in zswap until they are either loaded or invalidated. + +To prevent zswap from shrinking pool when zswap is full and there's a high +pressure on swap (this will result in flipping pages in and out zswap pool +without any real benefit but with a performance drop for the system), a +special parameter has been introduced to implement a sort of hysteresis to +refuse taking pages into zswap pool until it has sufficient space if the limit +has been hit. To set the threshold at which zswap would start accepting pages +again after it became full, use the sysfs ``accept_threshold_percent`` +attribute, e. g.:: + + echo 80 > /sys/module/zswap/parameters/accept_threshold_percent + +Setting this parameter to 100 will disable the hysteresis. + +A debugfs interface is provided for various statistic about pool size, number +of pages stored, same-value filled pages and various counters for the reasons +pages are rejected. diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index b51f0d8992f8..6f5ffef4b716 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -3,27 +3,11 @@ Linux Memory Management Documentation ===================================== This is a collection of documents about the Linux memory management (mm) -subsystem. If you are looking for advice on simply allocating memory, -see the :ref:`memory_allocation`. - -User guides for MM features -=========================== - -The following documents provide guides for controlling and tuning -various features of the Linux memory management - -.. toctree:: - :maxdepth: 1 - - swap_numa - zswap - -Kernel developers MM documentation -================================== - -The below documents describe MM internals with different level of -details ranging from notes and mailing list responses to elaborate -descriptions of data structures and algorithms. +subsystem internals with different level of details ranging from notes and +mailing list responses for elaborating descriptions of data structures and +algorithms. If you are looking for advice on simply allocating memory, see the +:ref:`memory_allocation`. For controlling and tuning guides, see the +:doc:`admin guide <../admin-guide/mm/index>`. .. 
toctree:: :maxdepth: 1 diff --git a/Documentation/vm/swap_numa.rst b/Documentation/vm/swap_numa.rst deleted file mode 100644 index e0466f2db8fa..000000000000 --- a/Documentation/vm/swap_numa.rst +++ /dev/null @@ -1,80 +0,0 @@ -.. _swap_numa: - -=========================================== -Automatically bind swap device to numa node -=========================================== - -If the system has more than one swap device and swap device has the node -information, we can make use of this information to decide which swap -device to use in get_swap_pages() to get better performance. - - -How to use this feature -======================= - -Swap device has priority and that decides the order of it to be used. To make -use of automatically binding, there is no need to manipulate priority settings -for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and -swapB, with swapA attached to node 0 and swapB attached to node 1, are going -to be swapped on. Simply swapping them on by doing:: - - # swapon /dev/swapA - # swapon /dev/swapB - -Then node 0 will use the two swap devices in the order of swapA then swapB and -node 1 will use the two swap devices in the order of swapB then swapA. Note -that the order of them being swapped on doesn't matter. - -A more complex example on a 4 node machine. Assume 6 swap devices are going to -be swapped on: swapA and swapB are attached to node 0, swapC is attached to -node 1, swapD and swapE are attached to node 2 and swapF is attached to node3. -The way to swap them on is the same as above:: - - # swapon /dev/swapA - # swapon /dev/swapB - # swapon /dev/swapC - # swapon /dev/swapD - # swapon /dev/swapE - # swapon /dev/swapF - -Then node 0 will use them in the order of:: - - swapA/swapB -> swapC -> swapD -> swapE -> swapF - -swapA and swapB will be used in a round robin mode before any other swap device. - -node 1 will use them in the order of:: - - swapC -> swapA -> swapB -> swapD -> swapE -> swapF - -node 2 will use them in the order of:: - - swapD/swapE -> swapA -> swapB -> swapC -> swapF - -Similaly, swapD and swapE will be used in a round robin mode before any -other swap devices. - -node 3 will use them in the order of:: - - swapF -> swapA -> swapB -> swapC -> swapD -> swapE - - -Implementation details -====================== - -The current code uses a priority based list, swap_avail_list, to decide -which swap device to use and if multiple swap devices share the same -priority, they are used round robin. This change here replaces the single -global swap_avail_list with a per-numa-node list, i.e. for each numa node, -it sees its own priority based list of available swap devices. Swap -device's priority can be promoted on its matching node's swap_avail_list. - -The current swap device's priority is set as: user can set a >=0 value, -or the system will pick one starting from -1 then downwards. The priority -value in the swap_avail_list is the negated value of the swap device's -due to plist being sorted from low to high. The new policy doesn't change -the semantics for priority >=0 cases, the previous starting from -1 then -downwards now becomes starting from -2 then downwards and -1 is reserved -as the promoted value. So if multiple swap devices are attached to the same -node, they will all be promoted to priority -1 on that node's plist and will -be used round robin before any other swap devices. 
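The swap priority rules spelled out in the implementation details above can be observed from user space. A minimal sketch, reusing the placeholder device names from the examples and relying only on the standard swapon(8) and /proc/swaps interfaces (the comments restate the documented policy, not output captured from a real system)::

    # An explicit >=0 priority keeps its old semantics.
    swapon -p 5 /dev/swapA

    # Without -p, automatic priorities now start from -2 and go downwards;
    # -1 is reserved as the promoted value on the matching node's list.
    swapon /dev/swapB

    # The priorities that were assigned show up in the Priority column.
    cat /proc/swaps
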
diff --git a/Documentation/vm/zswap.rst b/Documentation/vm/zswap.rst deleted file mode 100644 index 8edb8d578caf..000000000000 --- a/Documentation/vm/zswap.rst +++ /dev/null @@ -1,152 +0,0 @@ -.. _zswap: - -===== -zswap -===== - -Overview -======== - -Zswap is a lightweight compressed cache for swap pages. It takes pages that are -in the process of being swapped out and attempts to compress them into a -dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles -for potentially reduced swap I/O. This trade-off can also result in a -significant performance improvement if reads from the compressed cache are -faster than reads from a swap device. - -.. note:: - Zswap is a new feature as of v3.11 and interacts heavily with memory - reclaim. This interaction has not been fully explored on the large set of - potential configurations and workloads that exist. For this reason, zswap - is a work in progress and should be considered experimental. - - Some potential benefits: - -* Desktop/laptop users with limited RAM capacities can mitigate the - performance impact of swapping. -* Overcommitted guests that share a common I/O resource can - dramatically reduce their swap I/O pressure, avoiding heavy handed I/O - throttling by the hypervisor. This allows more work to get done with less - impact to the guest workload and guests sharing the I/O subsystem -* Users with SSDs as swap devices can extend the life of the device by - drastically reducing life-shortening writes. - -Zswap evicts pages from compressed cache on an LRU basis to the backing swap -device when the compressed pool reaches its size limit. This requirement had -been identified in prior community discussions. - -Whether Zswap is enabled at the boot time depends on whether -the ``CONFIG_ZSWAP_DEFAULT_ON`` Kconfig option is enabled or not. -This setting can then be overridden by providing the kernel command line -``zswap.enabled=`` option, for example ``zswap.enabled=0``. -Zswap can also be enabled and disabled at runtime using the sysfs interface. -An example command to enable zswap at runtime, assuming sysfs is mounted -at ``/sys``, is:: - - echo 1 > /sys/module/zswap/parameters/enabled - -When zswap is disabled at runtime it will stop storing pages that are -being swapped out. However, it will _not_ immediately write out or fault -back into memory all of the pages stored in the compressed pool. The -pages stored in zswap will remain in the compressed pool until they are -either invalidated or faulted back into memory. In order to force all -pages out of the compressed pool, a swapoff on the swap device(s) will -fault back into memory all swapped out pages, including those in the -compressed pool. - -Design -====== - -Zswap receives pages for compression through the Frontswap API and is able to -evict pages from its own compressed pool on an LRU basis and write them back to -the backing swap device in the case that the compressed pool is full. - -Zswap makes use of zpool for the managing the compressed memory pool. Each -allocation in zpool is not directly accessible by address. Rather, a handle is -returned by the allocation routine and that handle must be mapped before being -accessed. The compressed memory pool grows on demand and shrinks as compressed -pages are freed. The pool is not preallocated. By default, a zpool -of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created, -but it can be overridden at boot time by setting the ``zpool`` attribute, -e.g. ``zswap.zpool=zbud``. 
It can also be changed at runtime using the sysfs -``zpool`` attribute, e.g.:: - - echo zbud > /sys/module/zswap/parameters/zpool - -The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which -means the compression ratio will always be 2:1 or worse (because of half-full -zbud pages). The zsmalloc type zpool has a more complex compressed page -storage method, and it can achieve greater storage densities. However, -zsmalloc does not implement compressed page eviction, so once zswap fills it -cannot evict the oldest page, it can only reject new pages. - -When a swap page is passed from frontswap to zswap, zswap maintains a mapping -of the swap entry, a combination of the swap type and swap offset, to the zpool -handle that references that compressed swap page. This mapping is achieved -with a red-black tree per swap type. The swap offset is the search key for the -tree nodes. - -During a page fault on a PTE that is a swap entry, frontswap calls the zswap -load function to decompress the page into the page allocated by the page fault -handler. - -Once there are no PTEs referencing a swap page stored in zswap (i.e. the count -in the swap_map goes to 0) the swap code calls the zswap invalidate function, -via frontswap, to free the compressed entry. - -Zswap seeks to be simple in its policies. Sysfs attributes allow for one user -controlled policy: - -* max_pool_percent - The maximum percentage of memory that the compressed - pool can occupy. - -The default compressor is selected in ``CONFIG_ZSWAP_COMPRESSOR_DEFAULT`` -Kconfig option, but it can be overridden at boot time by setting the -``compressor`` attribute, e.g. ``zswap.compressor=lzo``. -It can also be changed at runtime using the sysfs "compressor" -attribute, e.g.:: - - echo lzo > /sys/module/zswap/parameters/compressor - -When the zpool and/or compressor parameter is changed at runtime, any existing -compressed pages are not modified; they are left in their own zpool. When a -request is made for a page in an old zpool, it is uncompressed using its -original compressor. Once all pages are removed from an old zpool, the zpool -and its compressor are freed. - -Some of the pages in zswap are same-value filled pages (i.e. contents of the -page have same value or repetitive pattern). These pages include zero-filled -pages and they are handled differently. During store operation, a page is -checked if it is a same-value filled page before compressing it. If true, the -compressed length of the page is set to zero and the pattern or same-filled -value is stored. - -Same-value filled pages identification feature is enabled by default and can be -disabled at boot time by setting the ``same_filled_pages_enabled`` attribute -to 0, e.g. ``zswap.same_filled_pages_enabled=0``. It can also be enabled and -disabled at runtime using the sysfs ``same_filled_pages_enabled`` -attribute, e.g.:: - - echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled - -When zswap same-filled page identification is disabled at runtime, it will stop -checking for the same-value filled pages during store operation. However, the -existing pages which are marked as same-value filled pages remain stored -unchanged in zswap until they are either loaded or invalidated. 
- -To prevent zswap from shrinking pool when zswap is full and there's a high -pressure on swap (this will result in flipping pages in and out zswap pool -without any real benefit but with a performance drop for the system), a -special parameter has been introduced to implement a sort of hysteresis to -refuse taking pages into zswap pool until it has sufficient space if the limit -has been hit. To set the threshold at which zswap would start accepting pages -again after it became full, use the sysfs ``accept_threshold_percent`` -attribute, e. g.:: - - echo 80 > /sys/module/zswap/parameters/accept_threshold_percent - -Setting this parameter to 100 will disable the hysteresis. - -A debugfs interface is provided for various statistic about pool size, number -of pages stored, same-value filled pages and various counters for the reasons -pages are rejected. -- cgit v1.2.3 From 876d0aac2e3af10fbaf1c7a814840c71e470dc5c Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 5 Nov 2021 13:46:01 -0700 Subject: docs/vm/damon: remove broken reference Building DAMON documents warns for a reference to nonexisting doc, as below: $ time make htmldocs [...] Documentation/vm/damon/index.rst:24: WARNING: toctree contains reference to nonexisting document 'vm/damon/plans' This fixes the warning by removing the wrong reference. Link: https://lkml.kernel.org/r/20210917123958.3819-4-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/damon/index.rst | 1 - 1 file changed, 1 deletion(-) (limited to 'Documentation/vm') diff --git a/Documentation/vm/damon/index.rst b/Documentation/vm/damon/index.rst index a2858baf3bf1..48c0bbff98b2 100644 --- a/Documentation/vm/damon/index.rst +++ b/Documentation/vm/damon/index.rst @@ -27,4 +27,3 @@ workloads and systems. faq design api - plans -- cgit v1.2.3 From c638072107f52ec35f292c97b6f3df9b9f2ed87d Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 5 Nov 2021 13:47:03 -0700 Subject: Docs/DAMON: document physical memory monitoring support This updates the DAMON documents for the physical memory address space monitoring support. Link: https://lkml.kernel.org/r/20211012205711.29216-8-sj@kernel.org Signed-off-by: SeongJae Park Cc: Amit Shah Cc: Benjamin Herrenschmidt Cc: Brendan Higgins Cc: David Hildenbrand Cc: David Rienjes Cc: David Woodhouse Cc: Greg Thelen Cc: Jonathan Cameron Cc: Jonathan Corbet Cc: Leonard Foerster Cc: Marco Elver Cc: Markus Boehme Cc: Shakeel Butt Cc: Shuah Khan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/mm/damon/usage.rst | 25 +++++++++++++++++++----- Documentation/vm/damon/design.rst | 29 +++++++++++++++++----------- Documentation/vm/damon/faq.rst | 5 ++--- 3 files changed, 40 insertions(+), 19 deletions(-) (limited to 'Documentation/vm') diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index f7d5cfbb50c2..ed96bbf0daff 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -10,15 +10,16 @@ DAMON provides below three interfaces for different users. This is for privileged people such as system administrators who want a just-working human-friendly interface. Using this, users can use the DAMON’s major features in a human-friendly way. It may not be highly tuned for - special cases, though. It supports only virtual address spaces monitoring. + special cases, though. 
It supports both virtual and physical address spaces + monitoring. - *debugfs interface.* This is for privileged user space programmers who want more optimized use of DAMON. Using this, users can use DAMON’s major features by reading from and writing to special debugfs files. Therefore, you can write and use your personalized DAMON debugfs wrapper programs that reads/writes the debugfs files instead of you. The DAMON user space tool is also a reference - implementation of such programs. It supports only virtual address spaces - monitoring. + implementation of such programs. It supports both virtual and physical + address spaces monitoring. - *Kernel Space Programming Interface.* This is for kernel space programmers. Using this, users can utilize every feature of DAMON most flexibly and efficiently by writing kernel space @@ -72,20 +73,34 @@ check it again:: # cat target_ids 42 4242 +Users can also monitor the physical memory address space of the system by +writing a special keyword, "``paddr\n``" to the file. Because physical address +space monitoring doesn't support multiple targets, reading the file will show a +fake value, ``42``, as below:: + + # cd /damon + # echo paddr > target_ids + # cat target_ids + 42 + Note that setting the target ids doesn't start the monitoring. Initial Monitoring Target Regions --------------------------------- -In case of the debugfs based monitoring, DAMON automatically sets and updates -the monitoring target regions so that entire memory mappings of target +In case of the virtual address space monitoring, DAMON automatically sets and +updates the monitoring target regions so that entire memory mappings of target processes can be covered. However, users can want to limit the monitoring region to specific address ranges, such as the heap, the stack, or specific file-mapped area. Or, some users can know the initial access pattern of their workloads and therefore want to set optimal initial regions for the 'adaptive regions adjustment'. +In contrast, DAMON do not automatically sets and updates the monitoring target +regions in case of physical memory monitoring. Therefore, users should set the +monitoring target regions by themselves. + In such cases, users can explicitly set the initial monitoring target regions as they want, by writing proper values to the ``init_regions`` file. Each line of the input should represent one region in below form.:: diff --git a/Documentation/vm/damon/design.rst b/Documentation/vm/damon/design.rst index b05159c295f4..210f0f50efd8 100644 --- a/Documentation/vm/damon/design.rst +++ b/Documentation/vm/damon/design.rst @@ -35,13 +35,17 @@ two parts: 1. Identification of the monitoring target address range for the address space. 2. Access check of specific address range in the target space. -DAMON currently provides the implementation of the primitives for only the -virtual address spaces. Below two subsections describe how it works. +DAMON currently provides the implementations of the primitives for the physical +and virtual address spaces. Below two subsections describe how those work. VMA-based Target Address Range Construction ------------------------------------------- +This is only for the virtual address space primitives implementation. That for +the physical address space simply asks users to manually set the monitoring +target address ranges. + Only small parts in the super-huge virtual address space of the processes are mapped to the physical memory and accessed. Thus, tracking the unmapped address regions is just wasteful. 
However, because DAMON can deal with some @@ -71,15 +75,18 @@ to make a reasonable trade-off. Below shows this in detail:: PTE Accessed-bit Based Access Check ----------------------------------- -The implementation for the virtual address space uses PTE Accessed-bit for -basic access checks. It finds the relevant PTE Accessed bit from the address -by walking the page table for the target task of the address. In this way, the -implementation finds and clears the bit for next sampling target address and -checks whether the bit set again after one sampling period. This could disturb -other kernel subsystems using the Accessed bits, namely Idle page tracking and -the reclaim logic. To avoid such disturbances, DAMON makes it mutually -exclusive with Idle page tracking and uses ``PG_idle`` and ``PG_young`` page -flags to solve the conflict with the reclaim logic, as Idle page tracking does. +Both of the implementations for physical and virtual address spaces use PTE +Accessed-bit for basic access checks. Only one difference is the way of +finding the relevant PTE Accessed bit(s) from the address. While the +implementation for the virtual address walks the page table for the target task +of the address, the implementation for the physical address walks every page +table having a mapping to the address. In this way, the implementations find +and clear the bit(s) for next sampling target address and checks whether the +bit(s) set again after one sampling period. This could disturb other kernel +subsystems using the Accessed bits, namely Idle page tracking and the reclaim +logic. To avoid such disturbances, DAMON makes it mutually exclusive with Idle +page tracking and uses ``PG_idle`` and ``PG_young`` page flags to solve the +conflict with the reclaim logic, as Idle page tracking does. Address Space Independent Core Mechanisms diff --git a/Documentation/vm/damon/faq.rst b/Documentation/vm/damon/faq.rst index cb3d8b585a8b..11aea40eb328 100644 --- a/Documentation/vm/damon/faq.rst +++ b/Documentation/vm/damon/faq.rst @@ -36,10 +36,9 @@ constructions and actual access checks can be implemented and configured on the DAMON core by the users. In this way, DAMON users can monitor any address space with any access check technique. -Nonetheless, DAMON provides vma tracking and PTE Accessed bit check based +Nonetheless, DAMON provides vma/rmap tracking and PTE Accessed bit check based implementations of the address space dependent functions for the virtual memory -by default, for a reference and convenient use. In near future, we will -provide those for physical memory address space. +and the physical memory by default, for a reference and convenient use. Can I simply monitor page granularity? -- cgit v1.2.3
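To make the physical address space monitoring described in the usage hunk above concrete, a minimal debugfs sketch follows. It assumes debugfs is mounted at ``/sys/kernel/debug``; the ``init_regions`` line format (``<target id> <start address> <end address>``) and the ``monitor_on`` switch come from the full usage document rather than from the hunks quoted here, and the address range is purely illustrative::

    # Sketch only; see the assumptions stated above.
    cd /sys/kernel/debug/damon

    # Select physical address space monitoring; target_ids then reads back
    # the fake id 42.
    echo paddr > target_ids
    cat target_ids

    # Physical monitoring does not set regions automatically, so give DAMON
    # one explicit physical range to watch (example range only).
    echo "42 0x100000000 0x200000000" > init_regions

    # Start monitoring.
    echo on > monitor_on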