From fe2cce15d6821aea1766708a1cf031071cec815f Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Wed, 24 Feb 2021 12:01:22 -0800 Subject: mm, slub: remove slub_memcg_sysfs boot param and CONFIG_SLUB_MEMCG_SYSFS_ON The boot param and config determine the value of memcg_sysfs_enabled, which is unused since commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations") as there are no per-memcg kmem caches anymore. Link: https://lkml.kernel.org/r/20210127124745.7928-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka Reviewed-by: David Hildenbrand Acked-by: Roman Gushchin Acked-by: David Rientjes Reviewed-by: Miaohe Lin Cc: Christoph Lameter Cc: Pekka Enberg Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/kernel-parameters.txt | 8 -------- 1 file changed, 8 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0ac883777318..19cfa8b9eb60 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4899,14 +4899,6 @@ last alloc / free. For more information see Documentation/vm/slub.rst. - slub_memcg_sysfs= [MM, SLUB] - Determines whether to enable sysfs directories for - memory cgroup sub-caches. 1 to enable, 0 to disable. - The default is determined by CONFIG_SLUB_MEMCG_SYSFS_ON. - Enabling this can lead to a very high number of debug - directories and files being created under - /sys/kernel/slub. - slub_max_order= [MM, SLUB] Determines the maximum allowed order for slabs. A high setting may cause OOMs due to memory -- cgit v1.2.3 From b6038942480e574c697ea1a80019bbe586c1d654 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 24 Feb 2021 12:03:55 -0800 Subject: mm: memcg: add swapcache stat for memcg v2 This patch adds swapcache stat for the cgroup v2. The swapcache represents the memory that is accounted against both the memory and the swap limit of the cgroup. The main motivation behind exposing the swapcache stat is for enabling users to gracefully migrate from cgroup v1's memsw counter to cgroup v2's memory and swap counters. Cgroup v1's memsw limit allows users to limit the memory+swap usage of a workload but without control on the exact proportion of memory and swap. Cgroup v2 provides separate limits for memory and swap which enables more control on the exact usage of memory and swap individually for the workload. With some little subtleties, the v1's memsw limit can be switched with the sum of the v2's memory and swap limits. However the alternative for memsw usage is not yet available in cgroup v2. Exposing per-cgroup swapcache stat enables that alternative. Adding the memory usage and swap usage and subtracting the swapcache will approximate the memsw usage. This will help in the transparent migration of the workloads depending on memsw usage and limit to v2' memory and swap counters. The reasons these applications are still interested in this approximate memsw usage are: (1) these applications are not really interested in two separate memory and swap usage metrics. A single usage metric is more simple to use and reason about for them. (2) The memsw usage metric hides the underlying system's swap setup from the applications. Applications with multiple instances running in a datacenter with heterogeneous systems (some have swap and some don't) will keep seeing a consistent view of their usage. [akpm@linux-foundation.org: fix CONFIG_SWAP=n build] Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com Signed-off-by: Shakeel Butt Acked-by: Michal Hocko Reviewed-by: Roman Gushchin Cc: Johannes Weiner Cc: Muchun Song Cc: Yang Shi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/cgroup-v2.rst | 4 ++++ drivers/base/node.c | 6 ++++++ include/linux/mmzone.h | 3 +++ include/linux/swap.h | 6 +++++- mm/memcontrol.c | 3 +++ mm/migrate.c | 6 ++++++ mm/swap_state.c | 28 ++-------------------------- mm/vmstat.c | 3 +++ 8 files changed, 32 insertions(+), 27 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index c513eafaddea..9acc57f68f31 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1299,6 +1299,10 @@ PAGE_SIZE multiple when read back. Amount of cached filesystem data that was modified and is currently being written back to disk + swapcached + Amount of swap cached in memory. The swapcache is accounted + against both memory and swap usage. + anon_thp Amount of memory used in anonymous mappings backed by transparent hugepages diff --git a/drivers/base/node.c b/drivers/base/node.c index d02d86aec19f..f449dbb2c746 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -372,14 +372,19 @@ static ssize_t node_read_meminfo(struct device *dev, struct pglist_data *pgdat = NODE_DATA(nid); struct sysinfo i; unsigned long sreclaimable, sunreclaimable; + unsigned long swapcached = 0; si_meminfo_node(&i, nid); sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B); sunreclaimable = node_page_state_pages(pgdat, NR_SLAB_UNRECLAIMABLE_B); +#ifdef CONFIG_SWAP + swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE); +#endif len = sysfs_emit_at(buf, len, "Node %d MemTotal: %8lu kB\n" "Node %d MemFree: %8lu kB\n" "Node %d MemUsed: %8lu kB\n" + "Node %d SwapCached: %8lu kB\n" "Node %d Active: %8lu kB\n" "Node %d Inactive: %8lu kB\n" "Node %d Active(anon): %8lu kB\n" @@ -391,6 +396,7 @@ static ssize_t node_read_meminfo(struct device *dev, nid, K(i.totalram), nid, K(i.freeram), nid, K(i.totalram - i.freeram), + nid, K(swapcached), nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) + node_page_state(pgdat, NR_ACTIVE_FILE)), nid, K(node_page_state(pgdat, NR_INACTIVE_ANON) + diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 66d68e5d5a0f..fc99e9241846 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -206,6 +206,9 @@ enum node_stat_item { NR_KERNEL_SCS_KB, /* measured in KiB */ #endif NR_PAGETABLE, /* used for pagetables */ +#ifdef CONFIG_SWAP + NR_SWAPCACHE, +#endif NR_VM_NODE_STAT_ITEMS }; diff --git a/include/linux/swap.h b/include/linux/swap.h index 3f1f7ae0fbe9..0d64a2bc7e00 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -408,7 +408,11 @@ extern struct address_space *swapper_spaces[]; #define swap_address_space(entry) \ (&swapper_spaces[swp_type(entry)][swp_offset(entry) \ >> SWAP_ADDRESS_SPACE_SHIFT]) -extern unsigned long total_swapcache_pages(void); +static inline unsigned long total_swapcache_pages(void) +{ + return global_node_page_state(NR_SWAPCACHE); +} + extern void show_swap_cache_info(void); extern int add_to_swap(struct page *page); extern void *get_shadow_from_swap_cache(swp_entry_t entry); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5435b370c929..58edfb6614b8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1521,6 +1521,9 @@ static const struct memory_stat memory_stats[] = { { "file_mapped", NR_FILE_MAPPED }, { "file_dirty", NR_FILE_DIRTY }, { "file_writeback", NR_WRITEBACK }, +#ifdef CONFIG_SWAP + { "swapcached", NR_SWAPCACHE }, +#endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE { "anon_thp", NR_ANON_THPS }, { "file_thp", NR_FILE_THPS }, diff --git a/mm/migrate.c b/mm/migrate.c index 4ec727adfaa2..62b81d5257aa 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -500,6 +500,12 @@ int migrate_page_move_mapping(struct address_space *mapping, __mod_lruvec_state(old_lruvec, NR_SHMEM, -nr); __mod_lruvec_state(new_lruvec, NR_SHMEM, nr); } +#ifdef CONFIG_SWAP + if (PageSwapCache(page)) { + __mod_lruvec_state(old_lruvec, NR_SWAPCACHE, -nr); + __mod_lruvec_state(new_lruvec, NR_SWAPCACHE, nr); + } +#endif if (dirty && mapping_can_writeback(mapping)) { __mod_lruvec_state(old_lruvec, NR_FILE_DIRTY, -nr); __mod_zone_page_state(oldzone, NR_ZONE_WRITE_PENDING, -nr); diff --git a/mm/swap_state.c b/mm/swap_state.c index 807b00eae969..c1a648d9092b 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -68,32 +68,6 @@ static struct { unsigned long find_total; } swap_cache_info; -unsigned long total_swapcache_pages(void) -{ - unsigned int i, j, nr; - unsigned long ret = 0; - struct address_space *spaces; - struct swap_info_struct *si; - - for (i = 0; i < MAX_SWAPFILES; i++) { - swp_entry_t entry = swp_entry(i, 1); - - /* Avoid get_swap_device() to warn for bad swap entry */ - if (!swp_swap_info(entry)) - continue; - /* Prevent swapoff to free swapper_spaces */ - si = get_swap_device(entry); - if (!si) - continue; - nr = nr_swapper_spaces[i]; - spaces = swapper_spaces[i]; - for (j = 0; j < nr; j++) - ret += spaces[j].nrpages; - put_swap_device(si); - } - return ret; -} - static atomic_t swapin_readahead_hits = ATOMIC_INIT(4); void show_swap_cache_info(void) @@ -163,6 +137,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, address_space->nrexceptional -= nr_shadows; address_space->nrpages += nr; __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr); + __mod_lruvec_page_state(page, NR_SWAPCACHE, nr); ADD_CACHE_INFO(add_total, nr); unlock: xas_unlock_irq(&xas); @@ -203,6 +178,7 @@ void __delete_from_swap_cache(struct page *page, address_space->nrexceptional += nr; address_space->nrpages -= nr; __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr); + __mod_lruvec_page_state(page, NR_SWAPCACHE, -nr); ADD_CACHE_INFO(del_total, nr); } diff --git a/mm/vmstat.c b/mm/vmstat.c index 07dc0af50cf0..a0e949542204 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1215,6 +1215,9 @@ const char * const vmstat_text[] = { "nr_shadow_call_stack", #endif "nr_page_table_pages", +#ifdef CONFIG_SWAP + "nr_swapcached", +#endif /* enum writeback_stat_item counters */ "nr_dirty_threshold", -- cgit v1.2.3 From 519983645a9f2ec339cabfa0c6ef7b09be985dd0 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Wed, 24 Feb 2021 12:09:15 -0800 Subject: mm/vmscan: restore zone_reclaim_mode ABI I went to go add a new RECLAIM_* mode for the zone_reclaim_mode sysctl. Like a good kernel developer, I also went to go update the documentation. I noticed that the bits in the documentation didn't match the bits in the #defines. The VM never explicitly checks the RECLAIM_ZONE bit. The bit is, however implicitly checked when checking 'node_reclaim_mode==0'. The RECLAIM_ZONE #define was removed in a cleanup. That, by itself is fine. But, when the bit was removed (bit 0) the _other_ bit locations also got changed. That's not OK because the bit values are documented to mean one specific thing. Users surely do not expect the meaning to change from kernel to kernel. The end result is that if someone had a script that did: sysctl vm.zone_reclaim_mode=1 it would have gone from enabling node reclaim for clean unmapped pages to writing out pages during node reclaim after the commit in question. That's not great. Put the bits back the way they were and add a comment so something like this is a bit harder to do again. Update the documentation to make it clear that the first bit is ignored. Link: https://lkml.kernel.org/r/20210219172555.FF0CDF23@viggo.jf.intel.com Signed-off-by: Dave Hansen Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE") Reviewed-by: Ben Widawsky Reviewed-by: Oscar Salvador Acked-by: David Rientjes Acked-by: Christoph Lameter Cc: Alex Shi Cc: Daniel Wagner Cc: "Tobin C. Harding" Cc: Christoph Lameter Cc: Andrew Morton Cc: Huang Ying Cc: Dan Williams Cc: Qian Cai Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/sysctl/vm.rst | 10 +++++----- mm/vmscan.c | 9 +++++++-- 2 files changed, 12 insertions(+), 7 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index e35a3f2fb006..586cd4b86428 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -983,11 +983,11 @@ that benefit from having their data cached, zone_reclaim_mode should be left disabled as the caching effect is likely to be more important than data locality. -zone_reclaim may be enabled if it's known that the workload is partitioned -such that each partition fits within a NUMA node and that accessing remote -memory would cause a measurable performance reduction. The page allocator -will then reclaim easily reusable pages (those page cache pages that are -currently not used) before allocating off node pages. +Consider enabling one or more zone_reclaim mode bits if it's known that the +workload is partitioned such that each partition fits within a NUMA node +and that accessing remote memory would cause a measurable performance +reduction. The page allocator will take additional actions before +allocating off node pages. Allowing zone reclaim to write out pages stops processes that are writing large amounts of data from dirtying pages on other nodes. Zone diff --git a/mm/vmscan.c b/mm/vmscan.c index fea6b43bc1f9..562e87cbd7a1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4085,8 +4085,13 @@ module_init(kswapd_init) */ int node_reclaim_mode __read_mostly; -#define RECLAIM_WRITE (1<<0) /* Writeout pages during reclaim */ -#define RECLAIM_UNMAP (1<<1) /* Unmap pages during reclaim */ +/* + * These bit locations are exposed in the vm.zone_reclaim_mode sysctl + * ABI. New bits are OK, but existing bits can never change. + */ +#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */ +#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */ +#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */ /* * Priority for NODE_RECLAIM. This determines the fraction of pages -- cgit v1.2.3