summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2021-06-29mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mmMuchun Song1-11/+14
When mm is NULL, we do not need to hold rcu lock and call css_tryget for the root memcg. And we also do not need to check !mm in every loop of while. So bail out early when !mm. Link: https://lkml.kernel.org/r/20210417043538.9793-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: memcontrol: fix page charging in page replacementMuchun Song1-3/+5
Patch series "memcontrol code cleanup and simplification", v3. This patch (of 8): The pages aren't accounted at the root level, so do not charge the page to the root memcg in page replacement. Although we do not display the value (mem_cgroup_usage) so there shouldn't be any actual problem, but there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it will trigger? So it is better to fix it. Link: https://lkml.kernel.org/r/20210417043538.9793-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210417043538.9793-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: memcontrol: fix root_mem_cgroup chargingMuchun Song1-5/+12
The below scenario can cause the page counters of the root_mem_cgroup to be out of balance. CPU0: CPU1: objcg = get_obj_cgroup_from_current() obj_cgroup_charge_pages(objcg) memcg_reparent_objcgs() // reparent to root_mem_cgroup WRITE_ONCE(iter->memcg, parent) // memcg == root_mem_cgroup memcg = get_mem_cgroup_from_objcg(objcg) // do not charge to the root_mem_cgroup try_charge(memcg) obj_cgroup_uncharge_pages(objcg) memcg = get_mem_cgroup_from_objcg(objcg) // uncharge from the root_mem_cgroup refill_stock(memcg) drain_stock(memcg) page_counter_uncharge(&memcg->memory) get_obj_cgroup_from_current() never returns a root_mem_cgroup's objcg, so we never explicitly charge the root_mem_cgroup. And it's not going to change. It's all about a race when we got an obj_cgroup pointing at some non-root memcg, but before we were able to charge it, the cgroup was gone, objcg was reparented to the root and so we're skipping the charging. Then we store the objcg pointer and later use to uncharge the root_mem_cgroup. This can cause the page counter to be less than the actual value. Although we do not display the value (mem_cgroup_usage) so there shouldn't be any actual problem, but there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it will trigger? So it is better to fix it. Link: https://lkml.kernel.org/r/20210425075410.19255-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: memcg/slab: disable cache merging for KMALLOC_NORMAL cachesWaiman Long1-0/+7
The KMALLOC_NORMAL (kmalloc-<n>) caches are for unaccounted objects only when CONFIG_MEMCG_KMEM is enabled. To make sure that this condition remains true, we will have to prevent KMALOC_NORMAL caches to merge with other kmem caches. This is now done by setting its refcount to -1 right after its creation. Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Suggested-by: Roman Gushchin <guro@fb.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: memcg/slab: create a new set of kmalloc-cg-<n> cachesWaiman Long3-10/+29
There are currently two problems in the way the objcg pointer array (memcg_data) in the page structure is being allocated and freed. On its allocation, it is possible that the allocated objcg pointer array comes from the same slab that requires memory accounting. If this happens, the slab will never become empty again as there is at least one object left (the obj_cgroup array) in the slab. When it is freed, the objcg pointer array object may be the last one in its slab and hence causes kfree() to be called again. With the right workload, the slab cache may be set up in a way that allows the recursive kfree() calling loop to nest deep enough to cause a kernel stack overflow and panic the system. One way to solve this problem is to split the kmalloc-<n> caches (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n> (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All the other caches can still allow a mix of accounted and unaccounted objects. With this change, all the objcg pointer array objects will come from KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So both the recursive kfree() problem and non-freeable slab problem are gone. Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have mixed accounted and unaccounted objects, this will slightly reduce the number of objcg pointer arrays that need to be allocated and save a bit of memory. On the other hand, creating a new set of kmalloc caches does have the effect of reducing cache utilization. So it is properly a wash. The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches() will include the newly added caches without change. [vbabka@suse.cz: don't create kmalloc-cg caches with cgroup.memory=nokmem] Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com [akpm@linux-foundation.org: un-fat-finger v5 delta creation] [longman@redhat.com: disable cache merging for KMALLOC_NORMAL caches] Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210505200610.13943-3-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> [longman@redhat.com: fix for CONFIG_ZONE_DMA=n] Suggested-by: Roman Gushchin <guro@fb.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: memcg/slab: properly set up gfp flags for objcg pointer arrayWaiman Long2-1/+8
Patch series "mm: memcg/slab: Fix objcg pointer array handling problem", v4. Since the merging of the new slab memory controller in v5.9, the page structure stores a pointer to objcg pointer array for slab pages. When the slab has no used objects, it can be freed in free_slab() which will call kfree() to free the objcg pointer array in memcg_alloc_page_obj_cgroups(). If it happens that the objcg pointer array is the last used object in its slab, that slab may then be freed which may caused kfree() to be called again. With the right workload, the slab cache may be set up in a way that allows the recursive kfree() calling loop to nest deep enough to cause a kernel stack overflow and panic the system. In fact, we have a reproducer that can cause kernel stack overflow on a s390 system involving kmalloc-rcl-256 and kmalloc-rcl-128 slabs with the following kfree() loop recursively called 74 times: [ 285.520739] [<000000000ec432fc>] kfree+0x4bc/0x560 [ 285.520740] [<000000000ec43466>] __free_slab+0xc6/0x228 [ 285.520741] [<000000000ec41fc2>] __slab_free+0x3c2/0x3e0 [ 285.520742] [<000000000ec432fc>] kfree+0x4bc/0x560 : While investigating this issue, I also found an issue on the allocation side. If the objcg pointer array happen to come from the same slab or a circular dependency linkage is formed with multiple slabs, those affected slabs can never be freed again. This patch series addresses these two issues by introducing a new set of kmalloc-cg-<n> caches split from kmalloc-<n> caches. The new set will only contain non-reclaimable and non-dma objects that are accounted in memory cgroups whereas the old set are now for unaccounted objects only. By making this split, all the objcg pointer arrays will come from the kmalloc-<n> caches, but those caches will never hold any objcg pointer array. As a result, deeply nested kfree() call and the unfreeable slab problems are now gone. This patch (of 4): Since the merging of the new slab memory controller in v5.9, the page structure may store a pointer to obj_cgroup pointer array for slab pages. Currently, only the __GFP_ACCOUNT bit is masked off. However, the array is not readily reclaimable and doesn't need to come from the DMA buffer. So those GFP bits should be masked off as well. Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure that it is consistently applied no matter where it is called. Link: https://lkml.kernel.org/r/20210505200610.13943-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210505200610.13943-2-longman@redhat.com Fixes: 286e04b8ed7a ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/memcg: optimize user context object stock accessWaiman Long1-28/+72
Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured amount of kmem_cache_alloc() calls from non-task context was about 0.01% of the total. The irq disable/enable sequence used in this case to access content from object stock is slow. To optimize for user context access, there are now two sets of object stocks (in the new obj_stock structure) for task context and interrupt context access respectively. The task context object stock can be accessed after disabling preemption which is cheap in non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupt. User context code can access interrupt object stock, but not vice versa. The downside of this change is that there are more data stored in local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance. [longman@redhat.com: fix potential uninitialized variable warning] Link: https://lkml.kernel.org/r/20210526193602.8742-1-longman@redhat.com [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20210506150007.16288-5-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Roman Gushchin <guro@fb.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/memcg: improve refill_obj_stock() performanceWaiman Long1-13/+35
There are two issues with the current refill_obj_stock() code. First of all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to atomically flush out remaining bytes to obj_cgroup, clear cached_objcg and do a obj_cgroup_put(). It is likely that the same obj_cgroup will be used again which leads to another call to drain_obj_stock() and obj_cgroup_get() as well as atomically retrieve the available byte from obj_cgroup. That is costly. Instead, we should just uncharge the excess pages, reduce the stock bytes and be done with it. The drain_obj_stock() function should only be called when obj_cgroup changes. Secondly, when charging an object of size not less than a page in obj_cgroup_charge(), it is possible that the remaining bytes to be refilled to the stock will overflow a page and cause refill_obj_stock() to uncharge 1 page. To avoid the additional uncharge in this case, a new allow_uncharge flag is added to refill_obj_stock() which will be set to false when called from obj_cgroup_charge() so that an uncharge_pages() call won't be issued right after a charge_pages() call unless the objcg changes. A multithreaded kmalloc+kfree microbenchmark on a 2-socket 48-core 96-thread x86-64 system with 96 testing threads were run. Before this patch, the total number of kilo kmalloc+kfree operations done for a 4k large object by all the testing threads per second were 4,304 kops/s (cgroup v1) and 8,478 kops/s (cgroup v2). After applying this patch, the number were 4,731 (cgroup v1) and 418,142 (cgroup v2) respectively. This represents a performance improvement of 1.10X (cgroup v1) and 49.3X (cgroup v2). Link: https://lkml.kernel.org/r/20210506150007.16288-4-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/memcg: cache vmstat data in percpu memcg_stock_pcpWaiman Long1-3/+87
Before the new slab memory controller with per object byte charging, charging and vmstat data update happen only when new slab pages are allocated or freed. Now they are done with every kmem_cache_alloc() and kmem_cache_free(). This causes additional overhead for workloads that generate a lot of alloc and free calls. The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup to reduce that overhead. To further reducing it, this patch makes the vmstat data cached in the memcg_stock_pcp structure as well until it accumulates a page size worth of update or when other cached data change. Caching the vmstat data in the per-cpu stock eliminates two writes to non-hot cachelines for memcg specific as well as memcg-lruvecs specific vmstat data by a write to a hot local stock cacheline. On a 2-socket Cascade Lake server with instrumentation enabled and this patch applied, it was found that about 20% (634400 out of 3243830) of the time when mod_objcg_state() is called leads to an actual call to __mod_objcg_state() after initial boot. When doing parallel kernel build, the figure was about 17% (24329265 out of 142512465). So caching the vmstat data reduces the number of calls to __mod_objcg_state() by more than 80%. Link: https://lkml.kernel.org/r/20210506150007.16288-3-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/memcg: move mod_objcg_state() to memcontrol.cWaiman Long2-14/+15
Patch series "mm/memcg: Reduce kmemcache memory accounting overhead", v6. With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free(). For workloads that require a lot of kmemcache allocations and de-allocations, they may experience performance regression as illustrated in [1] and [2]. A simple kernel module that performs repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() of either a small 32-byte object or a big 4k object at module init time with a batch size of 4 (4 kmalloc's followed by 4 kfree's) is used for benchmarking. The benchmarking tool was run on a kernel based on linux-next-20210419. The test was run on a CascadeLake server with turbo-boosting disable to reduce run-to-run variation. The small object test exercises mainly the object stock charging and vmstat update code paths. The large object test also exercises the refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code paths. With memory accounting disabled, the run time was 3.130s with both small object big object tests. With memory accounting enabled, both cgroup v1 and v2 showed similar results in the small object test. The performance results of the large object test, however, differed between cgroup v1 and v2. The execution times with the application of various patches in the patchset were: Applied patches Run time Accounting overhead %age 1 %age 2 --------------- -------- ------------------- ------ ------ Small 32-byte object: None 11.634s 8.504s 100.0% 271.7% 1-2 9.425s 6.295s 74.0% 201.1% 1-3 9.708s 6.578s 77.4% 210.2% 1-4 8.062s 4.932s 58.0% 157.6% Large 4k object (v2): None 22.107s 18.977s 100.0% 606.3% 1-2 20.960s 17.830s 94.0% 569.6% 1-3 14.238s 11.108s 58.5% 354.9% 1-4 11.329s 8.199s 43.2% 261.9% Large 4k object (v1): None 36.807s 33.677s 100.0% 1075.9% 1-2 36.648s 33.518s 99.5% 1070.9% 1-3 22.345s 19.215s 57.1% 613.9% 1-4 18.662s 15.532s 46.1% 496.2% N.B. %age 1 = overhead/unpatched overhead %age 2 = overhead/accounting disabled time Patch 2 (vmstat data stock caching) helps in both the small object test and the large v2 object test. It doesn't help much in v1 big object test. Patch 3 (refill_obj_stock improvement) does help the small object test but offer significant performance improvement for the large object test (both v1 and v2). Patch 4 (eliminating irq disable/enable) helps in all test cases. To test for the extreme case, a multi-threaded kmalloc/kfree microbenchmark was run on the 2-socket 48-core 96-thread system with 96 testing threads in the same memcg doing kmalloc+kfree of a 4k object with accounting enabled for 10s. The total number of kmalloc+kfree done in kilo operations per second (kops/s) were as follows: Applied patches v1 kops/s v1 change v2 kops/s v2 change --------------- --------- --------- --------- --------- None 3,520 1.00X 6,242 1.00X 1-2 4,304 1.22X 8,478 1.36X 1-3 4,731 1.34X 418,142 66.99X 1-4 4,587 1.30X 438,838 70.30X With memory accounting disabled, the kmalloc/kfree rate was 1,481,291 kop/s. This test shows how significant the memory accouting overhead can be in some extreme situations. For this multithreaded test, the improvement from patch 2 mainly comes from the conditional atomic xchg of objcg->nr_charged_bytes in mod_objcg_state(). By using an unconditional xchg, the operation rates were similar to the unpatched kernel. Patch 3 elminates the single highly contended cacheline of objcg->nr_charged_bytes for cgroup v2 leading to a huge performance improvement. Cgroup v1, however, still has another highly contended cacheline in the shared page counter &memcg->kmem. So the improvement is only modest. Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as eliminating the irq_disable/irq_enable overhead seems to aggravate the cacheline contention. [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/ This patch (of 4): mod_objcg_state() is moved from mm/slab.h to mm/memcontrol.c so that further optimization can be done to it in later patches without exposing unnecessary details to other mm components. Link: https://lkml.kernel.org/r/20210506150007.16288-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210506150007.16288-2-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29swap: check mapping_empty() for swap cache before being freedHuang Ying1-1/+6
To check whether all pages and shadow entries in swap cache has been removed before swap cache is freed. Link: https://lkml.kernel.org/r/20210608005121.511140-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: free idle swap cache page after COWHuang Ying2-1/+3
With commit 09854ba94c6a ("mm: do_wp_page() simplification"), after COW, the idle swap cache page (neither the page nor the corresponding swap entry is mapped by any process) will be left in the LRU list, even if it's in the active list or the head of the inactive list. So, the page reclaimer may take quite some overhead to reclaim these actually unused pages. To help the page reclaiming, in this patch, after COW, the idle swap cache page will be tried to be freed. To avoid to introduce much overhead to the hot COW code path, a) there's almost zero overhead for non-swap case via checking PageSwapCache() firstly. b) the page lock is acquired via trylock only. To test the patch, we used pmbench memory accessing benchmark with working-set larger than available memory on a 2-socket Intel server with a NVMe SSD as swap device. Test results shows that the pmbench score increases up to 23.8% with the decreased size of swap cache and swapin throughput. Link: https://lkml.kernel.org/r/20210601053143.1380078-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> [use free_swap_cache()] Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm, swap: remove unnecessary smp_rmb() in swap_type_to_swap_info()Huang Ying1-9/+6
Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array accesses to avoid NULL derefs"), the typical code to reference the swap_info[] is as follows, type = swp_type(swp_entry); if (type >= nr_swapfiles) /* handle invalid swp_entry */; p = swap_info[type]; /* access fields of *p. OOPS! p may be NULL! */ Because the ordering isn't guaranteed, it's possible that swap_info[type] is read before "nr_swapfiles". And that may result in NULL pointer dereference. So after commit c10d38cc8d3e, the code becomes, struct swap_info_struct *swap_type_to_swap_info(int type) { if (type >= READ_ONCE(nr_swapfiles)) return NULL; smp_rmb(); return READ_ONCE(swap_info[type]); } /* users */ type = swp_type(swp_entry); p = swap_type_to_swap_info(type); if (!p) /* handle invalid swp_entry */; /* dereference p */ Where the value of swap_info[type] (that is, "p") is checked to be non-zero before being dereferenced. So, the NULL deferencing becomes impossible even if "nr_swapfiles" is read after swap_info[type]. Therefore, the "smp_rmb()" becomes unnecessary. And, we don't even need to read "nr_swapfiles" here. Because the non-zero checking for "p" is sufficient. We just need to make sure we will not access out of the boundary of the array. With the change, nr_swapfiles will only be accessed with swap_lock held, except in swapcache_free_entries(). Where the absolute correctness of the value isn't needed, as described in the comments. We still need to guarantee swap_info[type] is read before being dereferenced. That can be satisfied via the data dependency ordering enforced by READ_ONCE(swap_info[type]). This needs to be paired with proper write barriers. So smp_store_release() is used in alloc_swap_info() to guarantee the fields of *swap_info[type] is initialized before swap_info[type] itself being written. Note that the fields of *swap_info[type] is initialized to be 0 via kvzalloc() firstly. The assignment and deferencing of swap_info[type] is like rcu_assign_pointer() and rcu_dereference(). Link: https://lkml.kernel.org/r/20210520073301.1676294-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Andrea Parri <andrea.parri@amarulasolutions.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Paul McKenney <paulmck@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/swap_slots.c: delete meaningless forward declarationsMiaohe Lin1-2/+0
deactivate_swap_slots_cache() and reactivate_swap_slots_cache() are only called below their implementations. So these forward declarations are meaningless and should be removed. Link: https://lkml.kernel.org/r/20210520134022.1370406-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/swap: remove unused local variable nr_shadowsMiaohe Lin1-5/+0
Since commit 55c653b71e8c ("mm: stop accounting shadow entries"), nr_shadows is not used anymore. Link: https://lkml.kernel.org/r/20210520134022.1370406-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/swapfile: move get_swap_page_of_type() under CONFIG_HIBERNATIONMiaohe Lin1-52/+31
Patch series "Cleanups for swap", v2. This series contains just cleanups to remove some unused variables, delete meaningless forward declarations and so on. More details can be found in the respective changelogs. This patch (of 4): We should move get_swap_page_of_type() under CONFIG_HIBERNATION since the only caller of this function is now suspend routine. [linmiaohe@huawei.com: move scan_swap_map() under CONFIG_HIBERNATION] Link: https://lkml.kernel.org/r/20210521070855.2015094-1-linmiaohe@huawei.com [linmiaohe@huawei.com: fold scan_swap_map() into the only caller get_swap_page_of_type()] Link: https://lkml.kernel.org/r/20210527120328.3935132-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210520134022.1370406-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210520134022.1370406-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/shmem: fix shmem_swapin() race with swapoffMiaohe Lin1-1/+13
When I was investigating the swap code, I found the below possible race window: CPU 1 CPU 2 ----- ----- shmem_swapin swap_cluster_readahead if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) { swapoff .. si->swap_file = NULL; .. struct inode *inode = si->swap_file->f_mapping->host;[oops!] Close this race window by using get/put_swap_device() to guard against concurrent swapoff. Link: https://lkml.kernel.org/r/20210426123316.806267-5-linmiaohe@huawei.com Fixes: 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or not") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alexs@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info()Miaohe Lin1-6/+0
The non_swap_entry() was used for working with VMA based swap readahead via commit ec560175c0b6 ("mm, swap: VMA based swap readahead"). At that time, the non_swap_entry() checking is necessary because the function is called before checking that in do_swap_page(). Then it's moved to swap_ra_info() since commit eaf649ebc3ac ("mm: swap: clean up swap readahead"). After that, the non_swap_entry() checking is unnecessary, because swap_ra_info() is called after non_swap_entry() has been checked already. The resulting code is confusing as the non_swap_entry() check looks racy now because while we released the pte lock, somebody else might have faulted in this pte. So we should check whether it's swap pte first to guard against such race or swap_type will be unexpected. But the race isn't important because it will not cause problem. We would have enough checking when we really operate the PTE entries later. So we remove the non_swap_entry() check here to avoid confusion. Link: https://lkml.kernel.org/r/20210426123316.806267-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29swap: fix do_swap_page() race with swapoffMiaohe Lin1-2/+9
When I was investigating the swap code, I found the below possible race window: CPU 1 CPU 2 ----- ----- do_swap_page if (data_race(si->flags & SWP_SYNCHRONOUS_IO) swap_readpage if (data_race(sis->flags & SWP_FS_OPS)) { swapoff .. p->swap_file = NULL; .. struct file *swap_file = sis->swap_file; struct address_space *mapping = swap_file->f_mapping;[oops!] Note that for the pages that are swapped in through swap cache, this isn't an issue. Because the page is locked, and the swap entry will be marked with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been unlocked. Fix this race by using get/put_swap_device() to guard against concurrent swapoff. Link: https://lkml.kernel.org/r/20210426123316.806267-3-linmiaohe@huawei.com Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/swapfile: use percpu_ref to serialize against concurrent swapoffMiaohe Lin1-30/+49
Patch series "close various race windows for swap", v6. When I was investigating the swap code, I found some possible race windows. This series aims to fix all these races. But using current get/put_swap_device() to guard against concurrent swapoff for swap_readpage() looks terrible because swap_readpage() may take really long time. And to reduce the performance overhead on the hot-path as much as possible, it appears we can use the percpu_ref to close this race window(as suggested by Huang, Ying). The patch 1 adds percpu_ref support for swap and most of the remaining patches try to use this to close various race windows. More details can be found in the respective changelogs. This patch (of 4): Using current get/put_swap_device() to guard against concurrent swapoff for some swap ops, e.g. swap_readpage(), looks terrible because they might take really long time. This patch adds the percpu_ref support to serialize against concurrent swapoff(as suggested by Huang, Ying). Also we remove the SWP_VALID flag because it's used together with RCU solution. Link: https://lkml.kernel.org/r/20210426123316.806267-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210426123316.806267-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: pagewalk: fix walk for hugepage tablesChristophe Leroy1-5/+53
Pagewalk ignores hugepd entries and walk down the tables as if it was traditionnal entries, leading to crazy result. Add walk_hugepd_range() and use it to walk hugepage tables. Link: https://lkml.kernel.org/r/38d04410700c8d02f28ba37e020b62c55d6f3d2c.1624597695.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Steven Price <steven.price@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Daniel Axtens <dja@axtens.net> Cc: "Oliver O'Halloran" <oohall@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: gup: pack has_pinned in MMF_HAS_PINNEDAndrea Arcangeli1-4/+15
has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop cleanup. Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast would reintroduce a loss of SMP scalability to pin-fast, so there's no future potential usefulness to keep an atomic in the mm for this. set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE (atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like atomic_set after this commit) has to be still issued only once per "mm", so the difference between the two will be lost in the noise. will-it-scale "mmap2" shows no change in performance with enterprise config as expected. will-it-scale "pin_fast" retains the > 4000% SMP scalability performance improvement against upstream as expected. This is a noop as far as overall performance and SMP scalability are concerned. [peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED] Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s [akpm@linux-foundation.org: coding style fixes] [peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments] Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: gup: allow FOLL_PIN to scale in SMPAndrea Arcangeli1-2/+2
has_pinned cannot be written by each pin-fast or it won't scale in SMP. This isn't "false sharing" strictly speaking (it's more like "true non-sharing"), but it creates the same SMP scalability bottleneck of "false sharing". To verify the improvement, below test is done on 40 cpus host with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (must be with CONFIG_GUP_TEST=y): $ sudo chrt -f 1 ./gup_test -a -m 512 -j 40 Where we can get (average value for 40 threads): Old kernel: 477729.97 (+- 3.79%) New kernel: 89144.65 (+-11.76%) On a similar condition with 256 cpus, this commits increases the SMP scalability of pin_user_pages_fast() executed by different threads of the same process by more than 4000%. [peterx@redhat.com: rewrite commit message, add parentheses against "(A & B)"] Link: https://lkml.kernel.org/r/20210507150553.208763-3-peterx@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29fs: remove noop_set_page_dirty()Matthew Wilcox (Oracle)1-0/+1
Use __set_page_dirty_no_writeback() instead. This will set the dirty bit on the page, which will be used to avoid calling set_page_dirty() in the future. It will have no effect on actually writing the page back, as the pages are not on any LRU lists. [akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules] Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/writeback: use __set_page_dirty in __set_page_dirty_nobuffersMatthew Wilcox (Oracle)1-9/+1
This is fundamentally the same code, so just call it instead of duplicating it. Link: https://lkml.kernel.org/r/20210615162342.1669332-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/writeback: move __set_page_dirty() to core mmMatthew Wilcox (Oracle)1-1/+26
Patch series "Further set_page_dirty cleanups". Prompted by Christoph's recent patches, here are some more patches to improve the state of set_page_dirty(). They're all from the folio tree, so they've been tested to a certain extent. This patch (of 6): Nothing in __set_page_dirty() is specific to buffer_head, so move it to mm/page-writeback.c. That removes the only caller of account_page_dirtied() outside of page-writeback.c, so make it static. Link: https://lkml.kernel.org/r/20210615162342.1669332-1-willy@infradead.org Link: https://lkml.kernel.org/r/20210615162342.1669332-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: require ->set_page_dirty to be explicitly wired upChristoph Hellwig1-14/+4
Remove the CONFIG_BLOCK default to __set_page_dirty_buffers and just wire that method up for the missing instances. [hch@lst.de: ecryptfs: add a ->set_page_dirty cludge] Link: https://lkml.kernel.org/r/20210624125250.536369-1-hch@lst.de Link: https://lkml.kernel.org/r/20210614061512.3966143-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Tyler Hicks <code@tyhicks.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29writeback, cgroup: release dying cgwbs by switching attached inodesRoman Gushchin1-2/+62
Asynchronously try to release dying cgwbs by switching attached inodes to the nearest living ancestor wb. It helps to get rid of per-cgroup writeback structures themselves and of pinned memory and block cgroups, which are significantly larger structures (mostly due to large per-cpu statistics data). This prevents memory waste and helps to avoid different scalability problems caused by large piles of dying cgroups. Reuse the existing mechanism of inode switching used for foreign inode detection. To speed things up batch up to 115 inode switching in a single operation (the maximum number is selected so that the resulting struct inode_switch_wbs_context can fit into 1024 bytes). Because every switching consists of two steps divided by an RCU grace period, it would be too slow without batching. Please note that the whole batch counts as a single operation (when increasing/decreasing isw_nr_in_flight). This allows to keep umounting working (flush the switching queue), however prevents cleanups from consuming the whole switching quota and effectively blocking the frn switching. A cgwb cleanup operation can fail due to different reasons (e.g. not enough memory, the cgwb has an in-flight/pending io, an attached inode in a wrong state, etc). In this case the next scheduled cleanup will make a new attempt. An attempt is made each time a new cgwb is offlined (in other words a memcg and/or a blkcg is deleted by a user). In the future an additional attempt scheduled by a timer can be implemented. [guro@fb.com: replace open-coded "115" with arithmetic] Link: https://lkml.kernel.org/r/YMEcSBcq/VXMiPPO@carbon.dhcp.thefacebook.com [guro@fb.com: add smp_mb() to inode_prepare_wbs_switch()] Link: https://lkml.kernel.org/r/YMFa+guFw7OFjf3X@carbon.dhcp.thefacebook.com [willy@infradead.org: fix documentation] Link: https://lkml.kernel.org/r/20210615200242.1716568-2-willy@infradead.org Link: https://lkml.kernel.org/r/20210608230225.2078447-9-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29writeback, cgroup: keep list of inodes attached to bdi_writebackRoman Gushchin1-0/+2
Currently there is no way to iterate over inodes attached to a specific cgwb structure. It limits the ability to efficiently reclaim the writeback structure itself and associated memory and block cgroup structures without scanning all inodes belonging to a sb, which can be prohibitively expensive. While dirty/in-active-writeback an inode belongs to one of the bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time. Once cleaned up, it's removed from all io lists. So the inode->i_io_list can be reused to maintain the list of inodes, attached to a bdi_writeback structure. This patch introduces a new wb->b_attached list, which contains all inodes which were dirty at least once and are attached to the given cgwb. Inodes attached to the root bdi_writeback structures are never placed on such list. The following patch will use this list to try to release cgwbs structures more efficiently. Link: https://lkml.kernel.org/r/20210608230225.2078447-6-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/page-writeback: use __this_cpu_inc() in account_page_dirtied()Chi Wu1-1/+1
As account_page_dirtied() was always protected by xa_lock_irqsave(), so using __this_cpu_inc() is better. Link: https://lkml.kernel.org/r/20210512144742.4764-1-wuchi.zero@gmail.com Signed-off-by: Chi Wu <wuchi.zero@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/page-writeback: update the comment of Dirty position controlChi Wu1-1/+1
As the value of pos_ratio_polynom() clamp between 0 and 2LL << RATELIMIT_CALC_SHIFT, the global control line should be consistent with it. Link: https://lkml.kernel.org/r/20210511103606.3732-1-wuchi.zero@gmail.com Signed-off-by: Chi Wu <wuchi.zero@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@fb.com> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/page-writeback: Fix performance when BDI's share of ratio is 0.Chi Wu1-4/+16
Fix performance when BDI's share of ratio is 0. The issue is similar to commit 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()"). Balance_dirty_pages and the writeback worker will also disagree on whether writeback when a BDI uses BDI_CAP_STRICTLIMIT and BDI's share of the thresh ratio is zero. For example, A thread on cpu0 writes 32 pages and then balance_dirty_pages, it will wake up background writeback and pauses because wb_dirty > wb->wb_thresh = 0 (share of thresh ratio is zero). A thread may runs on cpu0 again because scheduler prefers pre_cpu. Then writeback worker may runs on other cpus(1,2..) which causes the value of wb_stat(wb, WB_RECLAIMABLE) in wb_over_bg_thresh is 0 and does not writeback and returns. Thus, balance_dirty_pages keeps looping, sleeping and then waking up the worker who will do nothing. It remains stuck in this state until the writeback worker hit the right dirty cpu or the dirty pages expire. The fix that we should get the wb_stat_sum radically when thresh is low. Link: https://lkml.kernel.org/r/20210428225046.16301-1-wuchi.zero@gmail.com Signed-off-by: Chi Wu <wuchi.zero@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: page-writeback: kill get_writeback_state() commentsKefeng Wang1-6/+3
The get_writeback_state() has gone since 2006, kill related comments. Link: https://lkml.kernel.org/r/20210508125026.56600-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/page_reporting: allow driver to specify reporting orderGavin Shan1-0/+6
The page reporting order (threshold) is sticky to @pageblock_order by default. The page reporting can never be triggered because the freeing page can't come up with a free area like that huge. The situation becomes worse when the system memory becomes heavily fragmented. For example, the following configurations are used on ARM64 when 64KB base page size is enabled. In this specific case, the page reporting won't be triggered until the freeing page comes up with a 512MB free area. That's hard to be met, especially when the system memory becomes heavily fragmented. PAGE_SIZE: 64KB HPAGE_SIZE: 512MB pageblock_order: 13 (512MB) MAX_ORDER: 14 This allows the drivers to specify the page reporting order when the page reporting device is registered. It falls back to @pageblock_order if it's not specified by the driver. The existing users (hv_balloon and virtio_balloon) don't specify it and @pageblock_order is still taken as their page reporting order. So this shouldn't introduce any functional changes. Link: https://lkml.kernel.org/r/20210625014710.42954-4-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/page_reporting: export reporting order as module parameterGavin Shan2-5/+9
The macro PAGE_REPORTING_MIN_ORDER is defined as the page reporting threshold. It can't be adjusted at runtime. This introduces a variable (@page_reporting_order) to replace the marcro (PAGE_REPORTING_MIN_ORDER). MAX_ORDER is assigned to it initially, meaning the page reporting is disabled. It will be specified by driver if valid one is provided. Otherwise, it will fall back to @pageblock_order. It's also exported so that the page reporting order can be adjusted at runtime. Link: https://lkml.kernel.org/r/20210625014710.42954-3-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/page_reporting: fix code style in __page_reporting_request()Gavin Shan1-2/+2
Patch series "mm/page_reporting: Make page reporting work on arm64 with 64KB page size", v4. The page reporting threshold is currently equal to @pageblock_order, which is 13 and 512MB on arm64 with 64KB base page size selected. The page reporting won't be triggered if the freeing page can't come up with a free area like that huge. The condition is hard to be met, especially when the system memory becomes fragmented. This series intends to solve the issue by having page reporting threshold as 5 (2MB) on arm64 with 64KB base page size. The patches are organized as: PATCH[1/4] Fix some coding style in __page_reporting_request(). PATCH[2/4] Represents page reporting order with variable so that it can be exported as module parameter. PATCH[3/4] Allows the device driver (e.g. virtio_balloon) to specify the page reporting order when the device info is registered. PATCH[4/4] Specifies the page reporting order to 5, corresponding to 2MB in size on ARM64 when 64KB base page size is used. This patch (of 4): The lines of comments would be starting with one, instead two space. This corrects the style. Link: https://lkml.kernel.org/r/20210625014710.42954-1-gshan@redhat.com Link: https://lkml.kernel.org/r/20210625014710.42954-2-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: mmap_lock: use local locks instead of disabling preemptionNicolas Saenz Julienne1-11/+22
mmap_lock will explicitly disable/enable preemption upon manipulating its local CPU variables. This is to be expected, but in this case, it doesn't play well with PREEMPT_RT. The preemption disabled code section also takes a spin-lock. Spin-locks in RT systems will try to schedule, which is exactly what we're trying to avoid. To mitigate this, convert the explicit preemption handling to local_locks. Which are RT aware, and will disable migration instead of preemption when PREEMPT_RT=y. The faulty call trace looks like the following: __mmap_lock_do_trace_*() preempt_disable() get_mm_memcg_path() cgroup_path() kernfs_path_from_node() spin_lock_irqsave() /* Scheduling while atomic! */ Link: https://lkml.kernel.org/r/20210604163506.2103900-1-nsaenzju@redhat.com Fixes: 2b5067a8143e3 ("mm: mmap_lock: add tracepoints around lock acquisition ") Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/debug_vm_pgtable: ensure THP availability via has_transparent_hugepage()Anshuman Khandual1-12/+51
On certain platforms, THP support could not just be validated via the build option CONFIG_TRANSPARENT_HUGEPAGE. Instead has_transparent_hugepage() also needs to be called upon to verify THP runtime support. Otherwise the debug test will just run into unusable THP helpers like in the case of a 4K hash config on powerpc platform [1]. This just moves all pfn_pmd() and pfn_pud() after THP runtime validation with has_transparent_hugepage() which prevents the mentioned problem. [1] https://bugzilla.kernel.org/show_bug.cgi?id=213069 Link: https://lkml.kernel.org/r/1621397588-19211-1-git-send-email-anshuman.khandual@arm.com Fixes: 787d563b8642 ("mm/debug_vm_pgtable: fix kernel crash by checking for THP support") Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/kmemleak: fix possible wrong memory scanning periodYanfei Xu1-6/+12
This commit contains 3 modifications: 1. Convert the type of jiffies_scan_wait to "unsigned long". 2. Use READ/WRITE_ONCE() for accessing "jiffies_scan_wait". 3. Fix the possible wrong memory scanning period. If you set a large memory scanning period like blow, then the "secs" variable will be non-zero, however the value of "jiffies_scan_wait" will be zero. echo "scan=0x10000000" > /sys/kernel/debug/kmemleak It is because the type of the msecs_to_jiffies()'s parameter is "unsigned int", and the "secs * 1000" is larger than its max value. This in turn leads a unexpected jiffies_scan_wait, maybe zero. We corret it by replacing kstrtoul() with kstrtouint(), and check the msecs to prevent it larger than UINT_MAX. Link: https://lkml.kernel.org/r/20210613174022.23044-1-yanfei.xu@windriver.com Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/slub: add taint after the errors are printedGeorgi Djakov1-2/+3
When running the kernel with panic_on_taint, the usual slub debug error messages are not being printed when object corruption happens. That's because we panic in add_taint(), which is called before printing the additional information. This is a bit unfortunate as the error messages are actually very useful, especially before a panic. Let's fix this by moving add_taint() after the errors are printed on the console. Link: https://lkml.kernel.org/r/1623860738-146761-1-git-send-email-quic_c_gdjako@quicinc.com Signed-off-by: Georgi Djakov <quic_c_gdjako@quicinc.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: slub: move sysfs slab alloc/free interfaces to debugfsFaiyaz Mohammed3-93/+189
alloc_calls and free_calls implementation in sysfs have two issues, one is PAGE_SIZE limitation of sysfs and other is it does not adhere to "one value per file" rule. To overcome this issues, move the alloc_calls and free_calls implementation to debugfs. Debugfs cache will be created if SLAB_STORE_USER flag is set. Rename the alloc_calls/free_calls to alloc_traces/free_traces, to be inline with what it does. [faiyazm@codeaurora.org: fix the leak of alloc/free traces debugfs interface] Link: https://lkml.kernel.org/r/1624248060-30286-1-git-send-email-faiyazm@codeaurora.org Link: https://lkml.kernel.org/r/1623438200-19361-1-git-send-email-faiyazm@codeaurora.org Signed-off-by: Faiyaz Mohammed <faiyazm@codeaurora.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29slub: force on no_hash_pointers when slub_debug is enabledStephen Boyd1-1/+19
Obscuring the pointers that slub shows when debugging makes for some confusing slub debug messages: Padding overwritten. 0x0000000079f0674a-0x000000000d4dce17 Those addresses are hashed for kernel security reasons. If we're trying to be secure with slub_debug on the commandline we have some big problems given that we dump whole chunks of kernel memory to the kernel logs. Let's force on the no_hash_pointers commandline flag when slub_debug is on the commandline. This makes slub debug messages more meaningful and if by chance a kernel address is in some slub debug object dump we will have a better chance of figuring out what went wrong. Note that we don't use %px in the slub code because we want to reduce the number of places that %px is used in the kernel. This also nicely prints a big fat warning at kernel boot if slub_debug is on the commandline so that we know that this kernel shouldn't be used on production systems. [akpm@linux-foundation.org: fix build with CONFIG_SLUB_DEBUG=n] Link: https://lkml.kernel.org/r/20210601182202.3011020-5-swboyd@chromium.org Signed-off-by: Stephen Boyd <swboyd@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Petr Mladek <pmladek@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29slub: indicate slab_fix() uses printf formatsJoe Perches1-3/+4
Ideally, slab_fix() would be marked with __printf and the format here would not use \n as that's emitted by the slab_fix(). Make these changes. Link: https://lkml.kernel.org/r/20210601182202.3011020-4-swboyd@chromium.org Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29slub: actually use 'message' in restore_bytes()Stephen Boyd1-1/+1
The message argument isn't used here. Let's pass the string to the printk message so that the developer can figure out what's happening, instead of guessing that a redzone is being restored, etc. Link: https://lkml.kernel.org/r/20210601182202.3011020-3-swboyd@chromium.org Signed-off-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29slub: restore slub_debug=- behaviorStephen Boyd1-0/+2
Petch series "slub: Print non-hashed pointers in slub debugging", v3. I was doing some debugging recently and noticed that my pointers were being hashed while slub_debug was on the kernel commandline. Let's force on the no hash pointer option when slub_debug is on the kernel commandline so that the prints are more meaningful. The first two patches are something else I noticed while looking at the code. The message argument is never used so the debugging messages are not as clear as they could be and the slub_debug=- behavior seems to be busted. Then there's a printf fixup from Joe and the final patch is the one that force disables pointer hashing. This patch (of 4): Passing slub_debug=- on the kernel commandline is supposed to disable slub debugging. This is especially useful with CONFIG_SLUB_DEBUG_ON where the default is to have slub debugging enabled in the build. Due to some code reorganization this behavior was dropped, but the code to make it work mostly stuck around. Restore the previous behavior by disabling the static key when we parse the commandline and see that we're trying to disable slub debugging. Link: https://lkml.kernel.org/r/20210601182202.3011020-1-swboyd@chromium.org Link: https://lkml.kernel.org/r/20210601182202.3011020-2-swboyd@chromium.org Fixes: ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()") Signed-off-by: Stephen Boyd <swboyd@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm, slub: change run-time assertion in kmalloc_index() to compile-timeHyeonggon Yoo2-6/+6
Currently when size is not supported by kmalloc_index, compiler will generate a run-time BUG() while compile-time error is also possible, and better. So change BUG to BUILD_BUG_ON_MSG to make compile-time check possible. Also remove code that allocates more than 32MB because current implementation supports only up to 32MB. [42.hyeyoo@gmail.com: fix support for clang 10] Link: https://lkml.kernel.org/r/20210518181247.GA10062@hyeyoo [vbabka@suse.cz: fix false-positive assert in kernel/bpf/local_storage.c] Link: https://lkml.kernel.org/r/bea97388-01df-8eac-091b-a3c89b4a4a09@suse.czLink: https://lkml.kernel.org/r/20210511173448.GA54466@hyeyoo [elver@google.com: kfence fix] Link: https://lkml.kernel.org/r/20210512195227.245000695c9014242e9a00e5@linux-foundation.org Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Marco Elver <elver@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29slub: remove resiliency_test() functionOliver Glitta1-64/+0
Function resiliency_test() is hidden behind #ifdef SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody runs it. This function is replaced with KUnit test for SLUB added by the previous patch "selftests: add a KUnit test for SLUB debugging functionality". Link: https://lkml.kernel.org/r/20210511150734.3492-3-glittao@gmail.com Signed-off-by: Oliver Glitta <glittao@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Oliver Glitta <glittao@gmail.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Daniel Latypov <dlatypov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/slub, kunit: add a KUnit test for SLUB debugging functionalityOliver Glitta3-3/+47
SLUB has resiliency_test() function which is hidden behind #ifdef SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody runs it. KUnit should be a proper replacement for it. Try changing byte in redzone after allocation and changing pointer to next free node, first byte, 50th byte and redzone byte. Check if validation finds errors. There are several differences from the original resiliency test: Tests create own caches with known state instead of corrupting shared kmalloc caches. The corruption of freepointer uses correct offset, the original resiliency test got broken with freepointer changes. Scratch changing random byte test, because it does not have meaning in this form where we need deterministic results. Add new option CONFIG_SLUB_KUNIT_TEST in Kconfig. Tests next_pointer, first_word and clobber_50th_byte do not run with KASAN option on. Because the test deliberately modifies non-allocated objects. Use kunit_resource to count errors in cache and silence bug reports. Count error whenever slab_bug() or slab_fix() is called or when the count of pages is wrong. [glittao@gmail.com: remove unused function test_exit(), from SLUB KUnit test] Link: https://lkml.kernel.org/r/20210512140656.12083-1-glittao@gmail.com [akpm@linux-foundation.org: export kasan_enable/disable_current to modules] Link: https://lkml.kernel.org/r/20210511150734.3492-2-glittao@gmail.com Signed-off-by: Oliver Glitta <glittao@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Daniel Latypov <dlatypov@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29slab: use __func__ to trace function namegumingtao1-6/+6
It is better to use __func__ to trace function name. Link: https://lkml.kernel.org/r/31fdbad5c45cd1e26be9ff37be321b8586b80fee.1624355507.git.gumingtao@xiaomi.com Signed-off-by: gumingtao <gumingtao@xiaomi.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/page_alloc: correct return value of populated elements if bulk array is ↵Mel Gorman1-1/+1
populated Dave Jones reported the following This made it into 5.13 final, and completely breaks NFSD for me (Serving tcp v3 mounts). Existing mounts on clients hang, as do new mounts from new clients. Rebooting the server back to rc7 everything recovers. The commit b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after checking populated elements") returns the wrong value if the array is already populated which is interpreted as an allocation failure. Dave reported this fixes his problem and it also passed a test running dbench over NFS. Link: https://lkml.kernel.org/r/20210628150219.GC3840@techsingularity.net Fixes: b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after checking populated elements") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Dave Jones <davej@codemonkey.org.uk> Tested-by: Dave Jones <davej@codemonkey.org.uk> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [5.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>