From 3b38722efd9f66da63bbbd41520c2e6fa9db3d68 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 12 Sep 2013 15:13:21 -0700 Subject: memcg, vmscan: integrate soft reclaim tighter with zone shrinking code This patchset is sitting out of tree for quite some time without any objections. I would be really happy if it made it into 3.12. I do not want to push it too hard but I think this work is basically ready and waiting more doesn't help. The basic idea is quite simple. Pull soft reclaim into shrink_zone in the first step and get rid of the previous soft reclaim infrastructure. shrink_zone is done in two passes now. First it tries to do the soft limit reclaim and it falls back to reclaim-all mode if no group is over the limit or no pages have been scanned. The second pass happens at the same priority so the only time we waste is the memcg tree walk which has been updated in the third step to have only negligible overhead. As a bonus we will get rid of a _lot_ of code by this and soft reclaim will not stand out like before when it wasn't integrated into the zone shrinking code and it reclaimed at priority 0 (the testing results show that some workloads suffers from such an aggressive reclaim). The clean up is in a separate patch because I felt it would be easier to review that way. The second step is soft limit reclaim integration into targeted reclaim. It should be rather straight forward. Soft limit has been used only for the global reclaim so far but it makes sense for any kind of pressure coming from up-the-hierarchy, including targeted reclaim. The third step (patches 4-8) addresses the tree walk overhead by enhancing memcg iterators to enable skipping whole subtrees and tracking number of over soft limit children at each level of the hierarchy. This information is updated same way the old soft limit tree was updated (from memcg_check_events) so we shouldn't see an additional overhead. In fact mem_cgroup_update_soft_limit is much simpler than tree manipulation done previously. __shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for mem_cgroup_iter so the decision whether a particular group should be visited is done at the iterator level which allows us to decide to skip the whole subtree as well (if there is no child in excess). This reduces the tree walk overhead considerably. * TEST 1 ======== My primary test case was a parallel kernel build with 2 groups (make is running with -j8 with a distribution .config in a separate cgroup without any hard limit) on a 32 CPU machine booted with 1GB memory and both builds run taskset to Node 0 cpus. I was mostly interested in 2 setups. Default - no soft limit set and - and 0 soft limit set to both groups. The first one should tell us whether the rework regresses the default behavior while the second one should show us improvements in an extreme case where both workloads are always over the soft limit. /usr/bin/time -v has been used to collect the statistics and each configuration had 3 runs after fresh boot without any other load on the system. base is mmotm-2013-07-18-16-40 rework all 8 patches applied on top of base * No-limit User no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6 no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6 System no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6 no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6 Elapsed no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6 no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6 The results are within noise. Elapsed time has a bigger variance but the average looks good. * 0-limit User 0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6 0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6 System 0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6 0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6 Elapsed 0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6 0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6 The improvement is really huge here (even bigger than with my previous testing and I suspect that this highly depends on the storage). Page fault statistics tell us at least part of the story: Minor 0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6 0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6 Major 0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6 0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6 Same as with my previous testing Minor faults are more or less within noise but Major fault count is way bellow the base kernel. While this looks as a nice win it is fair to say that 0-limit configuration is quite artificial. So I was playing with 0-no-limit loads as well. * TEST 2 ======== The following results are from 2 groups configuration on a 16GB machine (single NUMA node). - A running stream IO (dd if=/dev/zero of=local.file bs=1024) with 2*TotalMem with 0 soft limit. - B running a mem_eater which consumes TotalMem-1G without any limit. The mem_eater consumes the memory in 100 chunks with 1s nap after each mmap+poppulate so that both loads have chance to fight for the memory. The expected result is that B shouldn't be reclaimed and A shouldn't see a big dropdown in elapsed time. User base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3 rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3 System base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3 rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3 Elapsed base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3 rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3 System time improved slightly as well as Elapsed. My previous testing has shown worse numbers but this again seem to depend on the storage speed. My theory is that the writeback doesn't catch up and prio-0 soft reclaim falls into wait on writeback page too often in the base kernel. The patched kernel doesn't do that because the soft reclaim is done from the kswapd/direct reclaim context. This can be seen on the following graph nicely. The A's group usage_in_bytes regurarly drops really low very often. All 3 runs http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png resp. a detail of the single run http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png mem_eater seems to be doing better as well. It gets to the full allocation size faster as can be seen on the following graph: http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png /proc/meminfo collected during the test also shows that rework kernel hasn't swapped that much (well almost not at all): base: max: 123900 K avg: 56388.29 K rework: max: 300 K avg: 128.68 K kswapd and direct reclaim statistics are of no use unfortunatelly because soft reclaim is not accounted properly as the counters are hidden by global_reclaim() checks in the base kernel. * TEST 3 ======== Another test was the same configuration as TEST2 except the stream IO was replaced by a single kbuild (16 parallel jobs bound to Node0 cpus same as in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only 200MB left. Kbuild did better with the rework kernel here as well: User base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3 rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3 System base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3 rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3 Elapsed base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3 rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3 Minor base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3 rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3 Major base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3 rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3 Again we can see a significant improvement in Elapsed (it also seems to be more stable), there is a huge dropdown for the Major page faults and much more swapping: base: max: 583736 K avg: 112547.43 K rework: max: 4012 K avg: 124.36 K Graphs from all three runs show the variability of the kbuild quite nicely. It even seems that it took longer after every run with the base kernel which would be quite surprising as the source tree for the build is removed and caches are dropped after each run so the build operates on a freshly extracted sources everytime. http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png My other testing shows that this is just a matter of timing and other runs behave differently the std for Elapsed time is similar ~50. Example of other three runs: http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png So to wrap this up. The series is still doing good and improves the soft limit. The testing results for bunch of cgroups with both stream IO and kbuild loads can be found in "memcg: track children in soft limit excess to improve soft limit". This patch: Memcg soft reclaim has been traditionally triggered from the global reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim then picked up a group which exceeds the soft limit the most and reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. The infrastructure requires per-node-zone trees which hold over-limit groups and keep them up-to-date (via memcg_check_events) which is not cost free. Although this overhead hasn't turned out to be a bottle neck the implementation is suboptimal because mem_cgroup_update_tree has no idea which zones consumed memory over the limit so we could easily end up having a group on a node-zone tree having only few pages from that node-zone. This patch doesn't try to fix node-zone trees management because it seems that integrating soft reclaim into zone shrinking sounds much easier and more appropriate for several reasons. First of all 0 priority reclaim was a crude hack which might lead to big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim should be applicable also to the targeted reclaim which is awkward right now without additional hacks. Last but not least the whole infrastructure eats quite some code. After this patch shrink_zone is done in 2 passes. First it tries to do the soft reclaim if appropriate (only for global reclaim for now to keep compatible with the original state) and fall back to ignoring soft limit if no group is eligible to soft reclaim or nothing has been scanned during the first pass. Only groups which are over their soft limit or any of their parents up the hierarchy is over the limit are considered eligible during the first pass. Soft limit tree which is not necessary anymore will be removed in the follow up patch to make this patch smaller and easier to review. Signed-off-by: Michal Hocko Reviewed-by: Glauber Costa Reviewed-by: Tejun Heo Cc: Johannes Weiner Cc: KAMEZAWA Hiroyuki Cc: Ying Han Cc: Hugh Dickins Cc: Michel Lespinasse Cc: Greg Thelen Cc: KOSAKI Motohiro Cc: Balbir Singh Cc: Glauber Costa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6c416092e324..4b78661c68d0 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -180,9 +180,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -359,11 +357,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) { - return 0; + return false; } static inline void mem_cgroup_split_huge_fixup(struct page *head) -- cgit v1.2.3 From a5b7c87f92076352dbff2fe0423ec255e1c9a71b Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 12 Sep 2013 15:13:25 -0700 Subject: vmscan, memcg: do softlimit reclaim also for targeted reclaim Soft reclaim has been done only for the global reclaim (both background and direct). Since "memcg: integrate soft reclaim tighter with zone shrinking code" there is no reason for this limitation anymore as the soft limit reclaim doesn't use any special code paths and it is a part of the zone shrinking code which is used by both global and targeted reclaims. From the semantic point of view it is natural to consider soft limit before touching all groups in the hierarchy tree which is touching the hard limit because soft limit tells us where to push back when there is a memory pressure. It is not important whether the pressure comes from the limit or imbalanced zones. This patch simply enables soft reclaim unconditionally in mem_cgroup_should_soft_reclaim so it is enabled for both global and targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn about the root of the reclaim to know where to stop checking soft limit state of parents up the hierarchy. Say we have A (over soft limit) \ B (below s.l., hit the hard limit) / \ C D (below s.l.) B is the source of the outside memory pressure now for D but we shouldn't soft reclaim it because it is behaving well under B subtree and we can still reclaim from C (pressumably it is over the limit). mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the hierarchy at B (root of the memory pressure). Signed-off-by: Michal Hocko Reviewed-by: Glauber Costa Reviewed-by: Tejun Heo Cc: Balbir Singh Cc: Greg Thelen Cc: Hugh Dickins Cc: Johannes Weiner Cc: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro Cc: Michel Lespinasse Cc: Ying Han Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 6 ++++-- mm/memcontrol.c | 14 +++++++++----- mm/vmscan.c | 4 ++-- 3 files changed, 15 insertions(+), 9 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 4b78661c68d0..d8dd6560621b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -180,7 +180,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -357,7 +358,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { return false; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 87a448dd9c10..c016e001c5b2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1773,11 +1773,13 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) #endif /* - * A group is eligible for the soft limit reclaim if it is - * a) is over its soft limit + * A group is eligible for the soft limit reclaim under the given root + * hierarchy if + * a) it is over its soft limit * b) any parent up the hierarchy is over its soft limit */ -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; @@ -1785,12 +1787,14 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) return true; /* - * If any parent up the hierarchy is over its soft limit then we - * have to obey and reclaim from this group as well. + * If any parent up to the root in the hierarchy is over its soft limit + * then we have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { if (res_counter_soft_limit_excess(&parent->res)) return true; + if (parent == root) + break; } return false; diff --git a/mm/vmscan.c b/mm/vmscan.c index cf4643807ec2..1896e7ca494b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -142,7 +142,7 @@ static bool global_reclaim(struct scan_control *sc) static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) { - return !mem_cgroup_disabled() && global_reclaim(sc); + return !mem_cgroup_disabled(); } #else static bool global_reclaim(struct scan_control *sc) @@ -2161,7 +2161,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) struct lruvec *lruvec; if (soft_reclaim && - !mem_cgroup_soft_reclaim_eligible(memcg)) { + !mem_cgroup_soft_reclaim_eligible(memcg, root)) { memcg = mem_cgroup_iter(root, memcg, &reclaim); continue; } -- cgit v1.2.3 From de57780dc659f95b17ccb649f003278dde0b5b86 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 12 Sep 2013 15:13:26 -0700 Subject: memcg: enhance memcg iterator to support predicates The caller of the iterator might know that some nodes or even subtrees should be skipped but there is no way to tell iterators about that so the only choice left is to let iterators to visit each node and do the selection outside of the iterating code. This, however, doesn't scale well with hierarchies with many groups where only few groups are interesting. This patch adds mem_cgroup_iter_cond variant of the iterator with a callback which gets called for every visited node. There are three possible ways how the callback can influence the walk. Either the node is visited, it is skipped but the tree walk continues down the tree or the whole subtree of the current group is skipped. [hughd@google.com: fix memcg-less page reclaim] Signed-off-by: Michal Hocko Cc: Balbir Singh Cc: Glauber Costa Cc: Greg Thelen Cc: Johannes Weiner Cc: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro Cc: Michel Lespinasse Cc: Tejun Heo Cc: Ying Han Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 49 ++++++++++++++++++++++++++++---- mm/memcontrol.c | 70 ++++++++++++++++++++++++++++++++++++---------- mm/vmscan.c | 16 ++++------- 3 files changed, 103 insertions(+), 32 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d8dd6560621b..d4d1f9b0dbba 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -41,6 +41,23 @@ struct mem_cgroup_reclaim_cookie { unsigned int generation; }; +enum mem_cgroup_filter_t { + VISIT, /* visit current node */ + SKIP, /* skip the current node and continue traversal */ + SKIP_TREE, /* skip the whole subtree and continue traversal */ +}; + +/* + * mem_cgroup_filter_t predicate might instruct mem_cgroup_iter_cond how to + * iterate through the hierarchy tree. Each tree element is checked by the + * predicate before it is returned by the iterator. If a filter returns + * SKIP or SKIP_TREE then the iterator code continues traversal (with the + * next node down the hierarchy or the next node that doesn't belong under the + * memcg's subtree). + */ +typedef enum mem_cgroup_filter_t +(*mem_cgroup_iter_filter)(struct mem_cgroup *memcg, struct mem_cgroup *root); + #ifdef CONFIG_MEMCG /* * All "charge" functions with gfp_mask should use GFP_KERNEL or @@ -108,9 +125,18 @@ mem_cgroup_prepare_migration(struct page *page, struct page *newpage, extern void mem_cgroup_end_migration(struct mem_cgroup *memcg, struct page *oldpage, struct page *newpage, bool migration_ok); -struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *, - struct mem_cgroup *, - struct mem_cgroup_reclaim_cookie *); +struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root, + struct mem_cgroup *prev, + struct mem_cgroup_reclaim_cookie *reclaim, + mem_cgroup_iter_filter cond); + +static inline struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, + struct mem_cgroup *prev, + struct mem_cgroup_reclaim_cookie *reclaim) +{ + return mem_cgroup_iter_cond(root, prev, reclaim, NULL); +} + void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *); /* @@ -180,7 +206,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, +enum mem_cgroup_filter_t +mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); @@ -295,6 +322,15 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg, struct page *oldpage, struct page *newpage, bool migration_ok) { } +static inline struct mem_cgroup * +mem_cgroup_iter_cond(struct mem_cgroup *root, + struct mem_cgroup *prev, + struct mem_cgroup_reclaim_cookie *reclaim, + mem_cgroup_iter_filter cond) +{ + /* first call must return non-NULL, second return NULL */ + return (struct mem_cgroup *)(unsigned long)!prev; +} static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, @@ -358,10 +394,11 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, +enum mem_cgroup_filter_t +mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root) { - return false; + return VISIT; } static inline void mem_cgroup_split_huge_fixup(struct page *head) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c016e001c5b2..a4bb857d902c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -875,6 +875,15 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm) return memcg; } +static enum mem_cgroup_filter_t +mem_cgroup_filter(struct mem_cgroup *memcg, struct mem_cgroup *root, + mem_cgroup_iter_filter cond) +{ + if (!cond) + return VISIT; + return cond(memcg, root); +} + /* * Returns a next (in a pre-order walk) alive memcg (with elevated css * ref. count) or NULL if the whole root's subtree has been visited. @@ -882,7 +891,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm) * helper function to be used by mem_cgroup_iter */ static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root, - struct mem_cgroup *last_visited) + struct mem_cgroup *last_visited, mem_cgroup_iter_filter cond) { struct cgroup_subsys_state *prev_css, *next_css; @@ -900,11 +909,31 @@ skip_node: if (next_css) { struct mem_cgroup *mem = mem_cgroup_from_css(next_css); - if (css_tryget(&mem->css)) - return mem; - else { + switch (mem_cgroup_filter(mem, root, cond)) { + case SKIP: prev_css = next_css; goto skip_node; + case SKIP_TREE: + if (mem == root) + return NULL; + /* + * css_rightmost_descendant is not an optimal way to + * skip through a subtree (especially for imbalanced + * trees leaning to right) but that's what we have right + * now. More effective solution would be traversing + * right-up for first non-NULL without calling + * css_next_descendant_pre afterwards. + */ + prev_css = css_rightmost_descendant(next_css); + goto skip_node; + case VISIT: + if (css_tryget(&mem->css)) + return mem; + else { + prev_css = next_css; + goto skip_node; + } + break; } } @@ -968,6 +997,7 @@ static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter, * @root: hierarchy root * @prev: previously returned memcg, NULL on first invocation * @reclaim: cookie for shared reclaim walks, NULL for full walks + * @cond: filter for visited nodes, NULL for no filter * * Returns references to children of the hierarchy below @root, or * @root itself, or %NULL after a full round-trip. @@ -980,15 +1010,18 @@ static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter, * divide up the memcgs in the hierarchy among all concurrent * reclaimers operating on the same zone and priority. */ -struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, +struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root, struct mem_cgroup *prev, - struct mem_cgroup_reclaim_cookie *reclaim) + struct mem_cgroup_reclaim_cookie *reclaim, + mem_cgroup_iter_filter cond) { struct mem_cgroup *memcg = NULL; struct mem_cgroup *last_visited = NULL; - if (mem_cgroup_disabled()) - return NULL; + if (mem_cgroup_disabled()) { + /* first call must return non-NULL, second return NULL */ + return (struct mem_cgroup *)(unsigned long)!prev; + } if (!root) root = root_mem_cgroup; @@ -999,7 +1032,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, if (!root->use_hierarchy && root != root_mem_cgroup) { if (prev) goto out_css_put; - return root; + if (mem_cgroup_filter(root, root, cond) == VISIT) + return root; + return NULL; } rcu_read_lock(); @@ -1022,7 +1057,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, last_visited = mem_cgroup_iter_load(iter, root, &seq); } - memcg = __mem_cgroup_iter_next(root, last_visited); + memcg = __mem_cgroup_iter_next(root, last_visited, cond); if (reclaim) { mem_cgroup_iter_update(iter, last_visited, memcg, seq); @@ -1033,7 +1068,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, reclaim->generation = iter->generation; } - if (prev && !memcg) + /* + * We have finished the whole tree walk or no group has been + * visited because filter told us to skip the root node. + */ + if (!memcg && (prev || (cond && !last_visited))) goto out_unlock; } out_unlock: @@ -1778,13 +1817,14 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) * a) it is over its soft limit * b) any parent up the hierarchy is over its soft limit */ -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, +enum mem_cgroup_filter_t +mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; if (res_counter_soft_limit_excess(&memcg->res)) - return true; + return VISIT; /* * If any parent up to the root in the hierarchy is over its soft limit @@ -1792,12 +1832,12 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, */ while((parent = parent_mem_cgroup(parent))) { if (res_counter_soft_limit_excess(&parent->res)) - return true; + return VISIT; if (parent == root) break; } - return false; + return SKIP; } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 1896e7ca494b..f2e35099508b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2151,21 +2151,16 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) .zone = zone, .priority = sc->priority, }; - struct mem_cgroup *memcg; + struct mem_cgroup *memcg = NULL; + mem_cgroup_iter_filter filter = (soft_reclaim) ? + mem_cgroup_soft_reclaim_eligible : NULL; nr_reclaimed = sc->nr_reclaimed; nr_scanned = sc->nr_scanned; - memcg = mem_cgroup_iter(root, NULL, &reclaim); - do { + while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) { struct lruvec *lruvec; - if (soft_reclaim && - !mem_cgroup_soft_reclaim_eligible(memcg, root)) { - memcg = mem_cgroup_iter(root, memcg, &reclaim); - continue; - } - lruvec = mem_cgroup_zone_lruvec(zone, memcg); shrink_lruvec(lruvec, sc); @@ -2185,8 +2180,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) mem_cgroup_iter_break(root, memcg); break; } - memcg = mem_cgroup_iter(root, memcg, &reclaim); - } while (memcg); + } vmpressure(sc->gfp_mask, sc->target_mem_cgroup, sc->nr_scanned - nr_scanned, -- cgit v1.2.3 From 759496ba6407c6994d6a5ce3a5e74937d7816208 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 12 Sep 2013 15:13:39 -0700 Subject: arch: mm: pass userspace fault flag to generic fault handler Unlike global OOM handling, memory cgroup code will invoke the OOM killer in any OOM situation because it has no way of telling faults occuring in kernel context - which could be handled more gracefully - from user-triggered faults. Pass a flag that identifies faults originating in user space from the architecture-specific fault handlers to generic code so that memcg OOM handling can be improved. Signed-off-by: Johannes Weiner Reviewed-by: Michal Hocko Cc: David Rientjes Cc: KAMEZAWA Hiroyuki Cc: azurIt Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/alpha/mm/fault.c | 7 ++++--- arch/arc/mm/fault.c | 6 ++++-- arch/arm/mm/fault.c | 9 ++++++--- arch/arm64/mm/fault.c | 17 ++++++++++------- arch/avr32/mm/fault.c | 2 ++ arch/cris/mm/fault.c | 6 ++++-- arch/frv/mm/fault.c | 10 ++++++---- arch/hexagon/mm/vm_fault.c | 6 ++++-- arch/ia64/mm/fault.c | 6 ++++-- arch/m32r/mm/fault.c | 10 ++++++---- arch/m68k/mm/fault.c | 2 ++ arch/metag/mm/fault.c | 6 ++++-- arch/microblaze/mm/fault.c | 7 +++++-- arch/mips/mm/fault.c | 6 ++++-- arch/mn10300/mm/fault.c | 2 ++ arch/openrisc/mm/fault.c | 1 + arch/parisc/mm/fault.c | 7 +++++-- arch/powerpc/mm/fault.c | 7 ++++--- arch/s390/mm/fault.c | 2 ++ arch/score/mm/fault.c | 7 ++++++- arch/sh/mm/fault.c | 9 ++++++--- arch/sparc/mm/fault_32.c | 12 +++++++++--- arch/sparc/mm/fault_64.c | 6 ++++-- arch/tile/mm/fault.c | 7 +++++-- arch/um/kernel/trap.c | 20 ++++++++++++-------- arch/unicore32/mm/fault.c | 8 ++++++-- arch/x86/mm/fault.c | 8 +++++--- arch/xtensa/mm/fault.c | 2 ++ include/linux/mm.h | 1 + 29 files changed, 135 insertions(+), 64 deletions(-) (limited to 'include') diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index 0c4132dd3507..98838a05ba6d 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -89,8 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, const struct exception_table_entry *fixup; int fault, si_code = SEGV_MAPERR; siginfo_t info; - unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (cause > 0 ? FAULT_FLAG_WRITE : 0)); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults (or is suppressed by the PALcode). Support that for older CPUs @@ -115,7 +114,8 @@ do_page_fault(unsigned long address, unsigned long mmcsr, if (address >= TASK_SIZE) goto vmalloc_fault; #endif - + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -142,6 +142,7 @@ retry: } else { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; } /* If for any reason at all we couldn't handle the fault, diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c index 6b0bb415af40..d63f3de0cd5b 100644 --- a/arch/arc/mm/fault.c +++ b/arch/arc/mm/fault.c @@ -60,8 +60,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address) siginfo_t info; int fault, ret; int write = regs->ecr_cause & ECR_C_PROTV_STORE; /* ST/EX */ - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; /* * We fault-in kernel-space virtual memory on-demand. The @@ -89,6 +88,8 @@ void do_page_fault(struct pt_regs *regs, unsigned long address) if (in_atomic() || !mm) goto no_context; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -117,6 +118,7 @@ good_area: if (write) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; } else { if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto bad_area; diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index 217bcbfde42e..eb8830a4c5ed 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -261,9 +261,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) struct task_struct *tsk; struct mm_struct *mm; int fault, sig, code; - int write = fsr & FSR_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; if (notify_page_fault(regs, fsr)) return 0; @@ -282,6 +280,11 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) if (in_atomic() || !mm) goto no_context; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (fsr & FSR_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * As per x86, we may deadlock here. However, since the kernel only * validly references user space from well defined areas of the code, diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 0bb7db41f4fe..6d6acf153bff 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -199,13 +199,6 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr, unsigned long vm_flags = VM_READ | VM_WRITE | VM_EXEC; unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; - if (esr & ESR_LNX_EXEC) { - vm_flags = VM_EXEC; - } else if ((esr & ESR_WRITE) && !(esr & ESR_CM)) { - vm_flags = VM_WRITE; - mm_flags |= FAULT_FLAG_WRITE; - } - tsk = current; mm = tsk->mm; @@ -220,6 +213,16 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr, if (in_atomic() || !mm) goto no_context; + if (user_mode(regs)) + mm_flags |= FAULT_FLAG_USER; + + if (esr & ESR_LNX_EXEC) { + vm_flags = VM_EXEC; + } else if ((esr & ESR_WRITE) && !(esr & ESR_CM)) { + vm_flags = VM_WRITE; + mm_flags |= FAULT_FLAG_WRITE; + } + /* * As per x86, we may deadlock here. However, since the kernel only * validly references user space from well defined areas of the code, diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c index 2ca27b055825..0eca93327195 100644 --- a/arch/avr32/mm/fault.c +++ b/arch/avr32/mm/fault.c @@ -86,6 +86,8 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs) local_irq_enable(); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c index 73312ab6c696..1790f22e71a2 100644 --- a/arch/cris/mm/fault.c +++ b/arch/cris/mm/fault.c @@ -58,8 +58,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs, struct vm_area_struct * vma; siginfo_t info; int fault; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - ((writeaccess & 1) ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; D(printk(KERN_DEBUG "Page fault for %lX on %X at %lX, prot %d write %d\n", @@ -117,6 +116,8 @@ do_page_fault(unsigned long address, struct pt_regs *regs, if (in_atomic() || !mm) goto no_context; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -155,6 +156,7 @@ retry: } else if (writeaccess == 1) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; } else { if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto bad_area; diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c index 331c1e2cfb67..9a66372fc7c7 100644 --- a/arch/frv/mm/fault.c +++ b/arch/frv/mm/fault.c @@ -34,11 +34,11 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear struct vm_area_struct *vma; struct mm_struct *mm; unsigned long _pme, lrai, lrad, fixup; + unsigned long flags = 0; siginfo_t info; pgd_t *pge; pud_t *pue; pte_t *pte; - int write; int fault; #if 0 @@ -81,6 +81,9 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear if (in_atomic() || !mm) goto no_context; + if (user_mode(__frame)) + flags |= FAULT_FLAG_USER; + down_read(&mm->mmap_sem); vma = find_vma(mm, ear0); @@ -129,7 +132,6 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear */ good_area: info.si_code = SEGV_ACCERR; - write = 0; switch (esr0 & ESR0_ATXC) { default: /* handle write to write protected page */ @@ -140,7 +142,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear #endif if (!(vma->vm_flags & VM_WRITE)) goto bad_area; - write = 1; + flags |= FAULT_FLAG_WRITE; break; /* handle read from protected page */ @@ -162,7 +164,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, ear0, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, ear0, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index 1bd276dbec7d..8704c9320032 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -53,8 +53,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) int si_code = SEGV_MAPERR; int fault; const struct exception_table_entry *fixup; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (cause > 0 ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; /* * If we're in an interrupt or have no user context, @@ -65,6 +64,8 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) local_irq_enable(); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -96,6 +97,7 @@ good_area: case FLT_STORE: if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; break; } diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 6cf0341f978e..7225dad87094 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -90,8 +90,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re mask = ((((isr >> IA64_ISR_X_BIT) & 1UL) << VM_EXEC_BIT) | (((isr >> IA64_ISR_W_BIT) & 1UL) << VM_WRITE_BIT)); - flags |= ((mask & VM_WRITE) ? FAULT_FLAG_WRITE : 0); - /* mmap_sem is performance critical.... */ prefetchw(&mm->mmap_sem); @@ -119,6 +117,10 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re if (notify_page_fault(regs, TRAP_BRKPT)) return; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (mask & VM_WRITE) + flags |= FAULT_FLAG_WRITE; retry: down_read(&mm->mmap_sem); diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c index 3cdfa9c1d091..e9c6a8014bd6 100644 --- a/arch/m32r/mm/fault.c +++ b/arch/m32r/mm/fault.c @@ -78,7 +78,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, struct mm_struct *mm; struct vm_area_struct * vma; unsigned long page, addr; - int write; + unsigned long flags = 0; int fault; siginfo_t info; @@ -117,6 +117,9 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, if (in_atomic() || !mm) goto bad_area_nosemaphore; + if (error_code & ACE_USERMODE) + flags |= FAULT_FLAG_USER; + /* When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the * kernel and should generate an OOPS. Unfortunately, in the case of an @@ -166,14 +169,13 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, */ good_area: info.si_code = SEGV_ACCERR; - write = 0; switch (error_code & (ACE_WRITE|ACE_PROTECTION)) { default: /* 3: write, present */ /* fall through */ case ACE_WRITE: /* write, not present */ if (!(vma->vm_flags & VM_WRITE)) goto bad_area; - write++; + flags |= FAULT_FLAG_WRITE; break; case ACE_PROTECTION: /* read, present */ case 0: /* read, not present */ @@ -194,7 +196,7 @@ good_area: */ addr = (address & PAGE_MASK); set_thread_fault_code(error_code); - fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index a563727806bf..eb1d61f68725 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -88,6 +88,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, if (in_atomic() || !mm) goto no_context; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c index 8fddf46e6c62..332680e5ebf2 100644 --- a/arch/metag/mm/fault.c +++ b/arch/metag/mm/fault.c @@ -53,8 +53,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, struct vm_area_struct *vma, *prev_vma; siginfo_t info; int fault; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write_access ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; @@ -109,6 +108,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, if (in_atomic() || !mm) goto no_context; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); @@ -121,6 +122,7 @@ good_area: if (write_access) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; } else { if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index 731f739d17a1..fa4cf52aa7a6 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -92,8 +92,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, int code = SEGV_MAPERR; int is_write = error_code & ESR_S; int fault; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (is_write ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; regs->ear = address; regs->esr = error_code; @@ -121,6 +120,9 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, die("Weird page fault", regs, SIGSEGV); } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + /* When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the * kernel and should generate an OOPS. Unfortunately, in the case of an @@ -199,6 +201,7 @@ good_area: if (unlikely(is_write)) { if (unlikely(!(vma->vm_flags & VM_WRITE))) goto bad_area; + flags |= FAULT_FLAG_WRITE; /* a read */ } else { /* protection fault */ diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 94d3a31ab144..becc42bb1849 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -42,8 +42,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, const int field = sizeof(unsigned long) * 2; siginfo_t info; int fault; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; #if 0 printk("Cpu%d[%s:%d:%0*lx:%ld:%0*lx]\n", raw_smp_processor_id(), @@ -93,6 +92,8 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, if (in_atomic() || !mm) goto bad_area_nosemaphore; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -114,6 +115,7 @@ good_area: if (write) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; } else { if (cpu_has_rixi) { if (address == regs->cp0_epc && !(vma->vm_flags & VM_EXEC)) { diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 8a2e6ded9a44..3516cbdf1ee9 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -171,6 +171,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code, if (in_atomic() || !mm) goto no_context; + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index 4a41f8493ab0..0703acf7d327 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -86,6 +86,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, if (user_mode(regs)) { /* Exception was in userspace: reenable interrupts */ local_irq_enable(); + flags |= FAULT_FLAG_USER; } else { /* If exception was in a syscall, then IRQ's may have * been enabled or disabled. If they were enabled, diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index f247a3480e8e..d10d27a720c0 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -180,6 +180,10 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, if (in_atomic() || !mm) goto no_context; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (acc_type & VM_WRITE) + flags |= FAULT_FLAG_WRITE; retry: down_read(&mm->mmap_sem); vma = find_vma_prev(mm, address, &prev_vma); @@ -203,8 +207,7 @@ good_area: * fault. */ - fault = handle_mm_fault(mm, vma, address, - flags | ((acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0)); + fault = handle_mm_fault(mm, vma, address, flags); if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) return; diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 2dd69bf4af46..51ab9e7e6c39 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -223,9 +223,6 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, is_write = error_code & ESR_DST; #endif /* CONFIG_4xx || CONFIG_BOOKE */ - if (is_write) - flags |= FAULT_FLAG_WRITE; - #ifdef CONFIG_PPC_ICSWX /* * we need to do this early because this "data storage @@ -288,6 +285,9 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, if (user_mode(regs)) store_update_sp = store_updates_sp(regs); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + /* When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the * kernel and should generate an OOPS. Unfortunately, in the case of an @@ -415,6 +415,7 @@ good_area: } else if (is_write) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; /* a read */ } else { /* protection fault */ diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index 7de4469915f0..fc6679210d83 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -302,6 +302,8 @@ static inline int do_exception(struct pt_regs *regs, int access) address = trans_exc_code & __FAIL_ADDR_MASK; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400) flags |= FAULT_FLAG_WRITE; down_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 4b71a626d41e..52238983527d 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -75,6 +76,9 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, if (in_atomic() || !mm) goto bad_area_nosemaphore; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + down_read(&mm->mmap_sem); vma = find_vma(mm, address); if (!vma) @@ -95,6 +99,7 @@ good_area: if (write) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; } else { if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))) goto bad_area; @@ -105,7 +110,7 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c index 1f49c28affa9..541dc6101508 100644 --- a/arch/sh/mm/fault.c +++ b/arch/sh/mm/fault.c @@ -400,9 +400,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, struct mm_struct *mm; struct vm_area_struct * vma; int fault; - int write = error_code & FAULT_CODE_WRITE; - unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0)); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -476,6 +474,11 @@ good_area: set_thread_fault_code(error_code); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (error_code & FAULT_CODE_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index e98bfda205a2..59dbd4645725 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -177,8 +177,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, unsigned long g2; int from_user = !(regs->psr & PSR_PS); int fault, code; - unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0)); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; if (text_fault) address = regs->pc; @@ -235,6 +234,11 @@ good_area: goto bad_area; } + if (from_user) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo @@ -383,6 +387,7 @@ static void force_user_fault(unsigned long address, int write) struct vm_area_struct *vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned int flags = FAULT_FLAG_USER; int code; code = SEGV_MAPERR; @@ -402,11 +407,12 @@ good_area: if (write) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; } else { if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto bad_area; } - switch (handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0)) { + switch (handle_mm_fault(mm, vma, address, flags)) { case VM_FAULT_SIGBUS: case VM_FAULT_OOM: goto do_sigbus; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 5062ff389e83..2ebec263d685 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -315,7 +315,8 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) bad_kernel_pc(regs, address); return; } - } + } else + flags |= FAULT_FLAG_USER; /* * If we're in an interrupt or have no user @@ -418,13 +419,14 @@ good_area: vma->vm_file != NULL) set_thread_fault_code(fault_code | FAULT_CODE_BLKCOMMIT); + + flags |= FAULT_FLAG_WRITE; } else { /* Allow reads even for write-only mappings */ if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto bad_area; } - flags |= ((fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0); fault = handle_mm_fault(mm, vma, address, flags); if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 4fd2c0f2a66d..4c288f199453 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -280,8 +280,7 @@ static int handle_page_fault(struct pt_regs *regs, if (!is_page_fault) write = 1; - flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0)); + flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; is_kernel_mode = !user_mode(regs); @@ -365,6 +364,9 @@ static int handle_page_fault(struct pt_regs *regs, goto bad_area_nosemaphore; } + if (!is_kernel_mode) + flags |= FAULT_FLAG_USER; + /* * When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the @@ -425,6 +427,7 @@ good_area: #endif if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; } else { if (!is_page_fault || !(vma->vm_flags & VM_READ)) goto bad_area; diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index b2f5adf838dd..5c3aef74237f 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -30,8 +30,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, pmd_t *pmd; pte_t *pte; int err = -EFAULT; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (is_write ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; *code_out = SEGV_MAPERR; @@ -42,6 +41,8 @@ int handle_page_fault(unsigned long address, unsigned long ip, if (in_atomic()) goto out_nosemaphore; + if (is_user) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -58,12 +59,15 @@ retry: good_area: *code_out = SEGV_ACCERR; - if (is_write && !(vma->vm_flags & VM_WRITE)) - goto out; - - /* Don't require VM_READ|VM_EXEC for write faults! */ - if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC))) - goto out; + if (is_write) { + if (!(vma->vm_flags & VM_WRITE)) + goto out; + flags |= FAULT_FLAG_WRITE; + } else { + /* Don't require VM_READ|VM_EXEC for write faults! */ + if (!(vma->vm_flags & (VM_READ | VM_EXEC))) + goto out; + } do { int fault; diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index 8ed3c4509d84..0dc922dba915 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -209,8 +209,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) struct task_struct *tsk; struct mm_struct *mm; int fault, sig, code; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - ((!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -222,6 +221,11 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) if (in_atomic() || !mm) goto no_context; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (!(fsr ^ 0x12)) + flags |= FAULT_FLAG_WRITE; + /* * As per x86, we may deadlock here. However, since the kernel only * validly references user space from well defined areas of the code, diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 654be4ae3047..6d77c3866faa 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1011,9 +1011,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code) unsigned long address; struct mm_struct *mm; int fault; - int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -1083,6 +1081,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code) if (user_mode_vm(regs)) { local_irq_enable(); error_code |= PF_USER; + flags |= FAULT_FLAG_USER; } else { if (regs->flags & X86_EFLAGS_IF) local_irq_enable(); @@ -1109,6 +1108,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code) return; } + if (error_code & PF_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index 4b7bc8db170f..70fa7bc42b4a 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -72,6 +72,8 @@ void do_page_fault(struct pt_regs *regs) address, exccause, regs->pc, is_write? "w":"", is_exec? "x":""); #endif + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); vma = find_vma(mm, address); diff --git a/include/linux/mm.h b/include/linux/mm.h index caf543c7eaa7..3d9b503bcd00 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -176,6 +176,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ #define FAULT_FLAG_TRIED 0x40 /* second try */ +#define FAULT_FLAG_USER 0x80 /* The fault originated in userspace */ /* * vm_fault is filled by the the pagefault handler and passed to the vma's -- cgit v1.2.3 From 519e52473ebe9db5cdef44670d5a97f1fd53d721 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 12 Sep 2013 15:13:42 -0700 Subject: mm: memcg: enable memcg OOM killer only for user faults System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. Enable the memcg OOM killer only for user faults, where it's really the only option available. Signed-off-by: Johannes Weiner Acked-by: Michal Hocko Cc: David Rientjes Cc: KAMEZAWA Hiroyuki Cc: azurIt Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 44 ++++++++++++++++++++++++++++++++++++++++++++ include/linux/sched.h | 3 +++ mm/filemap.c | 11 ++++++++++- mm/memcontrol.c | 2 +- mm/memory.c | 40 ++++++++++++++++++++++++++++++---------- 5 files changed, 88 insertions(+), 12 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d4d1f9b0dbba..34ac6497d01a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -151,6 +151,37 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, extern void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage); +/** + * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task + * @new: true to enable, false to disable + * + * Toggle whether a failed memcg charge should invoke the OOM killer + * or just return -ENOMEM. Returns the previous toggle state. + */ +static inline bool mem_cgroup_toggle_oom(bool new) +{ + bool old; + + old = current->memcg_oom.may_oom; + current->memcg_oom.may_oom = new; + + return old; +} + +static inline void mem_cgroup_enable_oom(void) +{ + bool old = mem_cgroup_toggle_oom(true); + + WARN_ON(old == true); +} + +static inline void mem_cgroup_disable_oom(void) +{ + bool old = mem_cgroup_toggle_oom(false); + + WARN_ON(old == false); +} + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -383,6 +414,19 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page, { } +static inline bool mem_cgroup_toggle_oom(bool new) +{ + return false; +} + +static inline void mem_cgroup_enable_oom(void) +{ +} + +static inline void mem_cgroup_disable_oom(void) +{ +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/sched.h b/include/linux/sched.h index 45f254dddafc..9ce1fa53031f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1393,6 +1393,9 @@ struct task_struct { unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; unsigned int memcg_kmem_skip_account; + struct memcg_oom_info { + unsigned int may_oom:1; + } memcg_oom; #endif #ifdef CONFIG_UPROBES struct uprobe_task *utask; diff --git a/mm/filemap.c b/mm/filemap.c index e607728db4a8..e3b6fc8c0b7b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1614,6 +1614,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; struct page *page; + bool memcg_oom; pgoff_t size; int ret = 0; @@ -1622,7 +1623,11 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Do we have something in the page cache already? + * Do we have something in the page cache already? Either + * way, try readahead, but disable the memcg OOM killer for it + * as readahead is optional and no errors are propagated up + * the fault stack. The OOM killer is enabled while trying to + * instantiate the faulting page individually below. */ page = find_get_page(mapping, offset); if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) { @@ -1630,10 +1635,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We found the page, so try async readahead before * waiting for the lock. */ + memcg_oom = mem_cgroup_toggle_oom(false); do_async_mmap_readahead(vma, ra, file, page, offset); + mem_cgroup_toggle_oom(memcg_oom); } else if (!page) { /* No page in the page cache at all */ + memcg_oom = mem_cgroup_toggle_oom(false); do_sync_mmap_readahead(vma, ra, file, offset); + mem_cgroup_toggle_oom(memcg_oom); count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c4524458b7d0..0980bbf6438d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2454,7 +2454,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, return CHARGE_RETRY; /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) + if (!oom_check || !current->memcg_oom.may_oom) return CHARGE_NOMEM; /* check OOM */ if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize))) diff --git a/mm/memory.c b/mm/memory.c index 2b73dbde2274..a8f9deab8719 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3754,22 +3754,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. */ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3850,6 +3842,34 @@ retry: return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. */ + check_sync_rss_stat(current); + + /* + * Enable the memcg OOM handling for faults triggered in user + * space. Kernel faults are handled more gracefully. + */ + if (flags & FAULT_FLAG_USER) + mem_cgroup_enable_oom(); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (flags & FAULT_FLAG_USER) + mem_cgroup_disable_oom(); + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. -- cgit v1.2.3 From 3812c8c8f3953921ef18544110dafc3505c1ac62 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 12 Sep 2013 15:13:44 -0700 Subject: mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner Reported-by: azurIt Acked-by: Michal Hocko Cc: David Rientjes Cc: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 21 +++++++ include/linux/sched.h | 4 ++ mm/memcontrol.c | 154 +++++++++++++++++++++++++++++++-------------- mm/memory.c | 3 + mm/oom_kill.c | 7 ++- 5 files changed, 140 insertions(+), 49 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 34ac6497d01a..89d576cfcc4c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -157,6 +157,10 @@ extern void mem_cgroup_replace_page_cache(struct page *oldpage, * * Toggle whether a failed memcg charge should invoke the OOM killer * or just return -ENOMEM. Returns the previous toggle state. + * + * NOTE: Any path that enables the OOM killer before charging must + * call mem_cgroup_oom_synchronize() afterward to finalize the + * OOM handling and clean up. */ static inline bool mem_cgroup_toggle_oom(bool new) { @@ -182,6 +186,13 @@ static inline void mem_cgroup_disable_oom(void) WARN_ON(old == false); } +static inline bool task_in_memcg_oom(struct task_struct *p) +{ + return p->memcg_oom.in_memcg_oom; +} + +bool mem_cgroup_oom_synchronize(void); + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -427,6 +438,16 @@ static inline void mem_cgroup_disable_oom(void) { } +static inline bool task_in_memcg_oom(struct task_struct *p) +{ + return false; +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/sched.h b/include/linux/sched.h index 9ce1fa53031f..6682da36b293 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1395,6 +1395,10 @@ struct task_struct { unsigned int memcg_kmem_skip_account; struct memcg_oom_info { unsigned int may_oom:1; + unsigned int in_memcg_oom:1; + unsigned int oom_locked:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; } memcg_oom; #endif #ifdef CONFIG_UPROBES diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 04250cbf46c6..4b5cfb509270 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -255,6 +255,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; int swappiness; /* OOM-Killer disable */ @@ -2020,6 +2021,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -2031,19 +2033,17 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. + * try to call OOM killer */ -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, - int order) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { - struct oom_wait_info owait; bool locked; + int wakeups; - owait.memcg = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); + if (!current->memcg_oom.may_oom) + return; + + current->memcg_oom.in_memcg_oom = 1; /* * As with any blocking lock, a contender needs to start @@ -2051,12 +2051,8 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, * otherwise it can miss the wakeup from the unlock and sleep * indefinitely. This is just open-coded because our locking * is so particular to memcg hierarchies. - * - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + wakeups = atomic_read(&memcg->oom_wakeups); mem_cgroup_mark_under_oom(memcg); locked = mem_cgroup_oom_trylock(memcg); @@ -2066,15 +2062,95 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, if (locked && !memcg->oom_kill_disable) { mem_cgroup_unmark_under_oom(memcg); - finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask, order); + mem_cgroup_oom_unlock(memcg); + /* + * There is no guarantee that an OOM-lock contender + * sees the wakeups triggered by the OOM kill + * uncharges. Wake any sleepers explicitely. + */ + memcg_oom_recover(memcg); } else { - schedule(); - mem_cgroup_unmark_under_oom(memcg); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this + * is a page fault and somebody else is handling the + * OOM already, we need to sleep on the OOM waitqueue + * for this memcg until the situation is resolved. + * Which can take some time because it might be + * handled by a userspace task. + * + * However, this is the charge context, which means + * that we may sit on a large call stack and hold + * various filesystem locks, the mmap_sem etc. and we + * don't want the OOM handler to deadlock on them + * while we sit here and wait. Store the current OOM + * context in the task_struct, then return -ENOMEM. + * At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check + * back with us by calling + * mem_cgroup_oom_synchronize(), possibly putting the + * task to sleep. + */ + current->memcg_oom.oom_locked = locked; + current->memcg_oom.wakeups = wakeups; + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; } +} + +/** + * mem_cgroup_oom_synchronize - complete memcg OOM handling + * + * This has to be called at the end of a page fault if the the memcg + * OOM handler was enabled and the fault is returning %VM_FAULT_OOM. + * + * Memcg supports userspace OOM handling, so failed allocations must + * sleep on a waitqueue until the userspace task resolves the + * situation. Sleeping directly in the charge context with all kinds + * of locks held is not a good idea, instead we remember an OOM state + * in the task and mem_cgroup_oom_synchronize() has to be called at + * the end of the page fault to put the task to sleep and clean up the + * OOM state. + * + * Returns %true if an ongoing memcg OOM situation was detected and + * finalized, %false otherwise. + */ +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; + + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) + return false; + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. + */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_memcg; + + owait.memcg = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); - if (locked) { + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_memcg: + mem_cgroup_unmark_under_oom(memcg); + if (current->memcg_oom.oom_locked) { mem_cgroup_oom_unlock(memcg); /* * There is no guarantee that an OOM-lock contender @@ -2083,11 +2159,10 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, */ memcg_oom_recover(memcg); } - - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) - return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2400,12 +2475,11 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages, unsigned int min_pages, - bool oom_check) + bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2462,14 +2536,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check || !current->memcg_oom.may_oom) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize))) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize)); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2572,7 +2642,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2580,14 +2650,8 @@ again: goto bypass; } - oom_check = false; - if (oom && !nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages, - oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, + nr_pages, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2600,16 +2664,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index a8f9deab8719..5ec6f199e685 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3867,6 +3867,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (flags & FAULT_FLAG_USER) mem_cgroup_disable_oom(); + if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))) + mem_cgroup_oom_synchronize(); + return ret; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 98e75f2ac7bc..314e9d274381 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -678,9 +678,12 @@ out: */ void pagefault_out_of_memory(void) { - struct zonelist *zonelist = node_zonelist(first_online_node, - GFP_KERNEL); + struct zonelist *zonelist; + if (mem_cgroup_oom_synchronize()) + return; + + zonelist = node_zonelist(first_online_node, GFP_KERNEL); if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) { out_of_memory(NULL, 0, 0, NULL, false); clear_zonelist_oom(zonelist, GFP_KERNEL); -- cgit v1.2.3 From 34ff8dc08956098563989d8599840b130be81252 Mon Sep 17 00:00:00 2001 From: Sha Zhengju Date: Thu, 12 Sep 2013 15:13:46 -0700 Subject: memcg: correct RESOURCE_MAX to ULLONG_MAX Current RESOURCE_MAX is ULONG_MAX, but the value we used to set resource limit is unsigned long long, so we can set bigger value than that which is strange. The XXX_MAX should be reasonable max value, bigger than that should be overflow. Notice that this change will affect user output of default *.limit_in_bytes: before change: $ cat /cgroup/memory/memory.limit_in_bytes 9223372036854775807 after change: $ cat /cgroup/memory/memory.limit_in_bytes 18446744073709551615 But it doesn't alter the API in term of input - we can still use "echo -1 > *.limit_in_bytes" to reset the numbers to "unlimited". Signed-off-by: Sha Zhengju Signed-off-by: Qiang Huang Acked-by: Michal Hocko Cc: Daisuke Nishimura Cc: Jeff Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/res_counter.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h index 96a509b6be04..586bc7cb1dcf 100644 --- a/include/linux/res_counter.h +++ b/include/linux/res_counter.h @@ -54,7 +54,7 @@ struct res_counter { struct res_counter *parent; }; -#define RESOURCE_MAX (unsigned long long)LLONG_MAX +#define RESOURCE_MAX ULLONG_MAX /** * Helpers to interact with userspace -- cgit v1.2.3 From 6de5a8bfcae6e3b427d642eff078d8305b324b52 Mon Sep 17 00:00:00 2001 From: Sha Zhengju Date: Thu, 12 Sep 2013 15:13:47 -0700 Subject: memcg: rename RESOURCE_MAX to RES_COUNTER_MAX RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX. Signed-off-by: Sha Zhengju Signed-off-by: Qiang Huang Acked-by: Michal Hocko Cc: Daisuke Nishimura Cc: Jeff Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/res_counter.h | 2 +- kernel/res_counter.c | 8 ++++---- mm/memcontrol.c | 4 ++-- net/ipv4/tcp_memcontrol.c | 10 +++++----- 4 files changed, 12 insertions(+), 12 deletions(-) (limited to 'include') diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h index 586bc7cb1dcf..201a69749659 100644 --- a/include/linux/res_counter.h +++ b/include/linux/res_counter.h @@ -54,7 +54,7 @@ struct res_counter { struct res_counter *parent; }; -#define RESOURCE_MAX ULLONG_MAX +#define RES_COUNTER_MAX ULLONG_MAX /** * Helpers to interact with userspace diff --git a/kernel/res_counter.c b/kernel/res_counter.c index ff55247e7049..3f0417f97e76 100644 --- a/kernel/res_counter.c +++ b/kernel/res_counter.c @@ -17,8 +17,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent) { spin_lock_init(&counter->lock); - counter->limit = RESOURCE_MAX; - counter->soft_limit = RESOURCE_MAX; + counter->limit = RES_COUNTER_MAX; + counter->soft_limit = RES_COUNTER_MAX; counter->parent = parent; } @@ -182,12 +182,12 @@ int res_counter_memparse_write_strategy(const char *buf, { char *end; - /* return RESOURCE_MAX(unlimited) if "-1" is specified */ + /* return RES_COUNTER_MAX(unlimited) if "-1" is specified */ if (*buf == '-') { *res = simple_strtoull(buf + 1, &end, 10); if (*res != 1 || *end != '\0') return -EINVAL; - *res = RESOURCE_MAX; + *res = RES_COUNTER_MAX; return 0; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4b5cfb509270..2c71f243186e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4967,7 +4967,7 @@ static int memcg_update_kmem_limit(struct cgroup_subsys_state *css, u64 val) */ mutex_lock(&memcg_create_mutex); mutex_lock(&set_limit_mutex); - if (!memcg->kmem_account_flags && val != RESOURCE_MAX) { + if (!memcg->kmem_account_flags && val != RES_COUNTER_MAX) { if (cgroup_task_count(css->cgroup) || memcg_has_children(memcg)) { ret = -EBUSY; goto out; @@ -4977,7 +4977,7 @@ static int memcg_update_kmem_limit(struct cgroup_subsys_state *css, u64 val) ret = memcg_update_cache_sizes(memcg); if (ret) { - res_counter_set_limit(&memcg->kmem, RESOURCE_MAX); + res_counter_set_limit(&memcg->kmem, RES_COUNTER_MAX); goto out; } static_key_slow_inc(&memcg_kmem_enabled_key); diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c index 8a57d79b0b16..559d4ae6ebf4 100644 --- a/net/ipv4/tcp_memcontrol.c +++ b/net/ipv4/tcp_memcontrol.c @@ -87,8 +87,8 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val) if (!cg_proto) return -EINVAL; - if (val > RESOURCE_MAX) - val = RESOURCE_MAX; + if (val > RES_COUNTER_MAX) + val = RES_COUNTER_MAX; tcp = tcp_from_cgproto(cg_proto); @@ -101,9 +101,9 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val) tcp->tcp_prot_mem[i] = min_t(long, val >> PAGE_SHIFT, net->ipv4.sysctl_tcp_mem[i]); - if (val == RESOURCE_MAX) + if (val == RES_COUNTER_MAX) clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); - else if (val != RESOURCE_MAX) { + else if (val != RES_COUNTER_MAX) { /* * The active bit needs to be written after the static_key * update. This is what guarantees that the socket activation @@ -187,7 +187,7 @@ static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft) switch (cft->private) { case RES_LIMIT: - val = tcp_read_stat(memcg, RES_LIMIT, RESOURCE_MAX); + val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX); break; case RES_USAGE: val = tcp_read_usage(memcg); -- cgit v1.2.3 From 68b4876d996e8749142b2895bc2e251448996363 Mon Sep 17 00:00:00 2001 From: Sha Zhengju Date: Thu, 12 Sep 2013 15:13:50 -0700 Subject: memcg: remove MEMCG_NR_FILE_MAPPED While accounting memcg page stat, it's not worth to use MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the complexity and presumed performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly. Signed-off-by: Sha Zhengju Acked-by: KAMEZAWA Hiroyuki Acked-by: Michal Hocko Acked-by: Fengguang Wu Reviewed-by: Greg Thelen Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 27 +++++++++++++++++++-------- mm/memcontrol.c | 25 +------------------------ mm/rmap.c | 4 ++-- 3 files changed, 22 insertions(+), 34 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 89d576cfcc4c..f952a07ce7a9 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -30,9 +30,20 @@ struct page; struct mm_struct; struct kmem_cache; -/* Stats that can be updated by kernel. */ -enum mem_cgroup_page_stat_item { - MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */ +/* + * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c, + * These two lists should keep in accord with each other. + */ +enum mem_cgroup_stat_index { + /* + * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. + */ + MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ + MEM_CGROUP_STAT_RSS_HUGE, /* # of pages charged as anon huge */ + MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ + MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ + MEM_CGROUP_STAT_NSTATS, }; struct mem_cgroup_reclaim_cookie { @@ -233,17 +244,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page, } void mem_cgroup_update_page_stat(struct page *page, - enum mem_cgroup_page_stat_item idx, + enum mem_cgroup_stat_index idx, int val); static inline void mem_cgroup_inc_page_stat(struct page *page, - enum mem_cgroup_page_stat_item idx) + enum mem_cgroup_stat_index idx) { mem_cgroup_update_page_stat(page, idx, 1); } static inline void mem_cgroup_dec_page_stat(struct page *page, - enum mem_cgroup_page_stat_item idx) + enum mem_cgroup_stat_index idx) { mem_cgroup_update_page_stat(page, idx, -1); } @@ -449,12 +460,12 @@ static inline bool mem_cgroup_oom_synchronize(void) } static inline void mem_cgroup_inc_page_stat(struct page *page, - enum mem_cgroup_page_stat_item idx) + enum mem_cgroup_stat_index idx) { } static inline void mem_cgroup_dec_page_stat(struct page *page, - enum mem_cgroup_page_stat_item idx) + enum mem_cgroup_stat_index idx) { } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2c71f243186e..f3803462a7e6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -84,21 +84,6 @@ static int really_do_swap_account __initdata = 0; #endif -/* - * Statistics for memory cgroup. - */ -enum mem_cgroup_stat_index { - /* - * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. - */ - MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ - MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ - MEM_CGROUP_STAT_RSS_HUGE, /* # of pages charged as anon huge */ - MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ - MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ - MEM_CGROUP_STAT_NSTATS, -}; - static const char * const mem_cgroup_stat_names[] = { "cache", "rss", @@ -2231,7 +2216,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags) } void mem_cgroup_update_page_stat(struct page *page, - enum mem_cgroup_page_stat_item idx, int val) + enum mem_cgroup_stat_index idx, int val) { struct mem_cgroup *memcg; struct page_cgroup *pc = lookup_page_cgroup(page); @@ -2244,14 +2229,6 @@ void mem_cgroup_update_page_stat(struct page *page, if (unlikely(!memcg || !PageCgroupUsed(pc))) return; - switch (idx) { - case MEMCG_NR_FILE_MAPPED: - idx = MEM_CGROUP_STAT_FILE_MAPPED; - break; - default: - BUG(); - } - this_cpu_add(memcg->stat->count[idx], val); } diff --git a/mm/rmap.c b/mm/rmap.c index 07748e68b729..a21c976a8ec1 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1111,7 +1111,7 @@ void page_add_file_rmap(struct page *page) mem_cgroup_begin_update_page_stat(page, &locked, &flags); if (atomic_inc_and_test(&page->_mapcount)) { __inc_zone_page_state(page, NR_FILE_MAPPED); - mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED); + mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED); } mem_cgroup_end_update_page_stat(page, &locked, &flags); } @@ -1155,7 +1155,7 @@ void page_remove_rmap(struct page *page) NR_ANON_TRANSPARENT_HUGEPAGES); } else { __dec_zone_page_state(page, NR_FILE_MAPPED); - mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED); + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED); mem_cgroup_end_update_page_stat(page, &locked, &flags); } if (unlikely(PageMlocked(page))) -- cgit v1.2.3 From 3ea67d06e4679a16f69f66f43a8d6ee4778985fc Mon Sep 17 00:00:00 2001 From: Sha Zhengju Date: Thu, 12 Sep 2013 15:13:53 -0700 Subject: memcg: add per cgroup writeback pages accounting Add memcg routines to count writeback pages, later dirty pages will also be accounted. After Kame's commit 89c06bd52fb9 ("memcg: use new logic for page stat accounting"), we can use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg has a feature to move a page from a cgroup to another one and may have race between "move" and "page stat accounting". So in order to avoid the race we have designed a new lock: mem_cgroup_begin_update_page_stat() modify page information -->(a) mem_cgroup_update_page_stat() -->(b) mem_cgroup_end_update_page_stat() It requires both (a) and (b)(writeback pages accounting) to be pretected in mem_cgroup_{begin/end}_update_page_stat(). It's full no-op for !CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu read lock in the most cases (no task is moving), and spin_lock_irqsave on top in the slow path. There're two writeback interfaces to modify: test_{clear/set}_page_writeback(). And the lock order is: --> memcg->move_lock --> mapping->tree_lock Signed-off-by: Sha Zhengju Acked-by: Michal Hocko Reviewed-by: Greg Thelen Cc: Fengguang Wu Cc: KAMEZAWA Hiroyuki Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 1 + mm/memcontrol.c | 30 +++++++++++++++++++++++------- mm/page-writeback.c | 15 +++++++++++++++ 3 files changed, 39 insertions(+), 7 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index f952a07ce7a9..60e95872da29 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -42,6 +42,7 @@ enum mem_cgroup_stat_index { MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ MEM_CGROUP_STAT_RSS_HUGE, /* # of pages charged as anon huge */ MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ + MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */ MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ MEM_CGROUP_STAT_NSTATS, }; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0093bc36c5fc..d5ff3ce13029 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -89,6 +89,7 @@ static const char * const mem_cgroup_stat_names[] = { "rss", "rss_huge", "mapped_file", + "writeback", "swap", }; @@ -3654,6 +3655,20 @@ void mem_cgroup_split_huge_fixup(struct page *head) } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +static inline +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, + struct mem_cgroup *to, + unsigned int nr_pages, + enum mem_cgroup_stat_index idx) +{ + /* Update stat data for mem_cgroup */ + preempt_disable(); + WARN_ON_ONCE(from->stat->count[idx] < nr_pages); + __this_cpu_add(from->stat->count[idx], -nr_pages); + __this_cpu_add(to->stat->count[idx], nr_pages); + preempt_enable(); +} + /** * mem_cgroup_move_account - move account of the page * @page: the page @@ -3699,13 +3714,14 @@ static int mem_cgroup_move_account(struct page *page, move_lock_mem_cgroup(from, &flags); - if (!anon && page_mapped(page)) { - /* Update mapped_file data for mem_cgroup */ - preempt_disable(); - __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); - __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); - preempt_enable(); - } + if (!anon && page_mapped(page)) + mem_cgroup_move_account_page_stat(from, to, nr_pages, + MEM_CGROUP_STAT_FILE_MAPPED); + + if (PageWriteback(page)) + mem_cgroup_move_account_page_stat(from, to, nr_pages, + MEM_CGROUP_STAT_WRITEBACK); + mem_cgroup_charge_statistics(from, page, anon, -nr_pages); /* caller should have done css_get */ diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 6c7b0187be8e..f5236f804aa6 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2143,11 +2143,17 @@ EXPORT_SYMBOL(account_page_dirtied); /* * Helper function for set_page_writeback family. + * + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock + * while calling this function. + * See test_set_page_writeback for example. + * * NOTE: Unlike account_page_dirtied this does not rely on being atomic * wrt interrupts. */ void account_page_writeback(struct page *page) { + mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); inc_zone_page_state(page, NR_WRITEBACK); } EXPORT_SYMBOL(account_page_writeback); @@ -2364,7 +2370,10 @@ int test_clear_page_writeback(struct page *page) { struct address_space *mapping = page_mapping(page); int ret; + bool locked; + unsigned long memcg_flags; + mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags); if (mapping) { struct backing_dev_info *bdi = mapping->backing_dev_info; unsigned long flags; @@ -2385,9 +2394,11 @@ int test_clear_page_writeback(struct page *page) ret = TestClearPageWriteback(page); } if (ret) { + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); dec_zone_page_state(page, NR_WRITEBACK); inc_zone_page_state(page, NR_WRITTEN); } + mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags); return ret; } @@ -2395,7 +2406,10 @@ int test_set_page_writeback(struct page *page) { struct address_space *mapping = page_mapping(page); int ret; + bool locked; + unsigned long memcg_flags; + mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags); if (mapping) { struct backing_dev_info *bdi = mapping->backing_dev_info; unsigned long flags; @@ -2422,6 +2436,7 @@ int test_set_page_writeback(struct page *page) } if (!ret) account_page_writeback(page); + mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags); return ret; } -- cgit v1.2.3 From 5fbc461636c32efdb9d5216d491d37a40d54535b Mon Sep 17 00:00:00 2001 From: Chris Metcalf Date: Thu, 12 Sep 2013 15:13:55 -0700 Subject: mm: make lru_add_drain_all() selective make lru_add_drain_all() only selectively interrupt the cpus that have per-cpu free pages that can be drained. This is important in nohz mode where calling mlockall(), for example, otherwise will interrupt every core unnecessarily. This is important on workloads where nohz cores are handling 10 Gb traffic in userspace. Those CPUs do not enter the kernel and place pages into LRU pagevecs and they really, really don't want to be interrupted, or they drop packets on the floor. Signed-off-by: Chris Metcalf Reviewed-by: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 2 +- mm/swap.c | 44 +++++++++++++++++++++++++++++++++++++++----- 2 files changed, 40 insertions(+), 6 deletions(-) (limited to 'include') diff --git a/include/linux/swap.h b/include/linux/swap.h index c03c139219c9..46ba0c6c219f 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -280,7 +280,7 @@ extern void activate_page(struct page *); extern void mark_page_accessed(struct page *); extern void lru_add_drain(void); extern void lru_add_drain_cpu(int cpu); -extern int lru_add_drain_all(void); +extern void lru_add_drain_all(void); extern void rotate_reclaimable_page(struct page *page); extern void deactivate_page(struct page *page); extern void swap_setup(void); diff --git a/mm/swap.c b/mm/swap.c index c899502d3e36..759c3caf44bd 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -432,6 +432,11 @@ static void activate_page_drain(int cpu) pagevec_lru_move_fn(pvec, __activate_page, NULL); } +static bool need_activate_page_drain(int cpu) +{ + return pagevec_count(&per_cpu(activate_page_pvecs, cpu)) != 0; +} + void activate_page(struct page *page) { if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { @@ -449,6 +454,11 @@ static inline void activate_page_drain(int cpu) { } +static bool need_activate_page_drain(int cpu) +{ + return false; +} + void activate_page(struct page *page) { struct zone *zone = page_zone(page); @@ -701,12 +711,36 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy) lru_add_drain(); } -/* - * Returns 0 for success - */ -int lru_add_drain_all(void) +static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work); + +void lru_add_drain_all(void) { - return schedule_on_each_cpu(lru_add_drain_per_cpu); + static DEFINE_MUTEX(lock); + static struct cpumask has_work; + int cpu; + + mutex_lock(&lock); + get_online_cpus(); + cpumask_clear(&has_work); + + for_each_online_cpu(cpu) { + struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); + + if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) || + pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) || + pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) || + need_activate_page_drain(cpu)) { + INIT_WORK(work, lru_add_drain_per_cpu); + schedule_work_on(cpu, work); + cpumask_set_cpu(cpu, &has_work); + } + } + + for_each_cpu(cpu, &has_work) + flush_work(&per_cpu(lru_add_drain_work, cpu)); + + put_online_cpus(); + mutex_unlock(&lock); } /* -- cgit v1.2.3 From 7caef26767c1727d7abfbbbfbe8b2bb473430d48 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Thu, 12 Sep 2013 15:13:56 -0700 Subject: truncate: drop 'oldsize' truncate_pagecache() parameter truncate_pagecache() doesn't care about old size since commit cedabed49b39 ("vfs: Fix vmtruncate() regression"). Let's drop it. Signed-off-by: Kirill A. Shutemov Cc: OGAWA Hirofumi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/adfs/inode.c | 2 +- fs/affs/file.c | 2 +- fs/bfs/file.c | 2 +- fs/btrfs/free-space-cache.c | 4 +--- fs/btrfs/inode.c | 2 +- fs/cifs/inode.c | 5 +---- fs/exofs/inode.c | 2 +- fs/ext2/inode.c | 2 +- fs/ext4/inode.c | 3 +-- fs/fat/inode.c | 2 +- fs/fuse/dir.c | 2 +- fs/fuse/inode.c | 2 +- fs/gfs2/bmap.c | 4 ++-- fs/hfs/inode.c | 2 +- fs/hfsplus/inode.c | 2 +- fs/hpfs/file.c | 2 +- fs/jfs/inode.c | 2 +- fs/minix/inode.c | 2 +- fs/nfs/inode.c | 4 +--- fs/nilfs2/inode.c | 2 +- fs/ntfs/file.c | 2 +- fs/omfs/file.c | 2 +- fs/sysv/itree.c | 2 +- fs/udf/inode.c | 2 +- fs/ufs/inode.c | 2 +- fs/xfs/xfs_aops.c | 4 ++-- include/linux/mm.h | 2 +- mm/truncate.c | 9 ++------- 28 files changed, 31 insertions(+), 44 deletions(-) (limited to 'include') diff --git a/fs/adfs/inode.c b/fs/adfs/inode.c index 5f95d1ed9c6d..b9acadafa4a1 100644 --- a/fs/adfs/inode.c +++ b/fs/adfs/inode.c @@ -50,7 +50,7 @@ static void adfs_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); } static int adfs_write_begin(struct file *file, struct address_space *mapping, diff --git a/fs/affs/file.c b/fs/affs/file.c index 776e3935a758..8669b6ecddee 100644 --- a/fs/affs/file.c +++ b/fs/affs/file.c @@ -406,7 +406,7 @@ static void affs_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); affs_truncate(inode); } } diff --git a/fs/bfs/file.c b/fs/bfs/file.c index ad3ea1497cc3..ae2892218335 100644 --- a/fs/bfs/file.c +++ b/fs/bfs/file.c @@ -166,7 +166,7 @@ static void bfs_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); } static int bfs_write_begin(struct file *file, struct address_space *mapping, diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index b21a3cd667d8..256d9597f0cf 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -221,12 +221,10 @@ int btrfs_truncate_free_space_cache(struct btrfs_root *root, struct btrfs_path *path, struct inode *inode) { - loff_t oldsize; int ret = 0; - oldsize = i_size_read(inode); btrfs_i_size_write(inode, 0); - truncate_pagecache(inode, oldsize, 0); + truncate_pagecache(inode, 0); /* * We don't need an orphan item because truncating the free space cache diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 7bdc83d04d54..82fdd7059d10 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4409,7 +4409,7 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr) inode->i_ctime = inode->i_mtime = current_fs_time(inode->i_sb); if (newsize > oldsize) { - truncate_pagecache(inode, oldsize, newsize); + truncate_pagecache(inode, newsize); ret = btrfs_cont_expand(inode, oldsize, newsize); if (ret) return ret; diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c index 449b6cf09b09..2a92c5c6ecfd 100644 --- a/fs/cifs/inode.c +++ b/fs/cifs/inode.c @@ -1852,14 +1852,11 @@ static int cifs_truncate_page(struct address_space *mapping, loff_t from) static void cifs_setsize(struct inode *inode, loff_t offset) { - loff_t oldsize; - spin_lock(&inode->i_lock); - oldsize = inode->i_size; i_size_write(inode, offset); spin_unlock(&inode->i_lock); - truncate_pagecache(inode, oldsize, offset); + truncate_pagecache(inode, offset); } static int diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c index 2ec8eb1ab269..a52a5d23c30b 100644 --- a/fs/exofs/inode.c +++ b/fs/exofs/inode.c @@ -861,7 +861,7 @@ static int exofs_writepage(struct page *page, struct writeback_control *wbc) static void _write_failed(struct inode *inode, loff_t to) { if (to > inode->i_size) - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); } int exofs_write_begin(struct file *file, struct address_space *mapping, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 0a87bb10998d..c260de6d7b6d 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -58,7 +58,7 @@ static void ext2_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); ext2_truncate_blocks(inode, inode->i_size); } } diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c79fd7dabe79..0d424d7ac02b 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4587,7 +4587,6 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) if (attr->ia_valid & ATTR_SIZE && attr->ia_size != inode->i_size) { handle_t *handle; - loff_t oldsize = inode->i_size; if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); @@ -4650,7 +4649,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) * Truncate pagecache after we've waited for commit * in data=journal mode to make pages freeable. */ - truncate_pagecache(inode, oldsize, inode->i_size); + truncate_pagecache(inode, inode->i_size); } /* * We want to call ext4_truncate() even if attr->ia_size == diff --git a/fs/fat/inode.c b/fs/fat/inode.c index 11b51bb55b42..0062da21dd8b 100644 --- a/fs/fat/inode.c +++ b/fs/fat/inode.c @@ -147,7 +147,7 @@ static void fat_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); fat_truncate_blocks(inode, inode->i_size); } } diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 3ac91086f41f..62b43b577bfc 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1678,7 +1678,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock. */ if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) { - truncate_pagecache(inode, oldsize, outarg.attr.size); + truncate_pagecache(inode, outarg.attr.size); invalidate_inode_pages2(inode->i_mapping); } diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 84434594e80e..a8ce6dab60a0 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -218,7 +218,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, bool inval = false; if (oldsize != attr->size) { - truncate_pagecache(inode, oldsize, attr->size); + truncate_pagecache(inode, attr->size); inval = true; } else if (fc->auto_inval_data) { struct timespec new_mtime = { diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c index 5e2f56fccf6b..62a65fc448dc 100644 --- a/fs/gfs2/bmap.c +++ b/fs/gfs2/bmap.c @@ -1016,7 +1016,7 @@ static int gfs2_journaled_truncate(struct inode *inode, u64 oldsize, u64 newsize chunk = oldsize - newsize; if (chunk > max_chunk) chunk = max_chunk; - truncate_pagecache(inode, oldsize, oldsize - chunk); + truncate_pagecache(inode, oldsize - chunk); oldsize -= chunk; gfs2_trans_end(sdp); error = gfs2_trans_begin(sdp, RES_DINODE, GFS2_JTRUNC_REVOKES); @@ -1067,7 +1067,7 @@ static int trunc_start(struct inode *inode, u64 oldsize, u64 newsize) if (journaled) error = gfs2_journaled_truncate(inode, oldsize, newsize); else - truncate_pagecache(inode, oldsize, newsize); + truncate_pagecache(inode, newsize); if (error) { brelse(dibh); diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c index f9299d8a64e3..380ab31b5e0f 100644 --- a/fs/hfs/inode.c +++ b/fs/hfs/inode.c @@ -41,7 +41,7 @@ static void hfs_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); hfs_file_truncate(inode); } } diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c index 4d2edaea891c..37213d075f3c 100644 --- a/fs/hfsplus/inode.c +++ b/fs/hfsplus/inode.c @@ -36,7 +36,7 @@ static void hfsplus_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); hfsplus_file_truncate(inode); } } diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c index 4e9dabcf1f4c..67c1a61e0955 100644 --- a/fs/hpfs/file.c +++ b/fs/hpfs/file.c @@ -138,7 +138,7 @@ static void hpfs_write_failed(struct address_space *mapping, loff_t to) hpfs_lock(inode->i_sb); if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); hpfs_truncate(inode); } diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c index 730f24e282a6..f4aab719add5 100644 --- a/fs/jfs/inode.c +++ b/fs/jfs/inode.c @@ -306,7 +306,7 @@ static void jfs_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); jfs_truncate(inode); } } diff --git a/fs/minix/inode.c b/fs/minix/inode.c index df122496f328..0332109162a5 100644 --- a/fs/minix/inode.c +++ b/fs/minix/inode.c @@ -400,7 +400,7 @@ static void minix_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); minix_truncate(inode); } } diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index 87e797640828..eda8879171c4 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -541,7 +541,6 @@ EXPORT_SYMBOL_GPL(nfs_setattr); */ static int nfs_vmtruncate(struct inode * inode, loff_t offset) { - loff_t oldsize; int err; err = inode_newsize_ok(inode, offset); @@ -549,11 +548,10 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset) goto out; spin_lock(&inode->i_lock); - oldsize = inode->i_size; i_size_write(inode, offset); spin_unlock(&inode->i_lock); - truncate_pagecache(inode, oldsize, offset); + truncate_pagecache(inode, offset); out: return err; } diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c index b1a5277cfd18..7e350c562e0e 100644 --- a/fs/nilfs2/inode.c +++ b/fs/nilfs2/inode.c @@ -254,7 +254,7 @@ void nilfs_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); nilfs_truncate(inode); } } diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c index c5670b8d198c..ea4ba9daeb47 100644 --- a/fs/ntfs/file.c +++ b/fs/ntfs/file.c @@ -1768,7 +1768,7 @@ static void ntfs_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); ntfs_truncate_vfs(inode); } } diff --git a/fs/omfs/file.c b/fs/omfs/file.c index e0d9b3e722bd..54d57d6ba68d 100644 --- a/fs/omfs/file.c +++ b/fs/omfs/file.c @@ -311,7 +311,7 @@ static void omfs_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); omfs_truncate(inode); } } diff --git a/fs/sysv/itree.c b/fs/sysv/itree.c index c1a591a4725b..66bc316927e8 100644 --- a/fs/sysv/itree.c +++ b/fs/sysv/itree.c @@ -469,7 +469,7 @@ static void sysv_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) { - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); sysv_truncate(inode); } } diff --git a/fs/udf/inode.c b/fs/udf/inode.c index b6d15d349810..062b7925bca0 100644 --- a/fs/udf/inode.c +++ b/fs/udf/inode.c @@ -172,7 +172,7 @@ static void udf_write_failed(struct address_space *mapping, loff_t to) loff_t isize = inode->i_size; if (to > isize) { - truncate_pagecache(inode, to, isize); + truncate_pagecache(inode, isize); if (iinfo->i_alloc_type != ICBTAG_FLAG_AD_IN_ICB) { down_write(&iinfo->i_data_sem); udf_clear_extent_cache(inode); diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c index ff24e4449ece..c8ca96086784 100644 --- a/fs/ufs/inode.c +++ b/fs/ufs/inode.c @@ -531,7 +531,7 @@ static void ufs_write_failed(struct address_space *mapping, loff_t to) struct inode *inode = mapping->host; if (to > inode->i_size) - truncate_pagecache(inode, to, inode->i_size); + truncate_pagecache(inode, inode->i_size); } static int ufs_write_begin(struct file *file, struct address_space *mapping, diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 977da0ec6604..e51e581454e9 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1582,7 +1582,7 @@ xfs_vm_write_begin( unlock_page(page); if (pos + len > i_size_read(inode)) - truncate_pagecache(inode, pos + len, i_size_read(inode)); + truncate_pagecache(inode, i_size_read(inode)); page_cache_release(page); page = NULL; @@ -1618,7 +1618,7 @@ xfs_vm_write_end( loff_t to = pos + len; if (to > isize) { - truncate_pagecache(inode, to, isize); + truncate_pagecache(inode, isize); xfs_vm_kill_delalloc_range(inode, isize, to); } } diff --git a/include/linux/mm.h b/include/linux/mm.h index 3d9b503bcd00..928df792c005 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -985,7 +985,7 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping, unmap_mapping_range(mapping, holebegin, holelen, 0); } -extern void truncate_pagecache(struct inode *inode, loff_t old, loff_t new); +extern void truncate_pagecache(struct inode *inode, loff_t new); extern void truncate_setsize(struct inode *inode, loff_t newsize); void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end); int truncate_inode_page(struct address_space *mapping, struct page *page); diff --git a/mm/truncate.c b/mm/truncate.c index e2e8a8a7eb9d..353b683afd6e 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -567,7 +567,6 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages2); /** * truncate_pagecache - unmap and remove pagecache that has been truncated * @inode: inode - * @oldsize: old file size * @newsize: new file size * * inode's new i_size must already be written before truncate_pagecache @@ -580,7 +579,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages2); * situations such as writepage being called for a page that has already * had its underlying blocks deallocated. */ -void truncate_pagecache(struct inode *inode, loff_t oldsize, loff_t newsize) +void truncate_pagecache(struct inode *inode, loff_t newsize) { struct address_space *mapping = inode->i_mapping; loff_t holebegin = round_up(newsize, PAGE_SIZE); @@ -614,12 +613,8 @@ EXPORT_SYMBOL(truncate_pagecache); */ void truncate_setsize(struct inode *inode, loff_t newsize) { - loff_t oldsize; - - oldsize = inode->i_size; i_size_write(inode, newsize); - - truncate_pagecache(inode, oldsize, newsize); + truncate_pagecache(inode, newsize); } EXPORT_SYMBOL(truncate_setsize); -- cgit v1.2.3 From c02925540ca7019465a43c00f8a3c0186ddace2b Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Thu, 12 Sep 2013 15:14:05 -0700 Subject: thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault() to handle fallback path. Let's consolidate code back by introducing VM_FAULT_FALLBACK return code. Signed-off-by: Kirill A. Shutemov Acked-by: Hillf Danton Cc: Andrea Arcangeli Cc: Al Viro Cc: Hugh Dickins Cc: Wu Fengguang Cc: Jan Kara Cc: Mel Gorman Cc: Andi Kleen Cc: Matthew Wilcox Cc: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 3 --- include/linux/mm.h | 3 ++- mm/huge_memory.c | 31 +++++-------------------------- mm/memory.c | 9 ++++++--- 4 files changed, 13 insertions(+), 33 deletions(-) (limited to 'include') diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index b60de92e2edc..3935428c57cf 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -96,9 +96,6 @@ extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, unsigned long addr, unsigned long end); -extern int handle_pte_fault(struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long address, - pte_t *pte, pmd_t *pmd, unsigned int flags); extern int split_huge_page_to_list(struct page *page, struct list_head *list); static inline int split_huge_page(struct page *page) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 928df792c005..8b6e55ee8855 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -877,11 +877,12 @@ static inline int page_mapped(struct page *page) #define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */ #define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */ #define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */ +#define VM_FAULT_FALLBACK 0x0800 /* huge page fault failed, fall back to small */ #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */ #define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \ - VM_FAULT_HWPOISON_LARGE) + VM_FAULT_FALLBACK | VM_FAULT_HWPOISON_LARGE) /* Encode hstate index for a hwpoisoned large page */ #define VM_FAULT_SET_HINDEX(x) ((x) << 12) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6551dd06dd64..243f4cc75777 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -783,10 +783,9 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, { struct page *page; unsigned long haddr = address & HPAGE_PMD_MASK; - pte_t *pte; if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) - goto out; + return VM_FAULT_FALLBACK; if (unlikely(anon_vma_prepare(vma))) return VM_FAULT_OOM; if (unlikely(khugepaged_enter(vma))) @@ -803,7 +802,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, if (unlikely(!zero_page)) { pte_free(mm, pgtable); count_vm_event(THP_FAULT_FALLBACK); - goto out; + return VM_FAULT_FALLBACK; } spin_lock(&mm->page_table_lock); set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd, @@ -819,40 +818,20 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, vma, haddr, numa_node_id(), 0); if (unlikely(!page)) { count_vm_event(THP_FAULT_FALLBACK); - goto out; + return VM_FAULT_FALLBACK; } count_vm_event(THP_FAULT_ALLOC); if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) { put_page(page); - goto out; + return VM_FAULT_FALLBACK; } if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) { mem_cgroup_uncharge_page(page); put_page(page); - goto out; + return VM_FAULT_FALLBACK; } return 0; -out: - /* - * Use __pte_alloc instead of pte_alloc_map, because we can't - * run pte_offset_map on the pmd, if an huge pmd could - * materialize from under us from a different thread. - */ - if (unlikely(pmd_none(*pmd)) && - unlikely(__pte_alloc(mm, vma, pmd, address))) - return VM_FAULT_OOM; - /* if an huge pmd materialized from under us just retry later */ - if (unlikely(pmd_trans_huge(*pmd))) - return 0; - /* - * A regular pmd is established and it can't morph into a huge pmd - * from under us anymore at this point because we hold the mmap_sem - * read mode and khugepaged takes it in write mode. So now it's - * safe to run pte_offset_map(). - */ - pte = pte_offset_map(pmd, address); - return handle_pte_fault(mm, vma, address, pte, pmd, flags); } int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, diff --git a/mm/memory.c b/mm/memory.c index 5ec6f199e685..ca0003947115 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3695,7 +3695,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * but allow concurrent faults), and pte mapped but not yet locked. * We return with mmap_sem still held, but pte unmapped and unlocked. */ -int handle_pte_fault(struct mm_struct *mm, +static int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *pte, pmd_t *pmd, unsigned int flags) { @@ -3774,9 +3774,12 @@ retry: if (!pmd) return VM_FAULT_OOM; if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) { + int ret = VM_FAULT_FALLBACK; if (!vma->vm_ops) - return do_huge_pmd_anonymous_page(mm, vma, address, - pmd, flags); + ret = do_huge_pmd_anonymous_page(mm, vma, address, + pmd, flags); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; } else { pmd_t orig_pmd = *pmd; int ret; -- cgit v1.2.3