From cd38b115d5ad79b0100ac6daa103c4fe2c50a913 Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Mon, 25 Jul 2011 17:12:29 -0700
Subject: mm: page allocator: initialise ZLC for first zone eligible for zone_reclaim

There have been a small number of complaints about significant stalls
while copying large amounts of data on NUMA machines reported on a
distribution bugzilla.  In these cases, zone_reclaim was enabled by
default due to large NUMA distances.  In general, the complaints have not
been about the workload itself unless it was a file server (in which case
the recommendation was to disable zone_reclaim).

The stalls are mostly due to significant amounts of time spent scanning
the preferred zone for pages to free.  After a failure, it might fall
back to another node (as zonelists are often node-ordered rather than
zone-ordered) but stall quickly again when the next allocation attempt
occurs.  In bad cases, each page allocated results in a full scan of the
preferred zone.

Patch 1 checks the preferred zone for recent allocation failure
        which is particularly important if zone_reclaim has failed
        recently.  This avoids rescanning the zone in the near future and
        instead falls back to another node.  This may hurt node locality
        in some cases but a failure to zone_reclaim is more expensive than
        a remote access.

Patch 2 clears the zlc information after direct reclaim.
        Otherwise, zone_reclaim can mark zones full, direct reclaim can
        reclaim enough pages but the zone is still not considered for
        allocation.

This was tested on a 24-thread 2-node x86_64 machine.  The tests were
focused on large amounts of IO.  All tests were bound to the CPUs on
node-0 to avoid disturbances due to processes being scheduled on different
nodes.
The kernels tested are

3.0-rc6-vanilla Vanilla 3.0-rc6
zlcfirst        Patch 1 applied
zlcreconsider   Patches 1+2 applied

FS-Mark
./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
                  fsmark-3.0-rc6                3.0-rc6                3.0-rc6
                         vanilla               zlcfirst          zlcreconsider
Files/s  min           54.90 ( 0.00%)         49.80 (-10.24%)        49.10 (-11.81%)
Files/s  mean         100.11 ( 0.00%)        135.17 (25.94%)        146.93 (31.87%)
Files/s  stddev        57.51 ( 0.00%)        138.97 (58.62%)        158.69 (63.76%)
Files/s  max          361.10 ( 0.00%)        834.40 (56.72%)        802.40 (55.00%)
Overhead min        76704.00 ( 0.00%)      76501.00 ( 0.27%)      77784.00 (-1.39%)
Overhead mean     1485356.51 ( 0.00%)    1035797.83 (43.40%)    1594680.26 (-6.86%)
Overhead stddev   1848122.53 ( 0.00%)     881489.88 (109.66%)   1772354.90 ( 4.27%)
Overhead max      7989060.00 ( 0.00%)    3369118.00 (137.13%)  10135324.00 (-21.18%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds)      501.49    493.91    499.93
Total Elapsed Time (seconds)             2451.57   2257.48   2215.92

MMTests Statistics: vmstat
Page Ins                         46268       63840       66008
Page Outs                     90821596    90671128    88043732
Swap Ins                             0           0           0
Swap Outs                            0           0           0
Direct pages scanned          13091697     8966863     8971790
Kswapd pages scanned                 0     1830011     1831116
Kswapd pages reclaimed               0     1829068     1829930
Direct pages reclaimed        13037777     8956828     8648314
Kswapd efficiency                 100%         99%         99%
Kswapd velocity                  0.000     810.643     826.346
Direct efficiency                  99%         99%         96%
Direct velocity               5340.128    3972.068    4048.788
Percentage direct scans           100%         83%         83%
Page writes by reclaim               0           3           0
Slabs scanned                   796672      720640      720256
Direct inode steals            7422667     7160012     7088638
Kswapd inode steals                  0     1736840     2021238

Test completes far faster with a large increase in the number of files
created per second.  Standard deviation is high as a small number of
iterations were much higher than the mean.  The number of pages scanned by
zone_reclaim is reduced and kswapd is used for more work.
LARGE DD
                         3.0-rc6           3.0-rc6           3.0-rc6
                         vanilla          zlcfirst     zlcreconsider
download tar            59 ( 0.00%)       59 ( 0.00%)       55 ( 7.27%)
dd source files        527 ( 0.00%)      296 (78.04%)      320 (64.69%)
delete source           36 ( 0.00%)       19 (89.47%)       20 (80.00%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds)      125.03    118.98    122.01
Total Elapsed Time (seconds)              624.56    375.02    398.06

MMTests Statistics: vmstat
Page Ins                       3594216      439368      407032
Page Outs                     23380832    23380488    23377444
Swap Ins                             0           0           0
Swap Outs                            0         436         287
Direct pages scanned          17482342    69315973    82864918
Kswapd pages scanned                 0      519123      575425
Kswapd pages reclaimed               0      466501      522487
Direct pages reclaimed         5858054     2732949     2712547
Kswapd efficiency                 100%         89%         90%
Kswapd velocity                  0.000    1384.254    1445.574
Direct efficiency                  33%          3%          3%
Direct velocity              27991.453  184832.737  208171.929
Percentage direct scans           100%         99%         99%
Page writes by reclaim               0        5082       13917
Slabs scanned                    17280       29952       35328
Direct inode steals             115257     1431122      332201
Kswapd inode steals                  0           0      979532

This test downloads a large tarfile and copies it with dd a number of
times - similar to the most recent bug report I've dealt with.  Time to
completion is reduced.  The number of pages scanned directly is still
disturbingly high with a low efficiency but this is likely due to the
number of dirty pages encountered.  The figures could probably be improved
with more work around how kswapd is used and how dirty pages are handled
but that is separate work and this result is significant on its own.
Streaming Mapped Writer

MMTests Statistics: duration
User/Sys Time Running Test (seconds)      124.47    111.67    112.64
Total Elapsed Time (seconds)             2138.14   1816.30   1867.56

MMTests Statistics: vmstat
Page Ins                         90760       89124       89516
Page Outs                    121028340   120199524   120736696
Swap Ins                             0          86          55
Swap Outs                            0           0           0
Direct pages scanned         114989363    96461439    96330619
Kswapd pages scanned          56430948    56965763    57075875
Kswapd pages reclaimed        27743219    27752044    27766606
Direct pages reclaimed           49777       46884       36655
Kswapd efficiency                  49%         48%         48%
Kswapd velocity              26392.541   31363.631   30561.736
Direct efficiency                   0%          0%          0%
Direct velocity              53780.091   53108.759   51581.004
Percentage direct scans            67%         62%         62%
Page writes by reclaim             385         122        1513
Slabs scanned                    43008       39040       42112
Direct inode steals                  0          10           8
Kswapd inode steals                733         534         477

This test just creates a large file mapping and writes to it linearly.
Time to completion is again reduced.

The gains are mostly down to two things.  In many cases, there is less
scanning as zone_reclaim simply gives up faster due to recent failures.
The second reason is that memory is used more efficiently.  Instead of
scanning the preferred zone every time, the allocator falls back to
another zone and uses it instead, improving overall memory utilisation.

This patch: initialise ZLC for first zone eligible for zone_reclaim.

The zonelist cache (ZLC) is used among other things to record if
zone_reclaim() failed for a particular zone recently.  The intention is to
avoid the high cost of scanning extremely long zonelists or scanning
within the zone uselessly.

Currently the zonelist cache is set up only after the first zone has been
considered and zone_reclaim() has been called.  The objective was to avoid
a costly setup but zone_reclaim is itself quite expensive.  If it is
failing regularly, such as when the first eligible zone has mostly mapped
pages, the cost in scanning and allocation stalls is far higher than the
ZLC initialisation step.
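
The effect of consulting the cache before the first zone can be sketched
in userspace.  This is only an illustrative model, not the kernel code:
the zlc_model names, the fixed zone count and the millisecond clock are
all assumptions made for the example, and the one second expiry is
modelled loosely on the "up to a second" wording used later in this
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_MAX_ZONES   8     /* hypothetical size, not the kernel's */
#define MODEL_ZAP_PERIOD  1000  /* assumed: forget "full" marks after ~1s (ms) */

struct zlc_model {
	bool full[MODEL_MAX_ZONES];  /* zones whose reclaim failed recently */
	long last_zap_ms;            /* when the marks were last cleared */
};

/* Expire stale marks; called once at the start of every scan. */
static void zlc_model_setup(struct zlc_model *zlc, long now_ms)
{
	if (now_ms - zlc->last_zap_ms >= MODEL_ZAP_PERIOD) {
		memset(zlc->full, 0, sizeof(zlc->full));
		zlc->last_zap_ms = now_ms;
	}
}

static bool zlc_model_worth_trying(const struct zlc_model *zlc, int idx)
{
	assert(idx >= 0 && idx < MODEL_MAX_ZONES);
	return !zlc->full[idx];
}

static void zlc_model_mark_full(struct zlc_model *zlc, int idx)
{
	assert(idx >= 0 && idx < MODEL_MAX_ZONES);
	zlc->full[idx] = true;
}

/*
 * Scan zones in order.  reclaim_ok[i] says whether the (expensive)
 * reclaim attempt on zone i would succeed.  Returns the zone used or -1.
 * *attempts counts the expensive reclaim attempts that were made.
 */
static int model_scan(struct zlc_model *zlc, const bool *reclaim_ok,
		      int nr_zones, long now_ms, int *attempts)
{
	int i;

	/* The cache is ready before the first zone is considered. */
	zlc_model_setup(zlc, now_ms);

	for (i = 0; i < nr_zones; i++) {
		if (!zlc_model_worth_trying(zlc, i))
			continue;	/* failed recently: skip the rescan */
		(*attempts)++;
		if (reclaim_ok[i])
			return i;
		zlc_model_mark_full(zlc, i);
	}
	return -1;
}
```

In this model a second scan shortly after a failure goes straight to the
fallback zone with one reclaim attempt instead of two, and the preferred
zone is only retried once the marks have expired.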
This patch initialises ZLC before the first eligible zone calls
zone_reclaim().  Once initialised, it is checked whether the zone failed
zone_reclaim recently.  If it has, the zone is skipped.  As the first zone
is now being checked, additional care has to be taken about zones marked
full.  A zone can be marked "full" because it does not have enough
unmapped pages for zone_reclaim but this is excessive as direct reclaim or
kswapd may succeed where zone_reclaim fails.  Only mark zones "full" after
zone_reclaim fails if it failed to reclaim enough pages after scanning.

Signed-off-by: Mel Gorman
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/page_alloc.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

(limited to 'mm/page_alloc.c')

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9119faae6e6a..830a465958de 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1664,7 +1664,7 @@ zonelist_scan:
 				continue;
 		if ((alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
-				goto try_next_zone;
+				continue;
 
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
@@ -1676,17 +1676,36 @@ zonelist_scan:
 				    classzone_idx, alloc_flags))
 				goto try_this_zone;
 
+			if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
+				/*
+				 * we do zlc_setup if there are multiple nodes
+				 * and before considering the first zone allowed
+				 * by the cpuset.
+				 */
+				allowednodes = zlc_setup(zonelist, alloc_flags);
+				zlc_active = 1;
+				did_zlc_setup = 1;
+			}
+
 			if (zone_reclaim_mode == 0)
 				goto this_zone_full;
 
+			/*
+			 * As we may have just activated ZLC, check if the first
+			 * eligible zone has failed zone_reclaim recently.
+			 */
+			if (NUMA_BUILD && zlc_active &&
+				!zlc_zone_worth_trying(zonelist, z, allowednodes))
+				continue;
+
 			ret = zone_reclaim(zone, gfp_mask, order);
 			switch (ret) {
 			case ZONE_RECLAIM_NOSCAN:
 				/* did not scan */
-				goto try_next_zone;
+				continue;
 			case ZONE_RECLAIM_FULL:
 				/* scanned but unreclaimable */
-				goto this_zone_full;
+				continue;
 			default:
 				/* did we reclaim enough */
 				if (!zone_watermark_ok(zone, order, mark,
@@ -1703,16 +1722,6 @@ try_this_zone:
 this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);
-try_next_zone:
-		if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
-			/*
-			 * we do zlc_setup after the first zone is tried but only
-			 * if there are multiple nodes make it worthwhile
-			 */
-			allowednodes = zlc_setup(zonelist, alloc_flags);
-			zlc_active = 1;
-			did_zlc_setup = 1;
-		}
 	}
 
 	if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
-- 
cgit v1.2.3

From 76d3fbf8fbf6cc78ceb63549e0e0c5bc8a88f838 Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Mon, 25 Jul 2011 17:12:30 -0700
Subject: mm: page allocator: reconsider zones for allocation after direct reclaim

With zone_reclaim_mode enabled, it's possible for zones to be considered
full in the zonelist_cache so they are skipped in the future.  If the
process enters direct reclaim, the ZLC may still consider zones to be full
even after reclaiming pages.  Reconsider all zones for allocation if
direct reclaim returns successfully.
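
The problem can be reproduced in a small userspace model.  This is a
hedged sketch only: struct full_cache, its one-bit-per-zone word and the
free_pages array are hypothetical stand-ins for the zonelist cache and
the zones, and the "reclaim" step simply adds pages to zone 0.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical cache: one "full" bit per zone index. */
struct full_cache {
	unsigned long fullzones;
};

static void cache_mark_full(struct full_cache *c, unsigned int zone)
{
	assert(zone < sizeof(c->fullzones) * CHAR_BIT);
	c->fullzones |= 1UL << zone;
}

static bool cache_is_full(const struct full_cache *c, unsigned int zone)
{
	return (c->fullzones & (1UL << zone)) != 0;
}

static void cache_clear_all(struct full_cache *c)
{
	c->fullzones = 0;
}

/* Use the first zone that is not cached as full and has free pages. */
static int cache_alloc(const struct full_cache *c, const int *free_pages,
		       unsigned int nr_zones)
{
	unsigned int i;

	for (i = 0; i < nr_zones; i++) {
		if (cache_is_full(c, i))
			continue;
		if (free_pages[i] > 0)
			return (int)i;
	}
	return -1;
}

/*
 * Pretend direct reclaim freed pages in zone 0, then retry.  With
 * clear_after_reclaim set, the stale "full" bits are dropped first.
 */
static int alloc_after_reclaim(struct full_cache *c, int *free_pages,
			       unsigned int nr_zones, bool clear_after_reclaim)
{
	free_pages[0] += 32;
	if (clear_after_reclaim)
		cache_clear_all(c);
	return cache_alloc(c, free_pages, nr_zones);
}
```

Without the clear, the retry fails even though zone 0 now has free pages;
with it, zone 0 is reconsidered and used.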

Signed-off-by: Mel Gorman
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/page_alloc.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

(limited to 'mm/page_alloc.c')

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 830a465958de..094472377d81 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1616,6 +1616,21 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 	set_bit(i, zlc->fullzones);
 }
 
+/*
+ * clear all zones full, called after direct reclaim makes progress so that
+ * a zone that was recently full is not skipped over for up to a second
+ */
+static void zlc_clear_zones_full(struct zonelist *zonelist)
+{
+	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
+
+	zlc = zonelist->zlcache_ptr;
+	if (!zlc)
+		return;
+
+	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+}
+
 #else	/* CONFIG_NUMA */
 
 static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
@@ -1632,6 +1647,10 @@ static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
 static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 {
 }
+
+static void zlc_clear_zones_full(struct zonelist *zonelist)
+{
+}
 #endif	/* CONFIG_NUMA */
 
 /*
@@ -1963,6 +1982,10 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!(*did_some_progress)))
 		return NULL;
 
+	/* After successful reclaim, reconsider all zones for allocation */
+	if (NUMA_BUILD)
+		zlc_clear_zones_full(zonelist);
+
 retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
-- 
cgit v1.2.3

From 7f5ddcc8d3eaccd5e169fda738530f937509645e Mon Sep 17 00:00:00 2001
From: Akinobu Mita
Date: Tue, 26 Jul 2011 16:09:02 -0700
Subject: fault-injection: use debugfs_remove_recursive

Use debugfs_remove_recursive() to simplify initialization and
deinitialization of fault injection debugfs files.

Signed-off-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/fault-inject.h |  18 +------
 lib/fault-inject.c           | 115 ++++++++++---------------------------------
 mm/failslab.c                |   2 +-
 mm/page_alloc.c              |   2 +-
 4 files changed, 30 insertions(+), 107 deletions(-)

(limited to 'mm/page_alloc.c')

diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h
index 7b72328cc8fe..a842db638380 100644
--- a/include/linux/fault-inject.h
+++ b/include/linux/fault-inject.h
@@ -27,23 +27,7 @@ struct fault_attr {
 	unsigned long count;
 
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
-
-	struct {
-		struct dentry *dir;
-
-		struct dentry *probability_file;
-		struct dentry *interval_file;
-		struct dentry *times_file;
-		struct dentry *space_file;
-		struct dentry *verbose_file;
-		struct dentry *task_filter_file;
-		struct dentry *stacktrace_depth_file;
-		struct dentry *require_start_file;
-		struct dentry *require_end_file;
-		struct dentry *reject_start_file;
-		struct dentry *reject_end_file;
-	} dentries;
-
+	struct dentry *dir;
 #endif
 };
 
diff --git a/lib/fault-inject.c b/lib/fault-inject.c
index 882fd3b6a6ad..2577b121c7c1 100644
--- a/lib/fault-inject.c
+++ b/lib/fault-inject.c
@@ -199,48 +199,7 @@ static struct dentry *debugfs_create_atomic_t(const char *name, mode_t mode,
 
 void cleanup_fault_attr_dentries(struct fault_attr *attr)
 {
-	debugfs_remove(attr->dentries.probability_file);
-	attr->dentries.probability_file = NULL;
-
-	debugfs_remove(attr->dentries.interval_file);
-	attr->dentries.interval_file = NULL;
-
-	debugfs_remove(attr->dentries.times_file);
-	attr->dentries.times_file = NULL;
-
-	debugfs_remove(attr->dentries.space_file);
-	attr->dentries.space_file = NULL;
-
-	debugfs_remove(attr->dentries.verbose_file);
-	attr->dentries.verbose_file = NULL;
-
-	debugfs_remove(attr->dentries.task_filter_file);
-	attr->dentries.task_filter_file = NULL;
-
-#ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER
-
-	debugfs_remove(attr->dentries.stacktrace_depth_file);
-	attr->dentries.stacktrace_depth_file = NULL;
-
-	debugfs_remove(attr->dentries.require_start_file);
-	attr->dentries.require_start_file = NULL;
-
-	debugfs_remove(attr->dentries.require_end_file);
-	attr->dentries.require_end_file = NULL;
-
-	debugfs_remove(attr->dentries.reject_start_file);
-	attr->dentries.reject_start_file = NULL;
-
-	debugfs_remove(attr->dentries.reject_end_file);
-	attr->dentries.reject_end_file = NULL;
-
-#endif /* CONFIG_FAULT_INJECTION_STACKTRACE_FILTER */
-
-	if (attr->dentries.dir)
-		WARN_ON(!simple_empty(attr->dentries.dir));
-
-	debugfs_remove(attr->dentries.dir);
-	attr->dentries.dir = NULL;
+	debugfs_remove_recursive(attr->dir);
 }
 
 int init_fault_attr_dentries(struct fault_attr *attr, const char *name)
@@ -248,66 +207,46 @@ int init_fault_attr_dentries(struct fault_attr *attr, const char *name)
 	mode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
 	struct dentry *dir;
 
-	memset(&attr->dentries, 0, sizeof(attr->dentries));
-
 	dir = debugfs_create_dir(name, NULL);
 	if (!dir)
-		goto fail;
-	attr->dentries.dir = dir;
-
-	attr->dentries.probability_file =
-		debugfs_create_ul("probability", mode, dir, &attr->probability);
+		return -ENOMEM;
 
-	attr->dentries.interval_file =
-		debugfs_create_ul("interval", mode, dir, &attr->interval);
+	attr->dir = dir;
 
-	attr->dentries.times_file =
-		debugfs_create_atomic_t("times", mode, dir, &attr->times);
-
-	attr->dentries.space_file =
-		debugfs_create_atomic_t("space", mode, dir, &attr->space);
-
-	attr->dentries.verbose_file =
-		debugfs_create_ul("verbose", mode, dir, &attr->verbose);
-
-	attr->dentries.task_filter_file = debugfs_create_bool("task-filter",
-						mode, dir, &attr->task_filter);
-
-	if (!attr->dentries.probability_file || !attr->dentries.interval_file ||
-	    !attr->dentries.times_file || !attr->dentries.space_file ||
-	    !attr->dentries.verbose_file || !attr->dentries.task_filter_file)
+	if (!debugfs_create_ul("probability", mode, dir, &attr->probability))
+		goto fail;
+	if (!debugfs_create_ul("interval", mode, dir, &attr->interval))
+		goto fail;
+	if (!debugfs_create_atomic_t("times", mode, dir, &attr->times))
+		goto fail;
+	if (!debugfs_create_atomic_t("space", mode, dir, &attr->space))
+		goto fail;
+	if (!debugfs_create_ul("verbose", mode, dir, &attr->verbose))
+		goto fail;
+	if (!debugfs_create_bool("task-filter", mode, dir, &attr->task_filter))
 		goto fail;
 
 #ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER
 
-	attr->dentries.stacktrace_depth_file =
-		debugfs_create_stacktrace_depth(
-			"stacktrace-depth", mode, dir, &attr->stacktrace_depth);
-
-	attr->dentries.require_start_file =
-		debugfs_create_ul("require-start", mode, dir, &attr->require_start);
-
-	attr->dentries.require_end_file =
-		debugfs_create_ul("require-end", mode, dir, &attr->require_end);
-
-	attr->dentries.reject_start_file =
-		debugfs_create_ul("reject-start", mode, dir, &attr->reject_start);
-
-	attr->dentries.reject_end_file =
-		debugfs_create_ul("reject-end", mode, dir, &attr->reject_end);
-
-	if (!attr->dentries.stacktrace_depth_file ||
-	    !attr->dentries.require_start_file ||
-	    !attr->dentries.require_end_file ||
-	    !attr->dentries.reject_start_file ||
-	    !attr->dentries.reject_end_file)
+	if (!debugfs_create_stacktrace_depth("stacktrace-depth", mode, dir,
+				&attr->stacktrace_depth))
+		goto fail;
+	if (!debugfs_create_ul("require-start", mode, dir,
+				&attr->require_start))
+		goto fail;
+	if (!debugfs_create_ul("require-end", mode, dir, &attr->require_end))
+		goto fail;
+	if (!debugfs_create_ul("reject-start", mode, dir, &attr->reject_start))
+		goto fail;
+	if (!debugfs_create_ul("reject-end", mode, dir, &attr->reject_end))
 		goto fail;
 
 #endif /* CONFIG_FAULT_INJECTION_STACKTRACE_FILTER */
 
 	return 0;
 fail:
-	cleanup_fault_attr_dentries(attr);
+	debugfs_remove_recursive(attr->dir);
+
 	return -ENOMEM;
 }
 
diff --git a/mm/failslab.c b/mm/failslab.c
index c5f88f240ddc..7df9f7f0abf1 100644
--- a/mm/failslab.c
+++ b/mm/failslab.c
@@ -45,7 +45,7 @@ static int __init failslab_debugfs_init(void)
 	err = init_fault_attr_dentries(&failslab.attr, "failslab");
 	if (err)
 		return err;
-	dir = failslab.attr.dentries.dir;
+	dir = failslab.attr.dir;
 
 	failslab.ignore_gfp_wait_file =
 		debugfs_create_bool("ignore-gfp-wait", mode, dir,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 094472377d81..72c6820a345c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1424,7 +1424,7 @@ static int __init fail_page_alloc_debugfs(void)
 				       "fail_page_alloc");
 	if (err)
 		return err;
-	dir = fail_page_alloc.attr.dentries.dir;
+	dir = fail_page_alloc.attr.dir;
 
 	fail_page_alloc.ignore_gfp_wait_file =
 		debugfs_create_bool("ignore-gfp-wait", mode, dir,
-- 
cgit v1.2.3

From b2588c4b4c3c075e9b45d61065d86c60de2b6441 Mon Sep 17 00:00:00 2001
From: Akinobu Mita
Date: Tue, 26 Jul 2011 16:09:03 -0700
Subject: fail_page_alloc: simplify debugfs initialization

Now that cleanup_fault_attr_dentries() recursively removes a directory,
we can simplify the error handling in the initialization code, and there
is no need to hold dentry structs for each debugfs file.

Signed-off-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/page_alloc.c | 47 ++++++++++++++++-------------------------------
 1 file changed, 16 insertions(+), 31 deletions(-)

(limited to 'mm/page_alloc.c')

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 72c6820a345c..1dbcf8888f14 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1370,21 +1370,12 @@ failed:
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
-static struct fail_page_alloc_attr {
+static struct {
 	struct fault_attr attr;
 
 	u32 ignore_gfp_highmem;
 	u32 ignore_gfp_wait;
 	u32 min_order;
-
-#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
-
-	struct dentry *ignore_gfp_highmem_file;
-	struct dentry *ignore_gfp_wait_file;
-	struct dentry *min_order_file;
-
-#endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */
-
 } fail_page_alloc = {
 	.attr = FAULT_ATTR_INITIALIZER,
 	.ignore_gfp_wait = 1,
@@ -1424,30 +1415,24 @@ static int __init fail_page_alloc_debugfs(void)
 				       "fail_page_alloc");
 	if (err)
 		return err;
+
 	dir = fail_page_alloc.attr.dir;
 
-	fail_page_alloc.ignore_gfp_wait_file =
-		debugfs_create_bool("ignore-gfp-wait", mode, dir,
-				      &fail_page_alloc.ignore_gfp_wait);
-
-	fail_page_alloc.ignore_gfp_highmem_file =
-		debugfs_create_bool("ignore-gfp-highmem", mode, dir,
-				      &fail_page_alloc.ignore_gfp_highmem);
-	fail_page_alloc.min_order_file =
-		debugfs_create_u32("min-order", mode, dir,
-				   &fail_page_alloc.min_order);
-
-	if (!fail_page_alloc.ignore_gfp_wait_file ||
-            !fail_page_alloc.ignore_gfp_highmem_file ||
-            !fail_page_alloc.min_order_file) {
-		err = -ENOMEM;
-		debugfs_remove(fail_page_alloc.ignore_gfp_wait_file);
-		debugfs_remove(fail_page_alloc.ignore_gfp_highmem_file);
-		debugfs_remove(fail_page_alloc.min_order_file);
-		cleanup_fault_attr_dentries(&fail_page_alloc.attr);
-	}
+	if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
+				&fail_page_alloc.ignore_gfp_wait))
+		goto fail;
+	if (!debugfs_create_bool("ignore-gfp-highmem", mode, dir,
+				&fail_page_alloc.ignore_gfp_highmem))
+		goto fail;
+	if (!debugfs_create_u32("min-order", mode, dir,
+				&fail_page_alloc.min_order))
+		goto fail;
+
+	return 0;
+fail:
+	cleanup_fault_attr_dentries(&fail_page_alloc.attr);
 
-	return err;
+	return -ENOMEM;
 }
 
 late_initcall(fail_page_alloc_debugfs);
-- 
cgit v1.2.3

From dd48c085c1cdf9446f92826f1fd451167fb6c2fd Mon Sep 17 00:00:00 2001
From: Akinobu Mita
Date: Wed, 3 Aug 2011 16:21:01 -0700
Subject: fault-injection: add ability to export fault_attr in arbitrary directory

init_fault_attr_dentries() is used to export fault_attr via debugfs.
But it can only export it in the debugfs root directory.

Per Forlin is working on mmc_fail_request which adds support to inject
data errors after a completed host transfer in the MMC subsystem.

The fault_attr for mmc_fail_request should be defined per mmc host and
exported in a per-host debugfs directory such as
/sys/kernel/debug/mmc0/mmc_fail_request.

init_fault_attr_dentries() doesn't help for mmc_fail_request.  So this
introduces fault_create_debugfs_attr(), which is able to create a
directory in an arbitrary directory, and replaces
init_fault_attr_dentries().
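
The new helper returns a pointer that may carry an error code, which
callers test with IS_ERR() and decode with PTR_ERR().  The convention can
be sketched in plain userspace C.  Everything prefixed sketch_ below is a
hypothetical stand-in written for this example, including the
struct sketch_dir type and the -EINVAL check; none of it is the kernel
implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095	/* error codes live in the top 4095 addresses */

/* Encode a negative errno value in a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the negative errno value from such a pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer encodes an error instead of a real object. */
static inline int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical directory node with an arbitrary parent. */
struct sketch_dir {
	const char *name;
	const struct sketch_dir *parent;
};

/* Returns a new directory below parent, or an encoded error. */
static struct sketch_dir *sketch_create_attr_dir(const char *name,
					const struct sketch_dir *parent)
{
	struct sketch_dir *dir;

	if (!name || !*name)
		return sketch_err_ptr(-EINVAL);	/* assumed check, for the demo */

	dir = malloc(sizeof(*dir));
	if (!dir)
		return sketch_err_ptr(-ENOMEM);

	dir->name = name;
	dir->parent = parent;
	return dir;
}

/* Caller pattern: one return value carries either the object or the error. */
static int sketch_init(const char *name, const struct sketch_dir *parent)
{
	struct sketch_dir *dir = sketch_create_attr_dir(name, parent);

	if (sketch_is_err(dir))
		return (int)sketch_ptr_err(dir);

	free(dir);
	return 0;
}
```

Returning the directory itself is what lets each caller add its own files
below it and remove the whole tree on failure, without storing a dentry
in struct fault_attr.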

[akpm@linux-foundation.org: extraneous semicolon, per Randy]
Signed-off-by: Akinobu Mita
Tested-by: Per Forlin
Cc: Jens Axboe
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Matt Mackall
Cc: Randy Dunlap
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/fault-injection/fault-injection.txt |  3 +--
 block/blk-core.c                                  |  6 ++++--
 block/blk-timeout.c                               |  5 ++++-
 include/linux/fault-inject.h                      | 18 +++++-------------
 lib/fault-inject.c                                | 20 +++++++-------------
 mm/failslab.c                                     | 14 +++++++-------
 mm/page_alloc.c                                   | 13 +++++--------
 7 files changed, 33 insertions(+), 46 deletions(-)

(limited to 'mm/page_alloc.c')

diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt
index 7be15e44d481..82a5d250d75e 100644
--- a/Documentation/fault-injection/fault-injection.txt
+++ b/Documentation/fault-injection/fault-injection.txt
@@ -143,8 +143,7 @@ o provide a way to configure fault attributes
   failslab, fail_page_alloc, and fail_make_request use this way.
   Helper functions:
 
-	init_fault_attr_dentries(entries, attr, name);
-	void cleanup_fault_attr_dentries(entries);
+	fault_create_debugfs_attr(name, parent, attr);
 
 - module parameters
 
diff --git a/block/blk-core.c b/block/blk-core.c
index b850bedad229..b627558c461f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1368,8 +1368,10 @@ static bool should_fail_request(struct hd_struct *part, unsigned int bytes)
 
 static int __init fail_make_request_debugfs(void)
 {
-	return init_fault_attr_dentries(&fail_make_request,
-					"fail_make_request");
+	struct dentry *dir = fault_create_debugfs_attr("fail_make_request",
+						NULL, &fail_make_request);
+
+	return IS_ERR(dir) ? PTR_ERR(dir) : 0;
 }
 
 late_initcall(fail_make_request_debugfs);
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 4f0c06c7a338..780354888958 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -28,7 +28,10 @@ int blk_should_fake_timeout(struct request_queue *q)
 
 static int __init fail_io_timeout_debugfs(void)
 {
-	return init_fault_attr_dentries(&fail_io_timeout, "fail_io_timeout");
+	struct dentry *dir = fault_create_debugfs_attr("fail_io_timeout",
+						NULL, &fail_io_timeout);
+
+	return IS_ERR(dir) ? PTR_ERR(dir) : 0;
 }
 
 late_initcall(fail_io_timeout_debugfs);
diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h
index 3ff060ac7810..c6f996f2abb6 100644
--- a/include/linux/fault-inject.h
+++ b/include/linux/fault-inject.h
@@ -25,10 +25,6 @@ struct fault_attr {
 	unsigned long reject_end;
 
 	unsigned long count;
-
-#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
-	struct dentry *dir;
-#endif
 };
 
 #define FAULT_ATTR_INITIALIZER {				\
@@ -45,19 +41,15 @@ bool should_fail(struct fault_attr *attr, ssize_t size);
 
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
 
-int init_fault_attr_dentries(struct fault_attr *attr, const char *name);
-void cleanup_fault_attr_dentries(struct fault_attr *attr);
+struct dentry *fault_create_debugfs_attr(const char *name,
+			struct dentry *parent, struct fault_attr *attr);
 
 #else /* CONFIG_FAULT_INJECTION_DEBUG_FS */
 
-static inline int init_fault_attr_dentries(struct fault_attr *attr,
-					   const char *name)
-{
-	return -ENODEV;
-}
-
-static inline void cleanup_fault_attr_dentries(struct fault_attr *attr)
+static inline struct dentry *fault_create_debugfs_attr(const char *name,
+			struct dentry *parent, struct fault_attr *attr)
 {
+	return ERR_PTR(-ENODEV);
 }
 
 #endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */
diff --git a/lib/fault-inject.c b/lib/fault-inject.c
index 2577b121c7c1..f193b7796449 100644
--- a/lib/fault-inject.c
+++ b/lib/fault-inject.c
@@ -197,21 +197,15 @@ static struct dentry *debugfs_create_atomic_t(const char *name, mode_t mode,
 	return debugfs_create_file(name, mode, parent, value, &fops_atomic_t);
 }
 
-void cleanup_fault_attr_dentries(struct fault_attr *attr)
-{
-	debugfs_remove_recursive(attr->dir);
-}
-
-int init_fault_attr_dentries(struct fault_attr *attr, const char *name)
+struct dentry *fault_create_debugfs_attr(const char *name,
+			struct dentry *parent, struct fault_attr *attr)
 {
 	mode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
 	struct dentry *dir;
 
-	dir = debugfs_create_dir(name, NULL);
+	dir = debugfs_create_dir(name, parent);
 	if (!dir)
-		return -ENOMEM;
-
-	attr->dir = dir;
+		return ERR_PTR(-ENOMEM);
 
 	if (!debugfs_create_ul("probability", mode, dir, &attr->probability))
 		goto fail;
@@ -243,11 +237,11 @@ int init_fault_attr_dentries(struct fault_attr *attr, const char *name)
 
 #endif /* CONFIG_FAULT_INJECTION_STACKTRACE_FILTER */
 
-	return 0;
+	return dir;
 fail:
-	debugfs_remove_recursive(attr->dir);
+	debugfs_remove_recursive(dir);
 
-	return -ENOMEM;
+	return ERR_PTR(-ENOMEM);
 }
 
 #endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */
diff --git a/mm/failslab.c b/mm/failslab.c
index 1ce58c201dca..0dd7b8fec71c 100644
--- a/mm/failslab.c
+++ b/mm/failslab.c
@@ -34,23 +34,23 @@ __setup("failslab=", setup_failslab);
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
 static int __init failslab_debugfs_init(void)
 {
+	struct dentry *dir;
 	mode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
-	int err;
 
-	err = init_fault_attr_dentries(&failslab.attr, "failslab");
-	if (err)
-		return err;
+	dir = fault_create_debugfs_attr("failslab", NULL, &failslab.attr);
+	if (IS_ERR(dir))
+		return PTR_ERR(dir);
 
-	if (!debugfs_create_bool("ignore-gfp-wait", mode, failslab.attr.dir,
+	if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
 				&failslab.ignore_gfp_wait))
 		goto fail;
-	if (!debugfs_create_bool("cache-filter", mode, failslab.attr.dir,
+	if (!debugfs_create_bool("cache-filter", mode, dir,
 				&failslab.cache_filter))
 		goto fail;
 
 	return 0;
 fail:
-	cleanup_fault_attr_dentries(&failslab.attr);
+	debugfs_remove_recursive(dir);
 
 	return -ENOMEM;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1dbcf8888f14..6e8ecb6e021c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1409,14 +1409,11 @@ static int __init fail_page_alloc_debugfs(void)
 {
 	mode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
 	struct dentry *dir;
-	int err;
 
-	err = init_fault_attr_dentries(&fail_page_alloc.attr,
-				       "fail_page_alloc");
-	if (err)
-		return err;
-
-	dir = fail_page_alloc.attr.dir;
+	dir = fault_create_debugfs_attr("fail_page_alloc", NULL,
+					&fail_page_alloc.attr);
+	if (IS_ERR(dir))
+		return PTR_ERR(dir);
 
 	if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
 				&fail_page_alloc.ignore_gfp_wait))
@@ -1430,7 +1427,7 @@ static int __init fail_page_alloc_debugfs(void)
 
 	return 0;
 fail:
-	cleanup_fault_attr_dentries(&fail_page_alloc.attr);
+	debugfs_remove_recursive(dir);
 
 	return -ENOMEM;
 }
-- 
cgit v1.2.3

From 4f31888c104687078f8d88c2f11eca1080c88464 Mon Sep 17 00:00:00 2001
From: Dave Jones
Date: Mon, 31 Oct 2011 17:07:24 -0700
Subject: mm: output a list of loaded modules when we hit bad_page()

When we get a bad_page bug report, it's useful to see what modules the
user had loaded.

Signed-off-by: Dave Jones
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/page_alloc.c | 1 +
 1 file changed, 1 insertion(+)

(limited to 'mm/page_alloc.c')

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6e021c..83a02052bce4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -318,6 +318,7 @@ static void bad_page(struct page *page)
 		current->comm, page_to_pfn(page));
 	dump_page(page);
 
+	print_modules();
 	dump_stack();
 out:
 	/* Leave bad fields for debug, except PageBuddy could make trouble */
-- 
cgit v1.2.3

From 3ee9a4f086716d792219c021e8509f91165a4128 Mon Sep 17 00:00:00 2001
From: Joe Perches
Date: Mon, 31 Oct 2011 17:08:35 -0700
Subject: mm: neaten warn_alloc_failed

Add __attribute__((format (printf...) to the function to validate format
and arguments.  Use vsprintf extension %pV to avoid any possible message
interleaving.
Coalesce format string.  Convert printks/pr_warning to pr_warn.

[akpm@linux-foundation.org: use the __printf() macro]
Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/mm.h |  3 ++-
 mm/page_alloc.c    | 16 +++++++++++-----
 mm/vmalloc.c       |  4 ++--
 3 files changed, 15 insertions(+), 8 deletions(-)

(limited to 'mm/page_alloc.c')

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7438071b44aa..3b3e3b8bb706 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1334,7 +1334,8 @@ extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 extern int after_bootmem;
 
-extern void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
+extern __printf(3, 4)
+void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
 
 extern void setup_per_cpu_pageset(void);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 83a02052bce4..9dd443d89d8b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1754,7 +1754,6 @@ static DEFINE_RATELIMIT_STATE(nopage_rs,
 
 void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 {
-	va_list args;
 	unsigned int filter = SHOW_MEM_FILTER_NODES;
 
 	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
@@ -1773,14 +1772,21 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 		filter &= ~SHOW_MEM_FILTER_NODES;
 
 	if (fmt) {
-		printk(KERN_WARNING);
+		struct va_format vaf;
+		va_list args;
+
 		va_start(args, fmt);
-		vprintk(fmt, args);
+
+		vaf.fmt = fmt;
+		vaf.va = &args;
+
+		pr_warn("%pV", &vaf);
+
 		va_end(args);
 	}
 
-	pr_warning("%s: page allocation failure: order:%d, mode:0x%x\n",
-		   current->comm, order, gfp_mask);
+	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
+		current->comm, order, gfp_mask);
 
 	dump_stack();
 	if (!should_suppress_show_mem())
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 56faf3163ee2..08ab0aa1406c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1593,8 +1593,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	return area->addr;
 
 fail:
-	warn_alloc_failed(gfp_mask, order, "vmalloc: allocation failure, "
-			  "allocated %ld of %ld bytes\n",
+	warn_alloc_failed(gfp_mask, order,
+			  "vmalloc: allocation failure, allocated %ld of %ld bytes\n",
 			  (area->nr_pages*PAGE_SIZE), area->size);
 	vfree(area->addr);
 	return NULL;
-- 
cgit v1.2.3
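
The two ideas in the warn_alloc_failed change, compile-time checking of
the format string and emitting the prefix and message in one call, can be
approximated in userspace.  This is a rough sketch under stated
assumptions: sketch_warn() and its buffer arguments are hypothetical, the
format attribute is the GCC/Clang extension that the kernel's __printf()
macro wraps, and the %pV specifier itself is kernel-only, so an
intermediate buffer stands in for it here.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Parameter 3 is the format string and the variadic arguments start at
 * parameter 4, so the compiler can check callers (GCC/Clang extension).
 */
static int sketch_warn(char *out, size_t outlen, const char *fmt, ...)
	__attribute__((format(printf, 3, 4)));

static int sketch_warn(char *out, size_t outlen, const char *fmt, ...)
{
	char msg[256];
	va_list args;

	/* Expand the caller's format and arguments first. */
	va_start(args, fmt);
	vsnprintf(msg, sizeof(msg), fmt, args);
	va_end(args);

	/* Prefix and message are emitted together by a single call. */
	return snprintf(out, outlen, "warning: %s", msg);
}
```

With the attribute in place, a call such as sketch_warn(buf, len, "%d",
"text") draws a format warning at compile time when -Wformat is enabled.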