path: root/include
Age  Commit message  Author  Files  Lines
2012-12-11mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalableIngo Molnar2-4/+15
rmap_walk_anon() and try_to_unmap_anon() appear to be too careful about locking the anon vma: while they need protection against anon vma list modifications, they do not need exclusive access to the list itself.

Transforming this exclusive lock to a read-locked rwsem removes a global lock from the hot path of page-migration intense threaded workloads which can cause pathological performance like this:

    96.43%  process  0  [kernel.kallsyms]  [k] perf_trace_sched_switch
            |
            --- perf_trace_sched_switch
                __schedule
                schedule
                schedule_preempt_disabled
                __mutex_lock_common.isra.6
                __mutex_lock_slowpath
                mutex_lock
               |
               |--50.61%-- rmap_walk
               |          move_to_new_page
               |          migrate_pages
               |          migrate_misplaced_page
               |          __do_numa_page.isra.69
               |          handle_pte_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          __memset_sse2
               |          |
               |           --100.00%-- worker_thread
               |                     |
               |                      --100.00%-- start_thread
               |
                --49.39%-- page_lock_anon_vma
                          try_to_unmap_anon
                          try_to_unmap
                          migrate_pages
                          migrate_misplaced_page
                          __do_numa_page.isra.69
                          handle_pte_fault
                          handle_mm_fault
                          __do_page_fault
                          do_page_fault
                          page_fault
                          __memset_sse2
                          |
                           --100.00%-- worker_thread
                                     start_thread

With this change applied the profile is now nicely flat and there's no anon-vma related scheduling/blocking.

Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(), to make it clearer that it's an exclusive write-lock in that case - suggested by Rik van Riel.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm/rmap: Convert the struct anon_vma::mutex to an rwsemIngo Molnar1-8/+8
Convert the struct anon_vma::mutex to an rwsem, which will help in solving a page-migration scalability problem. (Addressed in a separate patch.) The conversion is simple and straightforward: in every case where we mutex_lock()ed we'll now down_write(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Add THP migration for the NUMA working set scanning fault case build fixMel Gorman1-7/+9
Commit "Add THP migration for the NUMA working set scanning fault case" breaks the build because HPAGE_PMD_SHIFT and HPAGE_PMD_MASK are defined to explode without CONFIG_TRANSPARENT_HUGEPAGE:

  mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put':
  mm/migrate.c:1549: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
  mm/migrate.c:1564: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
  mm/migrate.c:1566: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
  mm/migrate.c:1573: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
  mm/migrate.c:1606: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
  mm/migrate.c:1648: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed

CONFIG_NUMA_BALANCING allows compilation without enabling transparent hugepages, so define the dummy function for such a configuration and only define migrate_misplaced_transhuge_page_put() when transparent hugepages are enabled.

Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Add THP migration for the NUMA working set scanning fault case.Mel Gorman1-0/+15
Note: This is very heavily based on a patch from Peter Zijlstra with fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch put a lot of migration logic into mm/huge_memory.c where it does not belong. This version tries to share some of the migration logic with migrate_misplaced_page. However, it should be noted that now migrate.c is doing more with the pagetable manipulation than is preferred. The end result is barely recognisable so as before, the signed-offs had to be removed but will be re-added if the original authors are ok with it. Add THP migration for the NUMA working set scanning fault case. It uses the page lock to serialize. No migration pte dance is necessary because the pte is already unmapped when we decide to migrate. [dhillf@gmail.com: Fix memory leak on isolation failure] [dhillf@gmail.com: Fix transfer of last_nid information] Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Delay PTE scanning until a task is scheduled on a new nodeMel Gorman1-0/+10
Due to the fact that migrations are driven by the CPU a task is running on there is no point tracking NUMA faults until one task runs on a new node. This patch tracks the first node used by an address space. Until it changes, PTE scanning is disabled and no NUMA hinting faults are trapped. This should help workloads that are short-lived, do not care about NUMA placement or have bound themselves to a single node. This takes advantage of the logic in "mm: sched: numa: Implement slow start for working set sampling" to delay when the checks are made. This will take advantage of processes that set their CPU and node bindings early in their lifetime. It will also potentially allow any initial load balancing to take place. Signed-off-by: Mel Gorman <mgorman@suse.de>
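The gating described in this entry can be modelled in a few lines. The following is a minimal userspace sketch in Python, not the kernel code: the class name `AddressSpaceScanGate` and its fields are hypothetical, and it only assumes what the entry states, namely that the first node used by an address space is remembered and that PTE scanning stays off until a task runs on a different node.

```python
class AddressSpaceScanGate:
    """Toy model: keep PTE scanning disabled until a new node is seen."""

    def __init__(self):
        self.first_nid = None          # first node the address space ran on
        self.scanning_enabled = False  # no NUMA hinting faults until True

    def task_ran_on(self, nid):
        """Record that a task using this address space ran on node `nid`."""
        if self.first_nid is None:
            self.first_nid = nid
        elif nid != self.first_nid:
            # The task moved to another node: placement may now matter.
            self.scanning_enabled = True
        return self.scanning_enabled
```

In this model a process that binds itself to one node early in its lifetime never enables scanning, which matches the workloads the entry says should benefit.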
2012-12-11mm: sched: numa: Control enabling and disabling of NUMA balancingMel Gorman1-0/+4
This patch adds Kconfig options and kernel parameters to allow the enabling and disabling of automatic NUMA balancing. The existence of such a switch was and is very important when debugging problems related to transparent hugepages and we should have the same for automatic NUMA placement. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrateMel Gorman2-2/+6
The PTE scanning rate and fault rates are two of the biggest sources of system CPU overhead with automatic NUMA placement. Ideally a proper policy would detect if a workload was properly placed, schedule and adjust the PTE scanning rate accordingly. We do not track the necessary information to do that but we at least know if we migrated or not. This patch scans slower if a page was not migrated as the result of a NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is now higher than the previous default. Once every minute it will reset the scanner in case of phase changes. This is hilariously crude and the numbers are arbitrary. Workloads will converge quite slowly in comparison to what a proper policy should be able to do. On the plus side, we will chew up less CPU for workloads that have no need for automatic balancing. Signed-off-by: Mel Gorman <mgorman@suse.de>
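The back-off described here can be sketched as two small functions. This is an illustration in Python under stated assumptions, not the kernel implementation: the step size, the minimum and maximum periods, and the function names are hypothetical; only the shape (slow down when a hinting fault does not migrate, cap at a maximum, reset once a minute) comes from the entry.

```python
def update_scan_period(period_ms, migrated, *, step_ms=10, period_max_ms=5000):
    """Lengthen the scan period when a hinting fault did not migrate a page.

    step_ms and period_max_ms are assumed values for illustration only.
    """
    if migrated:
        return period_ms
    return min(period_max_ms, period_ms + step_ms)


def maybe_reset(period_ms, now_ms, next_reset_ms, *,
                period_min_ms=100, reset_interval_ms=60_000):
    """Once per reset interval, drop back to the fastest period.

    Returns (new_period_ms, new_next_reset_ms).
    """
    if now_ms >= next_reset_ms:
        return period_min_ms, now_ms + reset_interval_ms
    return period_ms, next_reset_ms
```

The periodic reset is what lets a workload that changes phase be sampled quickly again, at the price of some wasted scanning for workloads that never needed it.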
2012-12-11mm: numa: Introduce last_nid to the page frameMel Gorman2-0/+34
This patch introduces a last_nid field to the page struct. This is used to build a two-stage filter in the next patch that is aimed at mitigating a problem whereby pages migrate to the wrong node when referenced by a process that was running off its home node. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Rate limit setting of pte_numa if node is saturatedMel Gorman1-0/+6
If there are a large number of NUMA hinting faults and all of them are resulting in migrations it may indicate that memory is just bouncing uselessly around. NUMA balancing cost is likely exceeding any benefit from locality. Rate limit the PTE updates if the node is migration rate-limited. As noted in the comments, this distorts the NUMA faulting statistics. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Structures for Migrate On Fault per NUMA migration rate limitingAndrea Arcangeli1-0/+13
This defines the per-node data used by Migrate On Fault in order to rate limit the migration. The rate limiting is applied independently to each destination node. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
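A per-destination-node limiter of this kind might look like the sketch below. It is written in Python as a model only: the window length, the page budget, and the class name `NodeMigrateLimiter` are assumptions, and the locking the kernel would need is left out. What it takes from the entry is that each destination node keeps its own independent budget.

```python
class NodeMigrateLimiter:
    """Toy per-node migration rate limiter (fixed window, page budget)."""

    def __init__(self, window_ms=100, max_pages_per_window=1000):
        self.window_ms = window_ms              # assumed window length
        self.max_pages = max_pages_per_window   # assumed budget per window
        self._state = {}                        # nid -> (window_end_ms, pages_used)

    def allow(self, dst_nid, nr_pages, now_ms):
        """Return True if migrating nr_pages to dst_nid fits the budget."""
        window_end, used = self._state.get(dst_nid, (now_ms + self.window_ms, 0))
        if now_ms >= window_end:
            # The previous window expired: start a fresh one.
            window_end, used = now_ms + self.window_ms, 0
        if used + nr_pages > self.max_pages:
            self._state[dst_nid] = (window_end, used)
            return False
        self._state[dst_nid] = (window_end, used + nr_pages)
        return True
```

Because state is keyed by destination node, saturating one node does not stop migrations towards another.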
2012-12-11mm: numa: Migrate on reference policyMel Gorman1-0/+1
This is the simplest possible policy that still does something of note. When a pte_numa is faulted, it is moved immediately. Any replacement policy must at least do better than this and in all likelihood this policy regresses normal workloads. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: numa: Add pte updates, hinting and migration statsMel Gorman2-0/+14
It is tricky to quantify the basic cost of automatic NUMA placement in a meaningful manner. This patch adds some vmstats that can be used as part of a basic costing model.

u         = basic unit = sizeof(void *)
Ca        = cost of struct page access = sizeof(struct page) / u
Cpte      = Cost PTE access = Ca
Cupdate   = Cost PTE update = (2 * Cpte) + (2 * Wlock)
            where Cpte is incurred twice for a read and a write and Wlock is a constant representing the cost of taking or releasing a lock
Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
Cpagerw   = Cost to read or write a full page = Ca + PAGE_SIZE/u
Ci        = Cost of page isolation = Ca + Wi
            where Wi is a constant that should reflect the approximate cost of the locking operation
Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
            where Wnuma is the approximate NUMA factor. 1 is local. 1.2 would imply that remote accesses are 20% more expensive

Balancing cost = Cpte * numa_pte_updates +
                 Cnumahint * numa_hint_faults +
                 Ci * numa_pages_migrated +
                 Cpagecopy * numa_pages_migrated

Note that numa_pages_migrated is used as a measure of how many pages were isolated even though it would miss pages that failed to migrate. A vmstat counter could have been added for it but the isolation cost is pretty marginal in comparison to the overall cost so it seemed overkill.

The ideal way to measure automatic placement benefit would be to count the number of remote accesses versus local accesses and do something like

	benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

but the information is not readily available. As a workload converges, the expectation would be that the number of remote numa hints would reduce to 0.

	convergence = numa_hint_faults_local / numa_hint_faults
		where this is measured for the last N number of numa hints recorded. When the workload is fully converged the value is 1.

This can measure if the placement policy is converging and how fast it is doing it.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com>
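As a worked example of the costing model in this entry, the formulas can be evaluated directly. The Python sketch below is illustrative only: the `CostParams` defaults (8-byte words, a 64-byte struct page, 4096-byte pages, and the weights Wi and Wnuma) are assumptions chosen for the example, and the function names are hypothetical. The counter names are the ones given in the entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    word_size: int = 8           # u = sizeof(void *), assumed 64-bit
    struct_page_size: int = 64   # assumed sizeof(struct page)
    page_size: int = 4096        # assumed PAGE_SIZE
    w_isolate: float = 1.0       # Wi, assumed locking weight
    w_numa: float = 1.2          # Wnuma, assumed NUMA factor
    c_numahint: float = 1000.0   # Cnumahint, "some high constant"


def balancing_cost(numa_pte_updates, numa_hint_faults, numa_pages_migrated,
                   p=CostParams()):
    """Evaluate the 'Balancing cost' formula from the entry."""
    ca = p.struct_page_size / p.word_size            # Ca
    cpte = ca                                        # Cpte
    cpagerw = ca + p.page_size / p.word_size         # Cpagerw
    ci = ca + p.w_isolate                            # Ci
    cpagecopy = cpagerw + cpagerw * p.w_numa + ci + ci * p.w_numa
    return (cpte * numa_pte_updates
            + p.c_numahint * numa_hint_faults
            + ci * numa_pages_migrated
            + cpagecopy * numa_pages_migrated)


def convergence(numa_hint_faults_local, numa_hint_faults):
    """Fraction of hinting faults that were local; 1.0 means converged.

    With no faults recorded the ratio is undefined; 0.0 is returned here
    purely as a convention of this sketch.
    """
    if numa_hint_faults == 0:
        return 0.0
    return numa_hint_faults_local / numa_hint_faults
```

The counters would normally be read from /proc/vmstat on a kernel that exports them; the absolute result is only meaningful for comparing two runs evaluated with the same constants.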
2012-12-11mm: sched: numa: Implement slow start for working set samplingPeter Zijlstra1-0/+1
Add a 1 second delay before starting to scan the working set of a task and starting to balance it amongst nodes.

[ note that before the constant per task WSS sampling rate patch the initial scan would happen much later still, in effect that patch caused this regression. ]

The theory is that short-run tasks benefit very little from NUMA placement: they come and go, and they better stick to the node they were started on. As tasks mature and rebalance to other CPUs and nodes, so does their NUMA placement have to change and so does it start to matter more and more.

In practice this change fixes an observable kbuild regression:

   # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

   !NUMA:
   45.291088843 seconds time elapsed  ( +- 0.40% )
   45.154231752 seconds time elapsed  ( +- 0.36% )

   +NUMA, no slow start:
   46.172308123 seconds time elapsed  ( +- 0.30% )
   46.343168745 seconds time elapsed  ( +- 0.25% )

   +NUMA, 1 sec slow start:
   45.224189155 seconds time elapsed  ( +- 0.25% )
   45.160866532 seconds time elapsed  ( +- 0.17% )

and it also fixes an observable perf bench (hackbench) regression:

   # perf stat --null --repeat 10 perf bench sched messaging

   -NUMA:                0.246225691 seconds time elapsed  ( +- 1.31% )
   +NUMA no slow start:  0.252620063 seconds time elapsed  ( +- 1.13% )
   +NUMA 1sec delay:     0.248076230 seconds time elapsed  ( +- 1.35% )

The implementation is simple and straightforward, most of the patch deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable knob.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> [ Wrote the changelog, ran measurements, tuned the default. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) ratePeter Zijlstra2-0/+4
Previously, to probe the working set of a task, we'd use a very simple and crude method: mark all of its address space PROT_NONE.

That method has various (obvious) disadvantages:

 - it samples the working set at dissimilar rates, giving some tasks a sampling quality advantage over others.
 - creates performance problems for tasks with very large working sets
 - over-samples processes with large address spaces but which only very rarely execute

Improve that method by keeping a rotating offset into the address space that marks the current position of the scan, and advance it by a constant rate (in a CPU cycles execution proportional manner). If the offset reaches the last mapped address of the mm then it starts over at the first address.

The per-task nature of the working set sampling functionality in this tree allows such constant rate, per task, execution-weight proportional sampling of the working set, with an adaptive sampling interval/frequency that goes from once per 100ms up to just once per 8 seconds. The current sampling volume is 256 MB per interval.

As tasks mature and converge their working set, so does the sampling rate slow down to just a trickle, 256 MB per 8 seconds of CPU time executed. This, beyond being adaptive, also rate-limits rarely executing systems and does not over-sample on overloaded systems.

[ In AutoNUMA speak, this patch deals with the effective sampling rate of the 'hinting page fault'. AutoNUMA's scanning is currently rate-limited, but it is also fundamentally single-threaded, executing in the knuma_scand kernel thread, so the limit in AutoNUMA is global and does not scale up with the number of CPUs, nor does it scan tasks in an execution proportional manner. So the idea of rate-limiting the scanning was first implemented in the AutoNUMA tree via a global rate limit. This patch goes beyond that by implementing an execution rate proportional working set sampling rate that is not implemented via a single global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com> Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> [ Wrote changelog and fixed bug. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
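The rotating offset from this entry can be illustrated with a small model. The Python sketch below is a simplification, not the kernel code: it treats the address space as one contiguous range between a first and last address (the real scanner walks individual mappings), and the function name `next_scan_window` is hypothetical. The 256 MB volume per interval is the figure quoted in the entry.

```python
MB = 1 << 20


def next_scan_window(offset, scan_size, first_addr, last_addr):
    """Return (start, end, new_offset) for one sampling interval.

    The window starts at the saved offset, covers at most scan_size bytes,
    and the offset wraps to first_addr once the last address is reached.
    """
    if offset < first_addr or offset >= last_addr:
        offset = first_addr
    end = min(offset + scan_size, last_addr)
    new_offset = first_addr if end >= last_addr else end
    return offset, end, new_offset


# Example: a 600 MB address space sampled 256 MB per interval.
offset = 0
for _ in range(4):
    start, end, offset = next_scan_window(offset, 256 * MB, 0, 600 * MB)
    print(f"scan [{start // MB} MB, {end // MB} MB)")
```

Each interval touches a bounded amount of memory regardless of how large the address space is, which is the property that removes the cost of marking everything PROT_NONE at once.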
2012-12-11mm: numa: Add fault driven placement and migrationPeter Zijlstra2-0/+31
NOTE: This patch is based on "sched, numa, mm: Add fault driven placement and migration policy" but as it throws away all the policy to just leave a basic foundation I had to drop the signed-offs-by. This patch creates a bare-bones method for setting PTEs pte_numa in the context of the scheduler that when faulted later will be faulted onto the node the CPU is running on. In itself this does nothing useful but any placement policy will fundamentally depend on receiving hints on placement from fault context and doing something intelligent about it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for nowMel Gorman1-3/+1
The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to explicitly request lazy migration is a good idea but the actual API has not been well reviewed and once released we have to support it. For now this patch prevents an application using the services. This will need to be revisited. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: mempolicy: Implement change_prot_numa() in terms of change_protection()Mel Gorman2-3/+4
This patch converts change_prot_numa() to use change_protection(). As pte_numa and friends check the PTE bits directly it is necessary for change_protection() to use pmd_mknuma(). Hence the required modifications to change_protection() are a little clumsy but the end result is that most of the numa page table helpers are just one or two instructions. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: mempolicy: Add MPOL_MF_LAZYLee Schermerhorn2-3/+15
NOTE: Once again there is a lot of patch stealing and the end result is sufficiently different that I had to drop the signed-offs. Will re-add if the original authors are ok with that. This patch adds another mbind() flag to request "lazy migration". The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected pages are marked PROT_NONE. The pages will be migrated in the fault path on "first touch", if the policy dictates at that time. "Lazy Migration" will allow testing of migrate-on-fault via mbind(). Also allows applications to specify that only subsequently touched pages be migrated to obey new policy, instead of all pages in range. This can be useful for multi-threaded applications working on a large shared data area that is initialized by an initial thread resulting in all pages on one [or a few, if overflowed] nodes. After PROT_NONE, the pages in regions assigned to the worker threads will be automatically migrated local to the threads on 1st touch. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: mempolicy: Use _PAGE_NUMA to migrate pagesMel Gorman1-4/+5
Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but sufficiently different that the signed-off-bys were dropped Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page() pieces into an effective migrate on fault scheme. Note that (on x86) we rely on PROT_NONE pages being !present and avoid the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the page-migration performance. Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: migrate: Introduce migrate_misplaced_page()Peter Zijlstra1-0/+11
Note: This was originally based on Peter's patch "mm/migrate: Introduce migrate_misplaced_page()" but borrows extremely heavily from Andrea's "autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection". The end result is barely recognisable so signed-offs had to be dropped. If original authors are ok with it, I'll re-add the signed-off-bys. Add migrate_misplaced_page() which deals with migrating pages from faults. Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Based-on-work-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: mempolicy: Check for misplaced pageLee Schermerhorn2-0/+9
This patch provides a new function to test whether a page resides on a node that is appropriate for the mempolicy for the vma and address where the page is supposed to be mapped. This involves looking up the node where the page belongs. So, the function returns that node so that it may be used to allocate the page without consulting the policy again. A subsequent patch will call this function from the fault path. Because of this, I don't want to go ahead and allocate the page, e.g., via alloc_page_vma() only to have to free it if it has the correct policy. So, I just mimic the alloc_page_vma() node computation logic--sort of. Note: we could use this function to implement a MPOL_MF_STRICT behavior when migrating pages to match mbind() mempolicy--e.g., to ensure that pages in an interleaved range are reinterleaved rather than left where they are when they reside on any node in the interleave nodemask. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> [ Added MPOL_F_LAZY to trigger migrate-on-fault; simplified code now that we don't have to bother with special crap for interleaved ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: mempolicy: Add MPOL_NOOPLee Schermerhorn1-0/+1
This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy to mbind(). When the NOOP policy is used with the 'MOVE' and 'LAZY' flags, mbind() will map the pages PROT_NONE so that they will be migrated on the next touch. This allows an application to prepare for a new phase of operation where different regions of shared storage will be assigned to worker threads, without changing policy. Note that we could just use "default" policy in this case. However, this also allows an application to request that pages be migrated, only if necessary, to follow any arbitrary policy that might currently apply to a range of pages, without knowing the policy, or without specifying multiple mbind()s for ranges with different policies. [ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ] Bug-Reported-by: Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: mempolicy: Make MPOL_LOCAL a real policyPeter Zijlstra1-0/+1
Make MPOL_LOCAL a real and exposed policy such that applications that relied on the previous default behaviour can explicitly request it. Requested-by: Christoph Lameter <cl@linux.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Create basic numa page hinting infrastructureMel Gorman1-0/+10
Note: This patch started as "mm/mpol: Create special PROT_NONE infrastructure" and preserves the basic idea but steals *very* heavily from "autonuma: numa hinting page faults entry points" for the actual fault handlers without the migration parts. The end result is barely recognisable as either patch so all Signed-off and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with this version, I will re-add the signed-offs-by to reflect the history. In order to facilitate a lazy -- fault driven -- migration of pages, create a special transient PAGE_NUMA variant, we can then use the 'spurious' protection faults to drive our migrations from. The meaning of PAGE_NUMA depends on the architecture but on x86 it is effectively PROT_NONE. Actual PROT_NONE mappings will not generate these NUMA faults for the reason that the page fault code checks the permission on the VMA (and will throw a segmentation fault on actual PROT_NONE mappings), before it ever calls handle_mm_fault. [dhillf@gmail.com: Fix typo] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: numa: Support NUMA hinting page faults from gup/gup_fastAndrea Arcangeli1-0/+1
Introduce FOLL_NUMA to tell follow_page to check pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do so because it always invokes handle_mm_fault and retries the follow_page later. KVM secondary MMU page faults will trigger the NUMA hinting page faults through gup_fast -> get_user_pages -> follow_page -> handle_mm_fault. Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages. [ This patch was picked up from the AutoNUMA tree. ] Originally-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> [ ported to this tree. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: numa: pte_numa() and pmd_numa()Andrea Arcangeli1-0/+106
Implement pte_numa and pmd_numa. We must atomically set the numa bit and clear the present bit to define a pte_numa or pmd_numa. Once a pte or pmd has been set as pte_numa or pmd_numa, the next time a thread touches a virtual address in the corresponding virtual range, a NUMA hinting page fault will trigger. The NUMA hinting page fault will clear the NUMA bit and set the present bit again to resolve the page fault. The expectation is that a NUMA hinting page fault is used as part of a placement policy that decides if a page should remain on the current node or migrated to a different node. Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
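The bit transitions described in this entry can be shown on a plain integer. The Python sketch below is a model only: the bit positions of `PAGE_PRESENT` and `PAGE_NUMA` are hypothetical, the helper names merely echo the terminology of the entry, and the atomicity the entry requires is not modelled.

```python
PAGE_PRESENT = 1 << 0   # hypothetical bit position
PAGE_NUMA = 1 << 8      # hypothetical bit position


def pte_mknuma(pte):
    """Set the NUMA bit and clear the present bit (one step in the kernel)."""
    return (pte | PAGE_NUMA) & ~PAGE_PRESENT


def pte_mknonnuma(pte):
    """Resolve a hinting fault: clear the NUMA bit, set present again."""
    return (pte & ~PAGE_NUMA) | PAGE_PRESENT


def pte_numa(pte):
    """True only when the NUMA bit is set and the present bit is clear."""
    return (pte & (PAGE_NUMA | PAGE_PRESENT)) == PAGE_NUMA
```

Since the present bit is clear in the NUMA state, the next access traps, and the fault handler can apply a placement decision before restoring the entry.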
2012-12-11mm: compaction: Add scanned and isolated counters for compactionMel Gorman1-0/+2
Compaction already has tracepoints to count scanned and isolated pages but it requires that ftrace be enabled and if that information has to be written to disk then it can be disruptive. This patch adds vmstat counters for compaction called compact_migrate_scanned, compact_free_scanned and compact_isolated.

With these counters, it is possible to define a basic cost model for compaction. This approximates how much work compaction is doing and can be compared with an oprofile showing TLB misses to see if the cost of compaction is being offset by THP, for example. Minimally a compaction patch can be evaluated in terms of whether it increases or decreases cost. The basic cost model looks like this

Fundamental unit u: a word sizeof(void *)

Ca  = cost of struct page access = sizeof(struct page) / u
Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
Cmf = Cost migrate failure = Ca * 2
Ci  = Cost page isolation = (Ca + Wi)
      where Wi is a constant that should reflect the approximate cost of the locking operation.
Csm = Cost migrate scanning = Ca
Csf = Cost free scanning = Ca

Overall cost = (Csm * compact_migrate_scanned) +
               (Csf * compact_free_scanned) +
               (Ci * compact_isolated) +
               (Cmc * pgmigrate_success) +
               (Cmf * pgmigrate_fail)

Where the values are read from /proc/vmstat.

This is very basic and ignores certain costs such as the allocation cost to do a migrate page copy but any improvement to the model would still use the same vmstat counters.

Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
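The compaction cost model in this entry lends itself to a short calculation over /proc/vmstat-style text. The Python sketch below is illustrative: the word size, struct page size, page size and Wi are assumed values, both function names are hypothetical, and the failure counter is read as pgmigrate_fail, the name used by the migration stats entries elsewhere in this log.

```python
def parse_vmstat(text):
    """Parse 'name value' lines, as found in /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def compaction_cost(counters, *, word_size=8, struct_page_size=64,
                    page_size=4096, w_isolate=1.0):
    """Evaluate the 'Overall cost' formula; all constants are assumptions."""
    ca = struct_page_size / word_size            # Ca
    cmc = (ca + page_size / word_size) * 2       # Cmc
    cmf = ca * 2                                 # Cmf
    ci = ca + w_isolate                          # Ci
    csm = csf = ca                               # Csm, Csf
    get = counters.get
    return (csm * get("compact_migrate_scanned", 0)
            + csf * get("compact_free_scanned", 0)
            + ci * get("compact_isolated", 0)
            + cmc * get("pgmigrate_success", 0)
            + cmf * get("pgmigrate_fail", 0))


sample = """compact_migrate_scanned 10
compact_free_scanned 20
compact_isolated 4
pgmigrate_success 3
pgmigrate_fail 1
"""
print(compaction_cost(parse_vmstat(sample)))
```

On a live system the text would come from reading /proc/vmstat before and after a workload and differencing the counters; the result is a relative figure for comparing patches, not a time measurement.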
2012-12-11mm: migrate: Add a tracepoint for migrate_pagesMel Gorman2-2/+62
The pgmigrate_success and pgmigrate_fail vmstat counters tell the user about migration activity but not the type or the reason. This patch adds a tracepoint to identify the type of page migration and why the page is being migrated. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: compaction: Move migration fail/success stats to migrate.cMel Gorman1-1/+3
The compact_pages_moved and compact_pagemigrate_failed events are convenient for determining if compaction is active and to what degree migration is succeeding but it's at the wrong level. Other users of migration may also want to know if migration is working properly and this will be particularly true for any automated NUMA migration. This patch moves the counters down to migration with the new events called pgmigrate_success and pgmigrate_fail. The compact_blocks_moved counter is removed because while it was useful for debugging initially, it's worthless now as no meaningful conclusions can be drawn from its value. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: Count the number of pages affected in change_protection()Peter Zijlstra2-2/+9
This will be used for three kinds of purposes:

 - to optimize mprotect()
 - to speed up working set scanning for working set areas that have not been touched
 - to more accurately scan per real working set

No change in functionality from this patch.

Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-12-11x86/mm: Introduce pte_accessible()Rik van Riel1-0/+4
We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that the pte is associated with a page. However, for TLB flushing purposes, we would like to know whether the pte points to an actually accessible page. This allows us to skip remote TLB flushes for pages that are not actually accessible. Fill in this method for x86 and provide a safe (but slower) method on other architectures. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Fixed-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org [ Added Linus's review fixes. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-11-17Merge branch 'akpm' (Fixes from Andrew)Linus Torvalds3-5/+3
Merge misc fixes from Andrew Morton.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (12 patches)
  revert "mm: fix-up zone present pages"
  tmpfs: change final i_blocks BUG to WARNING
  tmpfs: fix shmem_getpage_gfp() VM_BUG_ON
  mm: highmem: don't treat PKMAP_ADDR(LAST_PKMAP) as a highmem address
  mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
  rapidio: fix kernel-doc warnings
  swapfile: fix name leak in swapoff
  memcg: fix hotplugged memory zone oops
  mips, arc: fix build failure
  memcg: oom: fix totalpages calculation for memory.swappiness==0
  mm: fix build warning for uninitialized value
  mm: add anon_vma_lock to validate_mm()
2012-11-17revert "mm: fix-up zone present pages"Andrew Morton1-4/+0
Revert commit 7f1290f2f2a4 ("mm: fix-up zone present pages") That patch tried to fix an issue when calculating zone->present_pages, but it caused a regression on 32bit systems with HIGHMEM. With that change, reset_zone_present_pages() resets all zone->present_pages to zero, and fixup_zone_present_pages() is called to recalculate zone->present_pages when the boot allocator frees core memory pages into buddy allocator. Because highmem pages are not freed by bootmem allocator, all highmem zones' present_pages becomes zero. Various options for improving the situation are being discussed but for now, let's return to the 3.6 code. Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Chris Clayton <chris2553@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-17rapidio: fix kernel-doc warningsRandy Dunlap1-0/+2
Fix rapidio kernel-doc warnings: Warning(drivers/rapidio/rio.c:415): No description found for parameter 'local' Warning(drivers/rapidio/rio.c:415): Excess function parameter 'lstart' description in 'rio_map_inb_region' Warning(include/linux/rio.h:290): No description found for parameter 'switches' Warning(include/linux/rio.h:290): No description found for parameter 'destid_table' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Matt Porter <mporter@kernel.crashing.org> Acked-by: Alexandre Bounine <alexandre.bounine@idt.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-17memcg: fix hotplugged memory zone oopsHugh Dickins1-1/+1
When MEMCG is configured on (even when it's disabled by boot option), when adding or removing a page to/from its lru list, the zone pointer used for stats updates is nowadays taken from the struct lruvec. (On many configurations, calculating zone from page is slower.) But we have no code to update all the lruvecs (per zone, per memcg) when a memory node is hotadded. Here's an extract from the oops which results when running numactl to bind a program to a newly onlined node: BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60 IP: __mod_zone_page_state+0x9/0x60 Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0) Call Trace: __pagevec_lru_add_fn+0xdf/0x140 pagevec_lru_move_fn+0xb1/0x100 __pagevec_lru_add+0x1c/0x30 lru_add_drain_cpu+0xa3/0x130 lru_add_drain+0x2f/0x40 ... The natural solution might be to use a memcg callback whenever memory is hotadded; but that solution has not been scoped out, and it happens that we do have an easy location at which to update lruvec->zone. The lruvec pointer is discovered either by mem_cgroup_zone_lruvec() or by mem_cgroup_page_lruvec(), and both of those do know the right zone. So check and set lruvec->zone in those; and remove the inadequate attempt to set lruvec->zone from lruvec_init(), which is called before NODE_DATA(node) has been allocated in such cases. Ah, there was one exception. For no particularly good reason, mem_cgroup_force_empty_list() has its own code for deciding lruvec. Change it to use the standard mem_cgroup_zone_lruvec() and mem_cgroup_get_lru_size() too. In fact it was already safe against such an oops (the lru lists in danger could only be empty), but we're better proofed against future changes this way.
I've marked this for stable (3.6) since we introduced the problem in 3.5 (now closed to stable); but I have no idea if this is the only fix needed to get memory hotadd working with memcg in 3.6, and received no answer when I enquired twice before. Reported-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
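The "check and set lruvec->zone" idea described in this message can be sketched roughly as follows. This is a simplified user-space model under stated assumptions: the `model_` structures and the lookup helper are hypothetical stand-ins, not the memcg code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the kernel structures. */
struct model_zone { int id; };
struct model_lruvec { struct model_zone *zone; };

/*
 * Model of a lookup helper that already knows the right zone: instead of
 * trusting an init-time value (which may predate a hot-added node), it
 * repairs the back-pointer lazily each time the lruvec is handed out.
 */
static struct model_lruvec *model_zone_lruvec(struct model_lruvec *lruvec,
					      struct model_zone *zone)
{
	if (lruvec->zone != zone)
		lruvec->zone = zone;
	return lruvec;
}
```

The design choice illustrated here is that the fix lives on the lookup path, so no hotplug callback is needed; the cost is one comparison per lookup.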
2012-11-16mm, oom: reintroduce /proc/pid/oom_adjDavid Rientjes1-0/+9
This is mostly a revert of 01dc52ebdf47 ("oom: remove deprecated oom_adj") from Davidlohr Bueso. It reintroduces /proc/pid/oom_adj for backwards compatibility with earlier kernels. It simply scales the value linearly when /proc/pid/oom_score_adj is written. The major difference is that its scheduled removal is no longer included in Documentation/feature-removal-schedule.txt. We do warn users with a single printk, though, to suggest the more powerful and supported /proc/pid/oom_score_adj interface. Reported-by: Artem S. Tashkinov <t.artem@lycos.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
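A linear scaling between the two interfaces could look like the sketch below. The constants and the special case for the maximum value are assumptions written from memory of the legacy ranges (oom_adj from -17 to 15, oom_score_adj from -1000 to 1000); the `model_` names are hypothetical and the patch itself should be consulted for the exact behaviour.

```c
#include <assert.h>

/* Assumed legacy ranges; names are hypothetical for this sketch. */
#define MODEL_OOM_DISABLE        (-17)
#define MODEL_OOM_ADJUST_MAX     15
#define MODEL_OOM_SCORE_ADJ_MAX  1000

/* Map a legacy oom_adj value onto the oom_score_adj scale. */
static int model_oom_adj_to_score_adj(int oom_adj)
{
	/* Assumed special case: the legacy maximum maps to the new maximum. */
	if (oom_adj == MODEL_OOM_ADJUST_MAX)
		return MODEL_OOM_SCORE_ADJ_MAX;

	/* Linear scaling: -17 maps to -1000, 0 maps to 0. */
	return (oom_adj * MODEL_OOM_SCORE_ADJ_MAX) / -MODEL_OOM_DISABLE;
}
```

Because the division truncates, intermediate values lose precision (8 becomes 470 in this model), which is one reason the finer-grained interface is the recommended one.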
2012-11-16Merge tag 'fixes-for-linus' of ↵Linus Torvalds1-0/+31
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC fixes from Olof Johansson: "We've been sitting on this longer than we meant to due to travel and other activities, but the number of patches is luckily not that high. Biggest changes are from a batch of OMAP bugfixes, but there are a few for the broader set of SoCs too (bcm2835, pxa, highbank, tegra, at91 and i.MX). The OMAP patches contain some fixes for MUSB/PHY on omap4 which ends up being a bit on the large side but needed for legacy (non-DT) platforms. Beyond that there are a handful of hwmod/pm changes. So, fairly noncontroversial stuff all in all, and as usual around this time the fixes are well targeted at specific problems." * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: ARM: imx: ehci: fix host power mask bit ARM i.MX: fix error-valued pointer dereference in clk_register_gate2() ARM: at91/usbh: fix overcurrent gpio setup ARM: at91/AT91SAM9G45: fix crypto peripherals irq issue due to sparse irq support ARM: boot: Fix usage of kecho ARM: OMAP: ocp2scp: create omap device for ocp2scp ARM: OMAP4: add _dev_attr_ to ocp2scp for representing usb_phy drivers: bus: ocp2scp: add pdata support irqchip: irq-bcm2835: Add terminating entry for of_device_id table ARM: highbank: retry wfi on reset request ARM: OMAP4: PM: fix regulator name for VDD_MPU ARM: OMAP4: hwmod data: do not enable or reset the McPDM during kernel init ARM: OMAP2+: hwmod: add flag to prevent hwmod code from touching IP block during init ARM: dt: tegra: fix length of pad control and mux registers ARM: OMAP: hwmod: wait for sysreset complete after enabling hwmod ARM: OMAP2+: clockdomain: Fix OMAP4 ISS clk domain to support only SWSUP ARM: pxa/spitz_pm: Fix hang when resuming from STR ARM: pxa: hx4700: Fix backlight PWM device number ARM: OMAP2+: PM: add missing newline to VC warning message
2012-11-16Merge tag 'imx-fixes-rc' of git://git.pengutronix.de/git/imx/linux-2.6 into ↵Arnd Bergmann6-7/+40
fixes From Sascha Hauer <s.hauer@pengutronix.de>: ARM i.MX fixes for 3.7-rc * tag 'imx-fixes-rc' of git://git.pengutronix.de/git/imx/linux-2.6: ARM: imx: ehci: fix host power mask bit ARM i.MX: fix error-valued pointer dereference in clk_register_gate2() Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2012-11-15clk: remove inline usage from clk-provider.hIgor Mazanov1-2/+2
Users of GCC 4.7 have reported compiler errors due to having inline applied to function declarations in clk-provider.h. The definitions exist in drivers/clk/clk.c. An example error: In file included from arch/arm/mach-omap2/clockdomain.c:25:0: arch/arm/mach-omap2/clockdomain.c: In function ‘clkdm_clk_disable’: include/linux/clk-provider.h:338:12: error: inlining failed in call to always_inline ‘__clk_get_enable_count’: function body not available arch/arm/mach-omap2/clockdomain.c:1001:28: error: called from here make[1]: *** [arch/arm/mach-omap2/clockdomain.o] Error 1 make: *** [arch/arm/mach-omap2] Error 2 This patch removes the use of inline from include/linux/clk-provider.h but keeps the function definitions in drivers/clk/clk.c as inlined since they are one-liners. Signed-off-by: Igor Mazanov <i.mazanov@gmail.com> Acked-by: Paul Walmsley <paul@pwsan.com> Signed-off-by: Mike Turquette <mturquette@linaro.org> [mturquette@linaro.org: improved subject, added changelog]
2012-11-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-1/+2
Pull networking fixes from David Miller: "Bug fixes galore, mostly in drivers as is often the case: 1) USB gadget and cdc_eem drivers need adjustments to their frame size lengths in order to handle VLANs correctly. From Ian Coolidge. 2) TIPC and several network drivers erroneously call tasklet_disable before tasklet_kill, fix from Xiaotian Feng. 3) r8169 driver needs to apply the WOL suspend quirk to more chipsets, fix from Cyril Brulebois. 4) Fix multicast filters on RTL_GIGA_MAC_VER_35 r8169 chips, from Nathan Walp. 5) FDB netlink dumps should use RTM_NEWNEIGH as the message type, not zero. From John Fastabend. 6) Fix smsc95xx tx checksum offload on big-endian, from Steve Glendinning. 7) __inet_diag_dump() needs to respect and report the error value returned from inet_diag_lock_handler() rather than ignore it. Otherwise if an inet diag handler is not available for a particular protocol, we essentially report success instead of giving an error indication. Fix from Cyrill Gorcunov. 8) When the QFQ packet scheduler sees TSO/GSO packets it does not handle things properly, and in fact ends up corrupting its datastructures as well as mis-scheduling packets. Fix from Paolo Valente. 9) Fix oopser in skb_loop_sk(), from Eric Leblond. 10) CXGB4 passes partially uninitialized datastructures in to FW commands, fix from Vipul Pandya. 11) When we send unsolicited ipv6 neighbour advertisements, we should send them to the link-local allnodes multicast address, as per RFC4861. Fix from Hannes Frederic Sowa. 12) There is some kind of bug in the usbnet's kevent deferral mechanism, but more immediately, when it triggers, an uncontrolled stream of kernel messages spams the log. Rate limit the error log message triggered when this problem occurs, as sending thousands of error messages into the kernel log doesn't help matters at all, and in fact makes further diagnosis more difficult. From Steve Glendinning. 13) Fix gianfar restore from hibernation, from Wang Dongsheng.
14) The netlink message attribute sizes are wrong in the ipv6 GRE driver, it was using the size of ipv4 addresses instead of ipv6 ones :-) Fix from Nicolas Dichtel." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: gre6: fix rtnl dump messages gianfar: ethernet vanishes after restoring from hibernation usbnet: ratelimit kevent may have been dropped warnings ipv6: send unsolicited neighbour advertisements to all-nodes net: usb: cdc_eem: Fix rx skb allocation for 802.1Q VLANs usb: gadget: g_ether: fix frame size check for 802.1Q cxgb4: Fix initialization of SGE_CONTROL register isdn: Make CONFIG_ISDN depend on CONFIG_NETDEVICES cxgb4: Initialize data structures before using. af-packet: fix oops when socket is not present pkt_sched: enable QFQ to support TSO/GSO net: inet_diag -- Return error code if protocol handler is missed net: bnx2x: Fix typo in bnx2x driver smsc95xx: fix tx checksum offload for big endian rtnetlink: Use nlmsg type RTM_NEWNEIGH from dflt fdb dump ptp: update adjfreq callback description r8169: allow multicast packets on sub-8168f chipset. r8169: Fix WoL on RTL8168d/8111d. drivers/net: use tasklet_kill in device remove/close process tipc: do not use tasklet_disable before tasklet_kill
2012-11-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds1-0/+2
Pull sparc fixes from David Miller: "Several build/bug fixes for sparc, including: 1) Configuring a mix of static vs. modular sparc64 crypto modules didn't work, remove an ill-conceived attempt to only have to build the device match table for these drivers once to fix the problem. Reported by Meelis Roos. 2) Make the Montgomery multiply/square and mpmul instructions actually usable in 32-bit tasks. Essentially this involves providing 32-bit userspace with a way to use a 64-bit stack when it needs to. 3) Our sparc64 atomic backoffs don't yield cpu strands properly on Niagara chips. Use pause instruction when available to achieve this, otherwise use a benign instruction we know blocks the strand for some time. 4) Wire up kcmp 5) Fix the build of various drivers by removing the unnecessary blocking of OF_GPIO when SPARC. 6) Fix unintended regression wherein of_address_to_resource stopped being provided. Fix from Andreas Larsson. 7) Fix NULL dereference in leon_handle_ext_irq(), also from Andreas Larsson." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc64: Fix build with mix of modular vs. non-modular crypto drivers. sparc: Support atomic64_dec_if_positive properly. of/address: sparc: Declare of_address_to_resource() as an extern function for sparc again sparc32, leon: Check for existent irq_map entry in leon_handle_ext_irq sparc: Add sparc support for platform_get_irq() sparc: Allow OF_GPIO on sparc. qlogicpti: Fix build warning. sparc: Wire up sys_kcmp. sparc64: Improvde documentation and readability of atomic backoff code. sparc64: Use pause instruction when available. sparc64: Fix cpu strand yielding. sparc64: Make montmul/montsqr/mpmul usable in 32-bit threads.
2012-11-10Merge tag 'stable/for-linus-3.7-rc5-tag' of ↵Linus Torvalds1-2/+32
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull Xen fixes from Konrad Rzeszutek Wilk: "There are three ARM compile fixes (we forgot to export certain functions and if the drivers are built as a module - we go belly-up). There is also a mismatch in the irq_enter() / exit_idle() call sequence, which was fixed some time ago in other pieces of code but failed to appear in the Xen code. Lastly, a fix to help with troubleshooting in the field in case we cannot get the appropriate parameter, and also fallback code for working with very old hypervisors." Bug-fixes: - Fix compile issues on ARM. - Fix hypercall fallback code for old hypervisors. - Print out which HVM parameter failed if it fails. - Fix idle notifier call after irq_enter. * tag 'stable/for-linus-3.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/arm: Fix compile errors when drivers are compiled as modules (export more). xen/arm: Fix compile errors when drivers are compiled as modules. xen/generic: Disable fallback build on ARM. xen/events: fix RCU warning, or Call idle notifier after irq_enter() xen/hvm: If we fail to fetch an HVM parameter print out which flag it is. xen/hypercall: fix hypercall fallback code for very old hypervisors
2012-11-10of/address: sparc: Declare of_address_to_resource() as an extern function ↵Andreas Larsson1-0/+2
for sparc again This bug-fix makes sure that of_address_to_resource is defined extern for sparc so that the sparc-specific implementation of of_address_to_resource() is once again used when including include/linux/of_address.h in a sparc context. A number of drivers in mainline rely on this function working for sparc. The bug was introduced in a850a7554442f08d3e910c6eeb4ee216868dda1e, "of/address: add empty static inlines for !CONFIG_OF". Contrary to that commit title, the static inlines are added for !CONFIG_OF_ADDRESS, and CONFIG_OF_ADDRESS is never defined for sparc. This is good behavior for the other functions in include/linux/of_address.h, as the extern functions defined in drivers/of/address.c only get linked when OF_ADDRESS is configured. However, for of_address_to_resource there exists a sparc-specific implementation in arch/sparc/kernel/of_device_common.c. Solution suggested by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andreas Larsson <andreas@gaisler.com> Acked-by: Rob Herring <rob.herring@calxeda.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-10Merge tag 'mmc-fixes-for-3.7-rc5' of ↵Linus Torvalds2-3/+4
git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc Pull MMC fixes from Chris Ball: - sdhci: fix a NULL dereference at resume-time, seen on OLPC XO-4 - sdhci: fix against 3.7-rc1 for UHS modes without a vqmmc regulator - sdhci-of-esdhc: disable CMD23 on boards where it's broken - sdhci-s3c: fix against 3.7-rc1 for card detection with runtime PM - dw_mmc, omap_hsmmc: fix potential NULL derefs, compiler warnings * tag 'mmc-fixes-for-3.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: mmc: sdhci-s3c: fix the card detection in runtime-pm mmc: sdhci-s3c: use clk_prepare_enable and clk_disable_unprepare mmc: dw_mmc: constify dw_mci_idmac_ops in exynos back-end mmc: dw_mmc: fix modular build for exynos back-end mmc: sdhci: fix NULL dereference in sdhci_request() tuning mmc: sdhci: fix IS_ERR() checking of regulator_get() mmc: fix sdhci-dove probe/removal mmc: sh_mmcif: fix use after free mmc: sdhci-pci: fix 'Invalid iomem size' error message condition mmc: mxcmmc: Fix MODULE_ALIAS mmc: omap_hsmmc: fix NULL pointer dereference for dt boot mmc: omap_hsmmc: fix host reference after mmc_free_host mmc: dw_mmc: fix multiple drv_data NULL dereferences mmc: dw_mmc: enable controller interrupt before calling mmc_start_host mmc: sdhci-of-esdhc: disable CMD23 for some Freescale SoCs mmc: dw_mmc: remove _dev_info compile warning mmc: dw_mmc: convert the variable type of irq
2012-11-09revert "epoll: support for disabling items, and a self-test app"Andrew Morton1-1/+0
Revert commit 03a7beb55b9f ("epoll: support for disabling items, and a self-test app") pending resolution of the issues identified by Michael Kerrisk, copied below. We'll revisit this for 3.8. : I've taken a look at this patch as it currently stands in 3.7-rc1, and : done a bit of testing. (By the way, the test program : tools/testing/selftests/epoll/test_epoll.c does not compile...) : : There are one or two places where the behavior seems a little strange, : so I have a question or two at the end of this mail. But other than : that, I want to check my understanding so that the interface can be : correctly documented. : : Just to go through my understanding, the problem is the following : scenario in a multithreaded application: : : 1. Multiple threads are performing epoll_wait() operations, : and maintaining a user-space cache that contains information : corresponding to each file descriptor being monitored by : epoll_wait(). : : 2. At some point, a thread wants to delete (EPOLL_CTL_DEL) : a file descriptor from the epoll interest list, and : delete the corresponding record from the user-space cache. : : 3. The problem with (2) is that some other thread may have : previously done an epoll_wait() that retrieved information : about the fd in question, and may be in the middle of using : information in the cache that relates to that fd. Thus, : there is a potential race. : : 4. The race can't be solved purely in user space, because doing : so would require applying a mutex across the epoll_wait() : call, which would of course blow thread concurrency. : : Right? : : Your solution is the EPOLL_CTL_DISABLE operation. I want to : confirm my understanding about how to use this flag, since : the description that has accompanied the patches so far : has been a bit sparse : : 0.
In the scenario you're concerned about, deleting a file : descriptor means (safely) doing the following: : (a) Deleting the file descriptor from the epoll interest list : using EPOLL_CTL_DEL : (b) Deleting the corresponding record in the user-space cache : : 1. It's only meaningful to use this EPOLL_CTL_DISABLE in : conjunction with EPOLLONESHOT. : : 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in : conjunction is a logical error. : : 3. The correct way to code multithreaded applications using : EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows: : : a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should : specify EPOLLONESHOT. : : b. When a thread wants to delete a file descriptor, it : should do the following: : : [1] Call epoll_ctl(EPOLL_CTL_DISABLE) : [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE) : was zero, then the file descriptor can be safely : deleted by the thread that made this call. : [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY, : then the descriptor is in use. In this case, the calling : thread should set a flag in the user-space cache to : indicate that the thread that is using the descriptor : should perform the deletion operation. : : Is all of the above correct? : : The implementation depends on checking on whether : (events & ~EP_PRIVATE_BITS) == 0 : This relies on the fact that EPOLL_CTL_ADD and EPOLL_CTL_MOD always : set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT : causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be : cleared. : : A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE : is only useful in conjunction with EPOLLONESHOT. However, as things : stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does : not have EPOLLONESHOT set in 'events'. This results in the following : (slightly surprising) behavior: : : (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0 : (the indicator that the file descriptor can be safely deleted).
: (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY. : : This doesn't seem particularly useful, and in fact is probably an : indication that the user made a logic error: they should only be using : epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which : EPOLLONESHOT was set in 'events'. If that is correct, then would it : not make sense to return an error to user space for this case? Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Paton J. Lewis" <palewis@adobe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
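The behaviour described in the quoted mail can be reproduced with a small user-space model of the `(events & ~EP_PRIVATE_BITS) == 0` check. Note that EPOLL_CTL_DISABLE was reverted by this very commit, so nothing here is a usable API: the `model_` names, the flag values and the assumed contents of the private-bits mask are all hypothetical and exist only to illustrate the reasoning.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values for this model only. */
#define MODEL_EPOLLIN      UINT32_C(0x001)
#define MODEL_EPOLLERR     UINT32_C(0x008)
#define MODEL_EPOLLHUP     UINT32_C(0x010)
#define MODEL_EPOLLONESHOT (UINT32_C(1) << 30)
#define MODEL_EPOLLET      (UINT32_C(1) << 31)

/* Assumption: the private bits are the one-shot and edge-trigger flags. */
#define MODEL_EP_PRIVATE_BITS (MODEL_EPOLLONESHOT | MODEL_EPOLLET)

/* Model of the disable operation acting on an item's event mask. */
static int model_ctl_disable(uint32_t *events)
{
	/* No event bits left: the item is already disarmed, i.e. "in use". */
	if ((*events & ~MODEL_EP_PRIVATE_BITS) == 0)
		return -EBUSY;

	/* Disarm the item so that no further events are delivered. */
	*events &= MODEL_EP_PRIVATE_BITS;
	return 0;
}

/* Model of one-shot delivery: a delivered event disarms the item. */
static void model_oneshot_delivered(uint32_t *events)
{
	if (*events & MODEL_EPOLLONESHOT)
		*events &= MODEL_EP_PRIVATE_BITS;
}
```

On an item without the one-shot flag, the first call in this model returns 0 and the second returns -EBUSY, which matches the surprising (a)/(b) sequence reported in the mail.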
2012-11-08mmc: dw_mmc: constify dw_mci_idmac_ops in exynos back-endArnd Bergmann1-2/+2
The of_device_id match data is now marked as const and must not be modified. This changes the dw_mmc to mark all pointers passing the dw_mci_drv_data or dw_mci_dma_ops structures as const, and also marks the static definitions as const. drivers/mmc/host/dw_mmc-exynos.c: In function 'dw_mci_exynos_probe': drivers/mmc/host/dw_mmc-exynos.c:234:11: warning: assignment discards 'const' qualifier from pointer target type [enabled by default] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Abraham <thomas.abraham@linaro.org> Cc: Will Newton <will.newton@imgtec.com> Signed-off-by: Chris Ball <cjb@laptop.org>
2012-11-07mmc: sdhci-of-esdhc: disable CMD23 for some Freescale SoCsJerry Huang1-0/+1
CMD23 causes lots of errors in the kernel on some Freescale SoCs (P1020, P1021, P1022, P1024, P1025 and P4080) when an MMC card is used, because these controllers do not support CMD23, even on the SoCs which declare that CMD23 is supported. Therefore, we'll not use CMD23. Signed-off-by: Jerry Huang <Chang-Ming.Huang@freescale.com> Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com> Acked-by: Anton Vorontsov <cbouatmailru@gmail.com> Signed-off-by: Chris Ball <cjb@laptop.org>
2012-11-07mmc: dw_mmc: convert the variable type of irqSeungwon Jeon1-1/+1
Even when platform_get_irq() returns an error, 'host->irq' always holds an unsigned value, and a less-than-zero comparison of an unsigned value is never true. Change the type of 'irq' from 'unsigned int' to 'int'. Signed-off-by: Seungwon Jeon <tgih.jun@samsung.com> Acked-by: Will Newton <will.newton@imgtec.com> Signed-off-by: Chris Ball <cjb@laptop.org>
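The pitfall can be shown with a minimal user-space sketch. The `model_` functions are hypothetical stand-ins (the first merely imitates a lookup that fails with a negative errno); a compiler may warn that the unsigned comparison is always false, which is exactly the point.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an IRQ lookup that fails with a negative errno. */
static int model_get_irq_fails(void)
{
	return -ENXIO;
}

/* Buggy pattern: the error is stored in an unsigned variable. */
static bool model_error_seen_unsigned(void)
{
	unsigned int irq = (unsigned int)model_get_irq_fails();

	return irq < 0;	/* never true for an unsigned value */
}

/* Fixed pattern: a signed variable preserves the negative error code. */
static bool model_error_seen_signed(void)
{
	int irq = model_get_irq_fails();

	return irq < 0;
}
```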
2012-11-07drivers: bus: ocp2scp: add pdata supportKishon Vijay Abraham I1-0/+31
ocp2scp had no pdata support, which makes *musb* fail for non-dt boot on OMAP platforms. The pdata will have information about the devices that are connected to ocp2scp. The ocp2scp driver will now make use of this information to create the devices that are attached to ocp2scp. This is needed to fix the MUSB regression caused by commit c9e4412a (arm: omap: phy: remove unused functions from omap-phy-internal.c) Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> Acked-by: Felipe Balbi <balbi@ti.com> [tony@atomide.com: updated comments for regression info] Signed-off-by: Tony Lindgren <tony@atomide.com>
2012-11-07xen/hvm: If we fail to fetch an HVM parameter print out which flag it is.Konrad Rzeszutek Wilk1-2/+32
Makes it easier to troubleshoot in the field. Acked-by: Ian Campbell <ian.campbell@citrix.com> [v1: Use macro per Ian's suggestion] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>