path: root/kernel
Age | Commit message | Author | Files | Lines
2013-02-08 | uprobes/tracing: Kill uprobe_trace_consumer, embed uprobe_consumer into trace_uprobe | Oleg Nesterov | 1 | -29/+6
trace_uprobe->consumer and "struct uprobe_trace_consumer" add the unnecessary indirection and complicate the code for no reason. This patch simply embeds uprobe_consumer into "struct trace_uprobe", all other changes only fix the compilation errors. Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-02-08 | uprobes/tracing: Introduce is_trace_uprobe_enabled() | Oleg Nesterov | 2 | -3/+7
probe_event_enable/disable() check tu->consumer != NULL to avoid the wrong uprobe_register/unregister(). We are going to kill this pointer and "struct uprobe_trace_consumer", so we add the new helper, is_trace_uprobe_enabled(), which can rely on TP_FLAG_TRACE/TP_FLAG_PROFILE instead. Note: the current logic doesn't look optimal, it is not clear why TP_FLAG_TRACE/TP_FLAG_PROFILE are mutually exclusive, we will probably change this later. Also kill the unused TP_FLAG_UPROBE. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes/tracing: Ensure inode != NULL in create_trace_uprobe() | Oleg Nesterov | 1 | -3/+3
probe_event_enable/disable() check tu->inode != NULL at the start. This is ugly; if igrab() can fail, create_trace_uprobe() should not succeed and "postpone" the failure. And the S_ISREG(inode->i_mode) check added by d24d7dbf is not safe. Note: alloc_uprobe() should probably check igrab() != NULL as well. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes/tracing: Fully initialize uprobe_trace_consumer before uprobe_register() | Oleg Nesterov | 1 | -6/+7
probe_event_enable() does uprobe_register() and only after that sets utc->tu and tu->consumer/flags. This can race with uprobe_dispatcher() which can miss these assignments or see them out of order. Nothing really bad can happen, but this doesn't look clean/safe. And this does not allow to use uprobe_consumer->filter() we are going to add, it is called by uprobe_register() and it needs utc->tu. Change this code to initialize everything before uprobe_register(), and reset tu->consumer/flags if it fails. We can't race with event_disable(), the caller holds event_mutex, and if we could the code would be wrong anyway. In fact I think uprobe_trace_consumer should die, it buys nothing but complicates the code. We can simply add uprobe_consumer into trace_uprobe. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes/tracing: Fix dentry/mount leak in create_trace_uprobe() | Oleg Nesterov | 1 | -4/+6
create_trace_uprobe() does kern_path() to find ->d_inode, but forgets to do path_put(). We can do this right after igrab(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Add exports for module use | Josh Stone | 2 | -0/+9
The original pull message for uprobes (commit 654443e2) noted: This tree includes uprobes support in 'perf probe' - but SystemTap (and other tools) can take advantage of user probe points as well. In order to actually be usable in module-based tools like SystemTap, the interface needs to be exported. This patch first adds the obvious exports for uprobe_register and uprobe_unregister. Then it also adds one for task_user_regset_view, which is necessary to get the correct state of userspace registers. Signed-off-by: Josh Stone <jistone@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
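For reference, a minimal sketch of what these exports let a module do with the 3.9-era API (the probed inode/offset come from wherever the tool resolves them; the names below are illustrative, not taken from the patch):

    #include <linux/module.h>
    #include <linux/uprobes.h>
    #include <linux/ptrace.h>

    static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
    {
            pr_info("uprobe hit, ip=%lx\n", instruction_pointer(regs));
            return 0;
    }

    static struct uprobe_consumer my_consumer = {
            .handler = my_handler,
    };

    /* attach/detach a probe at inode:offset, e.g. resolved from a binary's symbol table */
    static int my_attach(struct inode *inode, loff_t offset)
    {
            return uprobe_register(inode, offset, &my_consumer);
    }

    static void my_detach(struct inode *inode, loff_t offset)
    {
            uprobe_unregister(inode, offset, &my_consumer);
    }

task_user_regset_view() is what such a module would use to interpret the probed task's register state correctly (e.g. a 32-bit task on a 64-bit kernel).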
2013-02-08 | uprobes: Kill the bogus IS_ERR_VALUE(xol_vaddr) check | Oleg Nesterov | 1 | -2/+1
utask->xol_vaddr is either zero or valid, remove the bogus IS_ERR_VALUE() check in xol_free_insn_slot(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Do not allocate current->utask unnecessary | Oleg Nesterov | 1 | -10/+6
handle_swbp() does get_utask() before can_skip_sstep() for no reason, we do not need ->utask if can_skip_sstep() succeeds. Move get_utask() to pre_ssout() who actually starts to use it. Move the initialization of utask->active_uprobe/state as well. This way the whole initialization is consolidated in pre_ssout(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com>
2013-02-08 | uprobes: Fix utask->xol_vaddr leak in pre_ssout() | Oleg Nesterov | 1 | -1/+8
pre_ssout() should do xol_free_insn_slot() if arch_uprobe_pre_xol() fails, otherwise nobody will free the allocated slot. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Do not play with utask in xol_get_insn_slot() | Oleg Nesterov | 1 | -16/+21
pre_ssout()->xol_get_insn_slot() path is confusing and buggy. This patch cleanups the code, the next one fixes the bug. Change xol_get_insn_slot() to only allocate the slot and do nothing more, move the initialization of utask->xol_vaddr/vaddr into pre_ssout(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Turn add_utask() into get_utask() | Oleg Nesterov | 1 | -18/+9
Rename add_utask() into get_utask() and change it to allocate on demand to simplify the caller. Like get_xol_area() it will have more users. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
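The allocate-on-demand shape described here boils down to roughly the following (sketch; error handling is just the NULL return from kzalloc()):

    static struct uprobe_task *get_utask(void)
    {
            if (!current->utask)
                    current->utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
            return current->utask;
    }

Callers then only have to check for NULL instead of open-coding the allocate-if-missing dance.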
2013-02-08 | uprobes: Fold xol_alloc_area() into get_xol_area() | Oleg Nesterov | 1 | -22/+16
Currently only xol_get_insn_slot() does get_xol_area() + xol_alloc_area(), but this will have more users and we do not want to copy-and-paste this code. This patch simply moves xol_alloc_area() into get_xol_area() to simplify the current and future code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Move alloc_page() from xol_add_vma() to xol_alloc_area() | Oleg Nesterov | 1 | -19/+13
Move alloc_page() from xol_add_vma() to xol_alloc_area() to cleanup the code. This separates the memory allocations and consolidates the -EALREADY cleanups and the error handling. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Change handle_swbp() to expose bp_vaddr to handler_chain() | Oleg Nesterov | 2 | -10/+9
Change handle_swbp() to set regs->ip = bp_vaddr in advance, this is what consumer->handler() needs but uprobe_get_swbp_addr() is not exported. This also simplifies the code and makes it more consistent across the supported architectures. handle_swbp() becomes the only caller of uprobe_get_swbp_addr(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
2013-02-08 | uprobes: Teach handler_chain() to filter out the probed task | Oleg Nesterov | 1 | -10/+48
Currently there are 2 problems with pre-filtering:
1. It is not possible to add/remove a task (mm) after uprobe_register().
2. A forked child inherits all breakpoints and uprobe_consumer can not control this.
This patch does the first step to improve the filtering: handler_chain() removes the breakpoints installed by this uprobe from current->mm if all handlers return UPROBE_HANDLER_REMOVE. Note that handler_chain() relies on ->register_rwsem to avoid the race with uprobe_register/unregister which can add/del a consumer, or even remove and then insert the new uprobe at the same address. Perhaps we will add uprobe_apply_mm(uprobe, mm, is_register) and teach copy_mm() to do filter(UPROBE_FILTER_FORK), but I think this change makes sense anyway. Note: instead of checking the retcode from uc->handler, we could add uc->filter(UPROBE_FILTER_BPHIT). But it is not optimal to call 2 hooks in a row; this buys nothing, and if handler/filter do something nontrivial they will probably do the same work twice. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
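Roughly, the new tail of handler_chain() looks like the sketch below; unapply_uprobe() stands in for whatever helper removes this uprobe's breakpoints from current->mm (name assumed, locking simplified):

    static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
    {
            struct uprobe_consumer *uc;
            int remove = UPROBE_HANDLER_REMOVE;

            down_read(&uprobe->register_rwsem);
            for (uc = uprobe->consumers; uc; uc = uc->next) {
                    int rc = uc->handler(uc, regs);

                    /* keep the breakpoints unless every handler asks to remove them */
                    remove &= rc;
            }

            if (remove && uprobe->consumers)
                    unapply_uprobe(uprobe, current->mm);    /* assumed helper name */
            up_read(&uprobe->register_rwsem);
    }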
2013-02-08 | uprobes: Reintroduce uprobe_consumer->filter() | Oleg Nesterov | 1 | -7/+11
Finally add uprobe_consumer->filter() and change consumer_filter() to actually call this method. Note that ->filter() accepts mm_struct, not task_struct. Because:
1. We do not have for_each_mm_user(mm, task).
2. Even if we implement for_each_mm_user(), ->filter() can use it itself.
3. It is not clear who will actually need this interface to do the "nontrivial" filtering.
Another argument is "enum uprobe_filter_ctx": consumer->filter() can use it to figure out why/where it was called. For example, perhaps we can add UPROBE_FILTER_PRE_REGISTER used by build_map_info() to quickly "nack" the unwanted mm's; in this case the consumer should know that it is called under ->i_mmap_mutex. See the previous discussion at http://marc.info/?t=135214229700002 Perhaps we should pass more arguments, vma/vaddr? Note: this patch obviously can't help to filter out the child created by fork(); this will be addressed later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
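The reintroduced callback looks roughly like this (the exact set of uprobe_filter_ctx values is an assumption beyond the register/unregister/mmap cases implied by this series):

    enum uprobe_filter_ctx {
            UPROBE_FILTER_REGISTER,
            UPROBE_FILTER_UNREGISTER,
            UPROBE_FILTER_MMAP,
    };

    struct uprobe_consumer {
            int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
            bool (*filter)(struct uprobe_consumer *self, enum uprobe_filter_ctx ctx,
                           struct mm_struct *mm);
            struct uprobe_consumer *next;
    };

    /* example consumer-side filter: only instrument one known mm (my_target_mm is hypothetical) */
    static bool my_filter(struct uprobe_consumer *self, enum uprobe_filter_ctx ctx,
                          struct mm_struct *mm)
    {
            return mm == my_target_mm;
    }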
2013-02-08 | uprobes: Rationalize the usage of filter_chain() | Oleg Nesterov | 1 | -23/+21
filter_chain() was added into install_breakpoint/remove_breakpoint to simplify the initial changes but this is sub-optimal. This patch shifts the callsite to the callers, register_for_each_vma() and uprobe_mmap(). This way:
- It will be easier to add the new arguments. This is the main reason; we can do more optimizations later.
- register_for_each_vma(is_register => true) can be optimized, we only need to consult the new consumer. The previous consumers were already asked when they called uprobe_register().
This patch also moves the MMF_HAS_UPROBES check from remove_breakpoint(); this allows us to avoid the potentially costly filter_chain(). Note that register_for_each_vma(is_register => false) doesn't really need to take ->consumer_rwsem, but I don't think it makes sense to optimize this and introduce filter_chain_lockless(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Kill uprobes_mutex[], separate alloc_uprobe() and __uprobe_register() | Oleg Nesterov | 1 | -36/+15
uprobe_register() and uprobe_unregister() are the only users of mutex_lock(uprobes_hash(inode)), and the only reason why we can't simply remove it is that we need to ensure that delete_uprobe() is not possible after alloc_uprobe() and before consumer_add(). IOW, we need to ensure that when we take uprobe->register_rwsem this uprobe is still valid and we didn't race with _unregister() which called delete_uprobe() in between. With this patch uprobe_register() simply checks uprobe_is_active() and retries if it hits this very unlikely race. uprobes_mutex[] is no longer needed and can be removed. There is another reason for this change, prepare_uprobe() should be folded into alloc_uprobe() and we do not want to hold the extra locks around read_mapping_page/etc. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Introduce uprobe_is_active() | Oleg Nesterov | 1 | -0/+8
The lifetime of uprobe->rb_node and uprobe->inode is not refcounted, delete_uprobe() is called when we detect that uprobe has no consumers, and it would be deadly wrong to do this twice. Change delete_uprobe() to WARN() if it was already called. We use RB_CLEAR_NODE() to mark uprobe "inactive", then RB_EMPTY_NODE() can be used to detect this case. RB_EMPTY_NODE() is not used directly, we add the trivial helper for the next change. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
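In code, the active/inactive marking is roughly the following (tree-locking details assumed):

    static bool uprobe_is_active(struct uprobe *uprobe)
    {
            return !RB_EMPTY_NODE(&uprobe->rb_node);
    }

    static void delete_uprobe(struct uprobe *uprobe)
    {
            if (WARN_ON(!uprobe_is_active(uprobe)))
                    return;                         /* deleted twice, complain */

            spin_lock(&uprobes_treelock);
            rb_erase(&uprobe->rb_node, &uprobes_tree);
            spin_unlock(&uprobes_treelock);
            RB_CLEAR_NODE(&uprobe->rb_node);        /* mark "inactive" for uprobe_is_active() */
            iput(uprobe->inode);
            put_uprobe(uprobe);
    }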
2013-02-08 | uprobes: Kill uprobe_events, use RB_EMPTY_ROOT() instead | Oleg Nesterov | 1 | -12/+7
uprobe_events counts the number of uprobes in uprobes_tree but it is used as a boolean. We can use RB_EMPTY_ROOT() instead. Probably no_uprobe_events() added by this patch can have more callers, say, mmf_recalc_uprobes(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
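The replacement helper is essentially a one-liner over the rbtree root:

    static bool no_uprobe_events(void)
    {
            return RB_EMPTY_ROOT(&uprobes_tree);
    }

Fast paths such as uprobe_mmap() or mmf_recalc_uprobes() can then bail out early when the tree is empty.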
2013-02-08 | uprobes: Kill uprobe->copy_mutex | Oleg Nesterov | 1 | -4/+3
Now that ->register_rwsem is safe under ->mmap_sem we can kill ->copy_mutex and abuse down_write(&uprobe->consumer_rwsem). This makes prepare_uprobe() even more ugly, but we should kill it anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Kill UPROBE_RUN_HANDLER flag | Oleg Nesterov | 1 | -18/+5
Simply remove UPROBE_RUN_HANDLER and the corresponding code. It can only help if uprobe has a single consumer, and in fact it is no longer needed after handler_chain() was changed to use ->register_rwsem, we simply can not race with uprobe_register(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Change filter_chain() to iterate ->consumers list | Oleg Nesterov | 1 | -8/+13
Now that it is safe to use ->consumer_rwsem under ->mmap_sem we can almost finish the implementation of filter_chain(). It still lacks the actual uc->filter(...) call but otherwise it is ready; it just pretends that ->filter() always returns true. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
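At this stage filter_chain() walks the list under ->consumer_rwsem but still hard-codes a "yes" for every consumer, roughly:

    static bool filter_chain(struct uprobe *uprobe)
    {
            struct uprobe_consumer *uc;
            bool ret = false;

            down_read(&uprobe->consumer_rwsem);
            for (uc = uprobe->consumers; uc; uc = uc->next) {
                    /* TODO: ret = uc->filter(...); for now pretend every consumer says yes */
                    ret = true;
                    if (ret)
                            break;
            }
            up_read(&uprobe->consumer_rwsem);

            return ret;
    }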
2013-02-08 | uprobes: Introduce uprobe->register_rwsem | Oleg Nesterov | 1 | -2/+8
Introduce uprobe->register_rwsem. It is taken for writing around __uprobe_register/unregister. Change handler_chain() to use this sem rather than consumer_rwsem. The main reason for this change is that we have the nasty problem with mmap_sem/consumer_rwsem dependency. filter_chain() needs to protect uprobe->consumers like handler_chain(), but they can not use the same lock. filter_chain() can be called under ->mmap_sem (currently this is always true), but we want to allow ->handler() to play with the probed task's memory, and this needs ->mmap_sem. Alternatively we could use srcu, but synchronize_srcu() is very slow and ->register_rwsem allows us to do more. In particular, we can teach handler_chain() to do remove_breakpoint() if this bp is "nacked" by all consumers, we know that we can't race with the new consumer which does uprobe_register(). See also the next patches. uprobes_mutex[] is almost ready to die. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: _register() should always do register_for_each_vma(true) | Oleg Nesterov | 1 | -18/+13
To support the filtering uprobe_register() should do register_for_each_vma(true) every time a new consumer comes; we need to install the previously nacked breakpoints. Note:
- uprobes_mutex[] should die, what it actually protects is alloc_uprobe().
- UPROBE_RUN_HANDLER should die too, obviously it can't work unless uprobe has a single consumer. The consumer should serialize with _register/_unregister itself. Or this flag should live in uprobe_consumer->state.
- Perhaps we can do some optimizations later. For example, if filter_chain() never returns false uprobe can record this fact and avoid the unnecessary register_for_each_vma().
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: _unregister() should always do register_for_each_vma(false) | Oleg Nesterov | 1 | -14/+14
uprobe_unregister() removes the breakpoints only if the last consumer goes away. To support the filtering it should do this every time, we want to remove the breakpoints which nobody else want to keep. Note: given that filter_chain() is not actually implemented, this patch itself doesn't change the behaviour yet, register_for_each_vma(false) is a heavy "nop" unless there are no more consumers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Introduce filter_chain() | Oleg Nesterov | 1 | -5/+19
Add the new helper filter_chain(). Currently it is only a placeholder; the comment explains what it should do. We will change it later to consult every consumer to decide whether we need to install the swbp. Until then it works as if any consumer returns true, which matches the current behavior. Change install_breakpoint() to call filter_chain() instead of checking uprobe->consumers != NULL. We obviously need this, and this equally closes the race with _unregister(). Change remove_breakpoint() to call this helper too. Currently this is pointless because remove_breakpoint() is only called when the last consumer goes away, but we will change this. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Kill uprobe_consumer->filter() | Oleg Nesterov | 2 | -5/+2
uprobe_consumer->filter() is pointless in its current form, kill it. We will add it back, but with the different signature/semantics. Perhaps we will even re-introduce the callsite in handler_chain(), but not to just skip uc->handler(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Kill the pointless inode/uc checks in register/unregister | Oleg Nesterov | 1 | -6/+1
register/unregister verifies that inode/uc != NULL. For what? This really looks like "hide the potential problem", the caller should pass the valid data. register() also checks uc->next == NULL, probably to prevent the double-register but the caller can do other stupid/wrong things. If we do this check, then we should document that uc->next should be cleared before register() and add BUG_ON(). Also add the small comment about the i_size_read() check. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | uprobes: Move __set_bit(UPROBE_SKIP_SSTEP) into alloc_uprobe() | Oleg Nesterov | 1 | -3/+2
Cosmetic. __set_bit(UPROBE_SKIP_SSTEP) is the part of initialization, it is not clear why it is set in insert_uprobe(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 | timeconst.pl: Eliminate Perl warning | H. Peter Anvin | 1 | -4/+2
defined(@array) is deprecated in Perl and gives off a warning. Restructure the code to remove that warning. [ hpa: it would be interesting to revert to the timeconst.bc script. It appears that the failures reported by akpm during testing of that script were due to a known broken version of make, not a problem with bc. The Makefile rules could probably be restructured to avoid the make bug, or it is probably old enough that it doesn't matter. ] Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org>
2013-02-08 | srcu: use ACCESS_ONCE() to access sp->completed in srcu_read_lock() | Lai Jiangshan | 1 | -2/+1
The old SRCU implementation loads sp->completed within an RCU-sched section, courtesy of preempt_disable(). This was required due to the use of synchronize_sched() in the old implementation's synchronize_srcu(). However, the new implementation does not rely on synchronize_sched(), so it in turn does not require the load of sp->completed and the ->c[] counter to be in a single preempt-disabled region of code. This commit therefore moves the sp->completed access outside of the preempt-disabled region and applies ACCESS_ONCE(). The resulting code is almost the same as before, but it removes the now-misleading rcu_dereference_index_check() call. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-02-08 | srcu: Update synchronize_srcu_expedited()'s comments | Lai Jiangshan | 1 | -6/+5
Because synchronize_srcu_expedited() no longer uses synchronize_rcu_sched_expedited(), synchronize_srcu_expedited() no longer indirectly acquires any CPU-hotplug-related locks. This commit therefore updates the comments accordingly. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-02-08 | srcu: Update synchronize_srcu()'s comments | Lai Jiangshan | 1 | -4/+6
The core of SRCU is changed, but synchronize_srcu()'s comments describe the old algorithm. This commit therefore updates them to match the new algorithm. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-02-08 | srcu: Simple cleanup for cleanup_srcu_struct() | Lai Jiangshan | 1 | -6/+2
Pack six lines of code into two lines. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-02-08 | srcu: Add might_sleep() annotation to synchronize_srcu() | Lai Jiangshan | 1 | -0/+1
Although synchronize_srcu() can sleep, it will not sleep if the fast path succeeds, which means that illegal use of synchronize_srcu() might go unnoticed. This commit therefore adds might_sleep(), which unconditionally catches illegal use of synchronize_srcu() from atomic context. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
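The change itself is a single unconditional annotation at the top of the function; the try-count argument below is only assumed from the surrounding code:

    void synchronize_srcu(struct srcu_struct *sp)
    {
            might_sleep();  /* fires in atomic context even when the fast path would not sleep */
            __synchronize_srcu(sp, SYNCHRONIZE_SRCU_TRYCOUNT);
    }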
2013-02-08 | srcu: Simplify __srcu_read_unlock() via this_cpu_dec() | Lai Jiangshan | 1 | -3/+1
This commit replaces disabling of preemption and decrement of a per-CPU variable with this_cpu_dec(), which avoids preemption disabling on x86 and shortens the code on all platforms. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
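The unlock path then shrinks to roughly the following (the per_cpu_ref->c[] counter layout is assumed):

    /* before: preempt_disable(); __this_cpu_dec(sp->per_cpu_ref->c[idx]); preempt_enable(); */
    void __srcu_read_unlock(struct srcu_struct *sp, int idx)
    {
            smp_mb();       /* avoid leaking the critical section */
            this_cpu_dec(sp->per_cpu_ref->c[idx]);
    }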
2013-02-08 | workqueue: pick cwq instead of pool in __queue_work() | Lai Jiangshan | 1 | -16/+13
Currently, __queue_work() chooses the pool to queue a work item to and then determines cwq from the target wq and the chosen pool. This is a bit backwards in that we can determine cwq first and simply use cwq->pool. This way, we can skip get_std_worker_pool() in queueing path which will be a hurdle when implementing custom worker pools. Update __queue_work() such that it chooses the target cwq and then use cwq->pool instead of the other way around. While at it, add missing {} in an if statement. This patch doesn't introduce any functional changes. tj: The original patch had two get_cwq() calls - the first to determine the pool by doing get_cwq(cpu, wq)->pool and the second to determine the matching cwq from get_cwq(pool->cpu, wq). Updated the function such that it chooses cwq instead of pool and removed the second call. Rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-08 | workqueue: make get_work_pool_id() cheaper | Lai Jiangshan | 1 | -2/+6
get_work_pool_id() currently first obtains pool using get_work_pool() and then returns pool->id. For an off-queue work item, this involves obtaining the pool ID from work->data, performing idr_find() to find the matching pool and then returning its pool->id, which of course is the same as the one which went into idr_find(). Just open code the WORK_STRUCT_CWQ case and directly return the pool ID from work->data. tj: The original patch dropped on-queue work item handling and renamed the function to offq_work_pool_id(). There isn't much benefit in doing so. Handling it only requires a single if() and we need at least BUG_ON(), which is also a branch, even if we drop on-queue handling. Open code WORK_STRUCT_CWQ case and keep the function in line with get_work_pool(). Rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
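Approximately, the open-coded version reads as below (mask/shift names as used elsewhere in the workqueue code; details assumed):

    static int get_work_pool_id(struct work_struct *work)
    {
            unsigned long data = atomic_long_read(&work->data);

            if (data & WORK_STRUCT_CWQ)
                    return ((struct cpu_workqueue_struct *)
                            (data & WORK_STRUCT_WQ_DATA_MASK))->pool->id;

            /* off-queue: the pool ID is stored directly in work->data */
            return data >> WORK_OFFQ_POOL_SHIFT;
    }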
2013-02-08 | workqueue: move nr_running into worker_pool | Tejun Heo | 1 | -41/+22
As nr_running is likely to be accessed from other CPUs during try_to_wake_up(), it was kept outside worker_pool; however, while less frequent, other fields in worker_pool are accessed from other CPUs for, e.g., non-reentrancy check. Also, with recent pool related changes, accessing nr_running matching the worker_pool isn't as simple as it used to be. Move nr_running inside worker_pool. Keep it aligned to cacheline and define CPU pools using DEFINE_PER_CPU_SHARED_ALIGNED(). This should give at least the same cacheline behavior. get_pool_nr_running() is replaced with direct pool->nr_running accesses. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com>
2013-02-07 | sched/rt: Move rt specific bits into new header file | Clark Williams | 12 | -1/+13
Move rt scheduler definitions out of include/linux/sched.h into new file include/linux/sched/rt.h Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07 | sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice | Clark Williams | 3 | -2/+30
Add a /proc/sys/kernel scheduler knob named sched_rr_timeslice_ms that allows global changing of the SCHED_RR timeslice value. The user-visible value is in milliseconds but is stored as jiffies. Setting it to 0 (zero) resets to the default (currently 100ms). Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094704.13751796@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
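A sketch of the ms-to-jiffies handling behind the knob (handler shape and names are assumptions, not a quote of the patch):

    int sched_rr_timeslice = RR_TIMESLICE;          /* kept in jiffies internally */

    int sched_rr_handler(struct ctl_table *table, int write,
                         void __user *buffer, size_t *lenp, loff_t *ppos)
    {
            int ret = proc_dointvec(table, write, buffer, lenp, ppos);

            if (!ret && write) {
                    /* 0 or negative resets to the default, otherwise convert ms -> jiffies */
                    sched_rr_timeslice = sched_rr_timeslice <= 0 ?
                            RR_TIMESLICE : msecs_to_jiffies(sched_rr_timeslice);
            }
            return ret;
    }

From userspace the knob is simply written in milliseconds via /proc/sys/kernel/sched_rr_timeslice_ms.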
2013-02-07 | sched: Move sched.h sysctl bits into separate header | Clark Williams | 4 | -0/+4
Move the sysctl-related bits from include/linux/sched.h into a new file: include/linux/sched/sysctl.h. Then update source files requiring access to those bits by including the new header file. Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07 | workqueue: cosmetic update in try_to_grab_pending() | Tejun Heo | 1 | -21/+17
With the recent is-work-queued-here test simplification, the nested if() in try_to_grab_pending() can be collapsed. Collapse it. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-02-07 | workqueue: simplify is-work-item-queued-here test | Lai Jiangshan | 1 | -24/+16
Currently, determining whether a work item is queued on a locked pool involves somewhat convoluted memory barrier dancing. It goes like the following.
* When a work item is queued on a pool, work->data is updated before work->entry is linked to the pending list, with a wmb() in between.
* When trying to determine whether a work item is currently queued on a pool pointed to by work->data, it locks the pool and looks at work->entry. If work->entry is linked, we then do rmb() and then check whether work->data points to the current pool.
This works because work->data can only point to a pool if the work item currently is or was on that pool, and:
* If it currently is on the pool, the tests would obviously succeed.
* If it left the pool, its work->entry was cleared under pool->lock, so if we're seeing non-empty work->entry, it has to be from the work item being linked on another pool. Because work->data is updated before work->entry is linked with wmb() in between, the work->data update from another pool is guaranteed to be visible if we do rmb() after seeing non-empty work->entry. So, we either see empty work->entry or we see updated work->data pointing to another pool.
While this works, it's convoluted, to put it mildly. With recent updates, it's now guaranteed that work->data points to cwq only while the work item is queued and that updating work->data to point to cwq or back to pool is done under pool->lock, so we can simply test whether work->data points to a cwq which is associated with the currently locked pool instead of doing the convoluted memory barrier dancing. This patch replaces the memory barrier based "are you still here, really?" test with the much simpler "does work->data point to me?" test: if work->data points to a cwq which is associated with the currently locked pool, the work item is guaranteed to be queued on the pool, as work->data can start and stop pointing to such a cwq only under pool->lock and the start and stop coincide with queue and dequeue. tj: Rewrote the comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
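The simplified test boils down to a pointer comparison under pool->lock; the wrapper name below is hypothetical and only makes the invariant concrete:

    /* caller holds pool->lock */
    static bool work_queued_on_pool(struct worker_pool *pool, struct work_struct *work)
    {
            struct cpu_workqueue_struct *cwq = get_work_cwq(work);

            /*
             * work->data points to a cwq only while the item is queued, and it
             * can start/stop pointing to this pool's cwq only under pool->lock.
             */
            return cwq && cwq->pool == pool;
    }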
2013-02-07 | workqueue: make work->data point to pool after try_to_grab_pending() | Lai Jiangshan | 1 | -0/+10
We plan to use work->data pointing to cwq as the synchronization invariant when determining whether a given work item is on a locked pool or not, which requires work->data pointing to cwq only while the work item is queued on the associated pool. With delayed_work updated not to overload work->data for target workqueue recording, the only case where we still have off-queue work->data pointing to cwq is try_to_grab_pending() which doesn't update work->data after stealing a queued work item. There's no reason for try_to_grab_pending() to not update work->data to point to the pool instead of cwq, like the normal execution does. This patch adds set_work_pool_and_keep_pending() which makes work->data point to pool instead of cwq but keeps the pending bit unlike set_work_pool_and_clear_pending() (surprise!). After this patch, it's guaranteed that only queued work items point to cwqs. This patch doesn't introduce any visible behavior change. tj: Renamed the new helper function to match set_work_pool_and_clear_pending() and rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
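A sketch of the new helper next to its clear-pending sibling (flag and shift names as used in this series):

    static void set_work_pool_and_keep_pending(struct work_struct *work, int pool_id)
    {
            set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT,
                          WORK_STRUCT_PENDING);
    }

try_to_grab_pending() calls it right after stealing the item, so a successfully grabbed work item no longer leaves a stale cwq pointer behind in work->data.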
2013-02-07 | workqueue: add delayed_work->wq to simplify reentrancy handling | Lai Jiangshan | 1 | -29/+3
To avoid executing the same work item from multiple CPUs concurrently, a work_struct records the last pool it was on in its ->data so that, on the next queueing, the pool can be queried to determine whether the work item is still executing or not. A delayed_work goes through a timer before actually being queued on the target workqueue and the timer needs to know the target workqueue and CPU. This is currently achieved by modifying delayed_work->work.data such that it points to the cwq which points to the target workqueue and the last CPU the work item was on. __queue_delayed_work() extracts the last CPU from delayed_work->work.data and then combines it with the target workqueue to create new work.data. The only thing this rather ugly hack achieves is encoding the target workqueue into delayed_work->work.data without using a separate field, which could be a trade off one can make; unfortunately, this entangles work->data management between regular workqueue and delayed_work code by setting cwq pointer before the work item is actually queued and becomes a hindrance for further improvements of work->data handling. This can be easily made sane by adding a target workqueue field to delayed_work. While delayed_work is used widely in the kernel and this does make it a bit larger (<5%), I think this is the right trade-off especially given the prospect of much saner handling of work->data which currently involves quite tricky memory barrier dancing, and don't expect to see any measurable effect. Add delayed_work->wq and drop the delayed_work->work.data overloading. tj: Rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
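Roughly, the new field and what the timer callback can do with it (the cpu member is assumed to exist already; other details elided):

    struct delayed_work {
            struct work_struct work;
            struct timer_list timer;
            struct workqueue_struct *wq;    /* new: target wq, no longer smuggled via work.data */
            int cpu;
    };

    void delayed_work_timer_fn(unsigned long __data)
    {
            struct delayed_work *dwork = (struct delayed_work *)__data;

            __queue_work(dwork->cpu, dwork->wq, &dwork->work);
    }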
2013-02-07 | workqueue: make work_busy() test WORK_STRUCT_PENDING first | Lai Jiangshan | 1 | -10/+6
Currently, work_busy() first tests whether the work has a pool associated with it and if not, considers it idle. This works fine even for delayed_work.work queued on a timer, as __queue_delayed_work() sets cwq on delayed_work.work - a queued delayed_work always has its cwq and thus pool associated with it. However, we're about to update delayed_work queueing and this won't hold. Update work_busy() such that it tests WORK_STRUCT_PENDING before the associated pool. This doesn't make any noticeable behavior difference now. With the work_pending() test moved, the function reads a lot better with the "if (!pool)" test flipped to positive. Flip it. While at it, lose the comment about now non-existent reentrant workqueues. tj: Reorganized the function and rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
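After the reorder the function reads roughly as follows (sketch):

    unsigned int work_busy(struct work_struct *work)
    {
            struct worker_pool *pool = get_work_pool(work);
            unsigned long flags;
            unsigned int ret = 0;

            if (work_pending(work))
                    ret |= WORK_BUSY_PENDING;

            if (pool) {
                    spin_lock_irqsave(&pool->lock, flags);
                    if (find_worker_executing_work(pool, work))
                            ret |= WORK_BUSY_RUNNING;
                    spin_unlock_irqrestore(&pool->lock, flags);
            }

            return ret;
    }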
2013-02-07 | workqueue: replace WORK_CPU_NONE/LAST with WORK_CPU_END | Lai Jiangshan | 1 | -5/+5
Now that workqueue has moved away from gcwqs, workqueue no longer has the need to have a CPU identifier indicating "no cpu associated" - we now use WORK_OFFQ_POOL_NONE instead - and most uses of WORK_CPU_NONE are gone. The only left usage is as the end marker for for_each_*wq*() iterators, where the name WORK_CPU_NONE is confusing w/o actual WORK_CPU_NONE usages. Similarly, WORK_CPU_LAST which equals WORK_CPU_NONE no longer makes sense. Replace both WORK_CPU_NONE and LAST with WORK_CPU_END. This patch doesn't introduce any functional difference. tj: s/WORK_CPU_LAST/WORK_CPU_END/ and rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 | Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu | Ingo Molnar | 1 | -7/+16
Pull RCU build fixlet from Paul E. McKenney. Signed-off-by: Ingo Molnar <mingo@kernel.org>