summaryrefslogtreecommitdiff
path: root/include/linux/cgroup.h
AgeCommit message (Collapse)AuthorFilesLines
2013-05-24cgroup: fix a subtle bug in descendant pre-order walkTejun Heo1-1/+1
When cgroup_next_descendant_pre() initiates a walk, it checks whether the subtree root doesn't have any children and if not returns NULL. Later code assumes that the subtree isn't empty. This is broken because the subtree may become empty inbetween, which can lead to the traversal escaping the subtree by walking to the sibling of the subtree root. There's no reason to have the early exit path. Remove it along with the later assumption that the subtree isn't empty. This simplifies the code a bit and fixes the subtle bug. While at it, fix the comment of cgroup_for_each_descendant_pre() which was incorrectly referring to ->css_offline() instead of ->css_online(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: stable@vger.kernel.org
2013-05-08aio: don't include aio.h in sched.hKent Overstreet1-0/+1
Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-02Merge branch 'for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...
2013-05-02take cgroup_open() and cpuset_open() to fs/proc/base.cAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-30Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "The main changes in this development cycle were: - full dynticks preparatory work by Frederic Weisbecker - factor out the cpu time accounting code better, by Li Zefan - multi-CPU load balancer cleanups and improvements by Joonsoo Kim - various smaller fixes and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits) sched: Fix init NOHZ_IDLE flag sched: Prevent to re-select dst-cpu in load_balance() sched: Rename load_balance_tmpmask to load_balance_mask sched: Move up affinity check to mitigate useless redoing overhead sched: Don't consider other cpus in our group in case of NEWLY_IDLE sched: Explicitly cpu_idle_type checking in rebalance_domains() sched: Change position of resched_cpu() in load_balance() sched: Fix wrong rq's runnable_avg update with rt tasks sched: Document task_struct::personality field sched/cpuacct/UML: Fix header file dependency bug on the UML build cgroup: Kill subsys.active flag sched/cpuacct: No need to check subsys active state sched/cpuacct: Initialize cpuacct subsystem earlier sched/cpuacct: Initialize root cpuacct earlier sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically sched/cpuacct: Clean up cpuacct.h sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field() sched/cpuacct: Remove redundant NULL checks in cpuacct_charge() sched/cpuacct: Add cpuacct_acount_field() sched/cpuacct: Add cpuacct_init() ...
2013-04-30Merge branch 'for-3.10' of ↵Linus Torvalds1-18/+152
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Fixes and a lot of cleanups. Locking cleanup is finally complete. cgroup_mutex is no longer exposed to individual controlelrs which used to cause nasty deadlock issues. Li fixed and cleaned up quite a bit including long standing ones like racy cgroup_path(). - device cgroup now supports proper hierarchy thanks to Aristeu. - perf_event cgroup now supports proper hierarchy. - A new mount option "__DEVEL__sane_behavior" is added. As indicated by the name, this option is to be used for development only at this point and generates a warning message when used. Unfortunately, cgroup interface currently has too many brekages and inconsistencies to implement a consistent and unified hierarchy on top. The new flag is used to collect the behavior changes which are necessary to implement consistent unified hierarchy. It's likely that this flag won't be used verbatim when it becomes ready but will be enabled implicitly along with unified hierarchy. The option currently disables some of broken behaviors in cgroup core and also .use_hierarchy switch in memcg (will be routed through -mm), which can be used to make very unusual hierarchy where nesting is partially honored. It will also be used to implement hierarchy support for blk-throttle which would be impossible otherwise without introducing a full separate set of control knobs. This is essentially versioning of interface which isn't very nice but at this point I can't see any other options which would allow keeping the interface the same while moving towards hierarchy behavior which is at least somewhat sane. The planned unified hierarchy is likely to require some level of adaptation from userland anyway, so I think it'd be best to take the chance and update the interface such that it's supportable in the long term. Maintaining the existing interface does complicate cgroup core but shouldn't put too much strain on individual controllers and I think it'd be manageable for the foreseeable future. Maybe we'll be able to drop it in a decade. Fix up conflicts (including a semantic one adding a new #include to ppc that was uncovered by header the file changes) as per Tejun. * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits) cpuset: fix compile warning when CONFIG_SMP=n cpuset: fix cpu hotplug vs rebuild_sched_domains() race cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn() cgroup: restore the call to eventfd->poll() cgroup: fix use-after-free when umounting cgroupfs cgroup: fix broken file xattrs devcg: remove parent_cgroup. memcg: force use_hierarchy if sane_behavior cgroup: remove cgrp->top_cgroup cgroup: introduce sane_behavior mount option move cgroupfs_root to include/linux/cgroup.h cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix cgroup: make cgroup_path() not print double slashes Revert "cgroup: remove bind() method from cgroup_subsys." perf: make perf_event cgroup hierarchical cgroup: implement cgroup_is_descendant() cgroup: make sure parent won't be destroyed before its children cgroup: remove bind() method from cgroup_subsys. devcg: remove broken_hierarchy tag cgroup: remove cgroup_lock_is_held() ...
2013-04-30cgroup: remove css_get_nextMichal Hocko1-7/+0
Now that we have generic and well ordered cgroup tree walkers there is no need to keep css_get_next in the place. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-19cgroup: fix broken file xattrsLi Zefan1-3/+0
We should store file xattrs in struct cfent instead of struct cftype, because cftype is a type while cfent is object instance of cftype. For example each cgroup has a tasks file, and each tasks file is associated with a uniq cfent, but all those files share the same struct cftype. Alexey Kodanev reported a crash, which can be reproduced: # mount -t cgroup -o xattr /sys/fs/cgroup # mkdir /sys/fs/cgroup/test # setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks # rmdir /sys/fs/cgroup/test # umount /sys/fs/cgroup oops! In this case, simple_xattrs_free() will free the same struct simple_xattrs twice. tj: Dropped unused local variable @cft from cgroup_diput(). Cc: <stable@vger.kernel.org> # 3.8.x Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-16memcg: force use_hierarchy if sane_behaviorTejun Heo1-0/+3
Turn on use_hierarchy by default if sane_behavior is specified and don't create .use_hierarchy file. It is debatable whether to remove .use_hierarchy file or make it ro as the former could make transition easier in certain cases; however, the behavior changes which will be gated by sane_behavior are intensive including changing basic meaning of certain control knobs in a few controllers and I don't really think keeping this piece would make things easier in any noticeable way, so let's remove it. v2: Explain that mem_cgroup_bind() doesn't have to worry about children as suggested by Michal Hocko. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2013-04-15cgroup: remove cgrp->top_cgroupLi Zefan1-1/+0
It's not used, and it can be retrieved via cgrp->root->top_cgroup. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-15cgroup: introduce sane_behavior mount optionTejun Heo1-0/+43
It's a sad fact that at this point various cgroup controllers are carrying so many idiosyncrasies and pure insanities that it simply isn't possible to reach any sort of sane consistent behavior while maintaining staying fully compatible with what already has been exposed to userland. As we can't break exposed userland interface, transitioning to sane behaviors can only be done in steps while maintaining backwards compatibility. This patch introduces a new mount option - __DEVEL__sane_behavior - which disables crazy features and enforces consistent behaviors in cgroup core proper and various controllers. As exactly which behaviors it changes are still being determined, the mount option, at this point, is useful only for development of the new behaviors. As such, the mount option is prefixed with __DEVEL__ and generates a warning message when used. Eventually, once we get to the point where all controller's behaviors are consistent enough to implement unified hierarchy, the __DEVEL__ prefix will be dropped, and more importantly, unified-hierarchy will enforce sane_behavior by default. Maybe we'll able to completely drop the crazy stuff after a while, maybe not, but we at least have a strategy to move on to saner behaviors. This patch introduces the mount option and changes the following behaviors in cgroup core. * Mount options "noprefix" and "clone_children" are disallowed. Also, cgroupfs file cgroup.clone_children is not created. * When mounting an existing superblock, mount options should match. This is currently pretty crazy. If one mounts a cgroup, creates a subdirectory, unmounts it and then mount it again with different option, it looks like the new options are applied but they aren't. * Remount is disallowed. The behaviors changes are documented in the comment above CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different controllers are converted and planned improvements progress. v2: Dropped unnecessary explicit file permission setting sane_behavior cftype entry as suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com>
2013-04-15move cgroupfs_root to include/linux/cgroup.hTejun Heo1-0/+57
While controllers shouldn't be accessing cgroupfs_root directly, it being hidden inside kern/cgroup.c makes somethings pretty silly. This makes routing hierarchy-wide settings which need to be visible to controllers cumbersome. We're gonna add another hierarchy-wide setting which needs to be accessed from controllers. Move cgroupfs_root and its flags to the header file so that we can access root settings with inline helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-12Revert "cgroup: remove bind() method from cgroup_subsys."Tejun Heo1-0/+2
This reverts commit 84cfb6ab484b442d5115eb3baf9db7d74a3ea626. There are scheduled changes which make use of the removed callback. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rami Rosen <ramirose@gmail.com> Cc: Li Zefan <lizefan@huawei.com>
2013-04-10cgroup: implement cgroup_is_descendant()Li Zefan1-0/+1
A couple controllers want to determine whether two cgroups are in ancestor/descendant relationship. As it's more likely that the descendant is the primary subject of interest and there are other operations focusing on the descendants, let's ask is_descendent rather than is_ancestor. Implementation is trivial as the previous patch guarantees that all ancestors of a cgroup stay accessible as long as the cgroup is accessible. tj: Removed depth optimization, renamed from cgroup_is_ancestor(), rewrote descriptions. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-10cgroup: remove bind() method from cgroup_subsys.Rami Rosen1-2/+0
The bind() method of cgroup_subsys is not used in any of the controllers (cpuset, freezer, blkio, net_cls, memcg, net_prio, devices, perf, hugetlb, cpu and cpuacct) tj: Removed the entry on ->bind() from Documentation/cgroups/cgroups.txt. Also updated a couple paragraphs which were suggesting that dynamic re-binding may be implemented. It's not gonna. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-10cgroup: Kill subsys.active flagLi Zefan1-1/+0
The only user was cpuacct. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5155385A.4040207@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-07cgroup: remove cgroup_lock_is_held()Tejun Heo1-4/+9
We don't want controllers to assume that the information is officially available and do funky things with it. The only user is task_subsys_state_check() which uses it to verify RCU access context. We can move cgroup_lock_is_held() inside CONFIG_PROVE_RCU but that doesn't add meaningful protection compared to conditionally exposing cgroup_mutex. Remove cgroup_lock_is_held(), export cgroup_mutex iff CONFIG_PROVE_RCU and use lockdep_is_held() directly on the mutex in task_subsys_state_check(). While at it, add parentheses around macro arguments in task_subsys_state_check(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07cgroup: unexport locking interface and cgroup_attach_task()Tejun Heo1-5/+0
Now that all external cgroup_lock() users are gone, we can finally unexport the locking interface and prevent future abuse of cgroup_mutex. Make cgroup_[un]lock() and cgroup_lock_live_group() static. Also, cgroup_attach_task() doesn't have any user left and can't be used without locking interface anyway. Make it static too. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07cgroup, cpuset: replace move_member_tasks_to_cpuset() with ↵Tejun Heo1-0/+1
cgroup_transfer_tasks() When a cpuset becomes empty (no CPU or memory), its tasks are transferred with the nearest ancestor with execution resources. This is implemented using cgroup_scan_tasks() with a callback which grabs cgroup_mutex and invokes cgroup_attach_task() on each task. Both cgroup_mutex and cgroup_attach_task() are scheduled to be unexported. Implement cgroup_transfer_tasks() in cgroup proper which is essentially the same as move_member_tasks_to_cpuset() except that it takes cgroups instead of cpusets and @to comes before @from like normal functions with those arguments, and replace move_member_tasks_to_cpuset() with it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-03-20cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()Li Zefan1-1/+2
These two functions share most of the code. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13cgroup: remove cgroup_is_descendant()Li Zefan1-3/+0
It was used by ns cgroup, and ns cgroup was removed long ago. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-05cgroup: avoid accessing modular cgroup subsys structure without lockingLi Zefan1-3/+14
subsys[i] is set to NULL in cgroup_unload_subsys() at modular unload, and that's protected by cgroup_mutex, and then the memory *subsys[i] resides will be freed. So this is unsafe without any locking: if (!ss || ss->module) ... v2: - add a comment for enum cgroup_subsys_id - simplify the comment in cgroup_exit() Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-04cgroup: fix cgroup_path() vs rename() raceLi Zefan1-0/+24
rename() will change dentry->d_name. The result of this race can be worse than seeing partially rewritten name, but we might access a stale pointer because rename() will re-allocate memory to hold a longer name. As accessing dentry->name must be protected by dentry->d_lock or parent inode's i_mutex, while on the other hand cgroup-path() can be called with some irq-safe spinlocks held, we can't generate cgroup path using dentry->d_name. Alternatively we make a copy of dentry->d_name and save it in cgrp->name when a cgroup is created, and update cgrp->name at rename(). v5: use flexible array instead of zero-size array. v4: - allocate root_cgroup_name and all root_cgroup->name points to it. - add cgroup_name() wrapper. v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path. v2: make cgrp->name RCU safe. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-01-25cgroup: remove synchronize_rcu() from cgroup_diput()Li Zefan1-0/+1
Free cgroup via call_rcu(). The actual work is done through workqueue. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-01-07cgroup: implement cgroup_rightmost_descendant()Tejun Heo1-0/+1
Implement cgroup_rightmost_descendant() which returns the right most descendant of the specified cgroup. This can be used to skip the cgroup's subtree while iterating with cgroup_for_each_descendant_pre(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07cgroup: remove unused dummy cgroup_fork_callbacks()Tejun Heo1-1/+0
5edee61ede ("cgroup: cgroup_subsys->fork() should be called after the task is added to css_set") removed cgroup_fork_callbacks() but forgot to remove its dummy version for !CONFIG_CGROUPS. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
2012-11-20cgroup: remove obsolete guarantee from cgroup_task_migrate.Tao Ma1-2/+2
'guarantee' is already removed from cgroup_task_migrate, so remove the corresponding comments. Some other typos in cgroup are also changed. Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-11-19cgroup: add cgroup->idTejun Heo1-0/+2
With the introduction of generic cgroup hierarchy iterators, css_id is being phased out. It was unnecessarily complex, id'ing the wrong thing (cgroups need IDs, not CSSes) and has other oddities like not being available at ->css_alloc(). This patch adds cgroup->id, which is a simple per-hierarchy ida-allocated ID which is assigned before ->css_alloc() and released after ->css_free(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com>
2012-11-19cgroup, cpuset: remove cgroup_subsys->post_clone()Tejun Heo1-1/+0
Currently CGRP_CPUSET_CLONE_CHILDREN triggers ->post_clone(). Now that clone_children is cpuset specific, there's no reason to have this rather odd option activation mechanism in cgroup core. cpuset can check the flag from its ->css_allocate() and take the necessary action. Move cpuset_post_clone() logic to the end of cpuset_css_alloc() and remove cgroup_subsys->post_clone(). Loosely based on Glauber's "generalize post_clone into post_create" patch. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Glauber Costa <glommer@parallels.com> Original-patch: <1351686554-22592-2-git-send-email-glommer@parallels.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>
2012-11-19cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/Tejun Heo1-2/+4
clone_children is only meaningful for cpuset and will stay that way. Rename the flag to reflect that and update documentation. Also, drop clone_children() wrapper in cgroup.c. The thin wrapper is used only a few times and one of them will go away soon. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>
2012-11-19cgroup: rename ->create/post_create/pre_destroy/destroy() to ↵Tejun Heo1-17/+18
->css_alloc/online/offline/free() Rename cgroup_subsys css lifetime related callbacks to better describe what their roles are. Also, update documentation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19cgroup: allow ->post_create() to failTejun Heo1-1/+1
There could be cases where controllers want to do initialization operations which may fail from ->post_create(). This patch makes ->post_create() return -errno to indicate failure and online_css() relay such failures. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>
2012-11-19cgroup: introduce CSS_ONLINE flag and on/offline_css() helpersTejun Heo1-0/+1
New helpers on/offline_css() respectively wrap ->post_create() and ->pre_destroy() invocations. online_css() sets CSS_ONLINE after ->post_create() is complete and offline_css() invokes ->pre_destroy() iff CSS_ONLINE is set and clears it while also handling the temporary dropping of cgroup_mutex. This patch doesn't introduce any behavior change at the moment but will be used to improve cgroup_create() failure path and allow ->post_create() to fail. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19cgroup: make CSS_* flags bit masks instead of bit positionsTejun Heo1-4/+4
Currently, CSS_* flags are defined as bit positions and manipulated using atomic bitops. There's no reason to use atomic bitops for them and bit positions are clunkier to deal with than bit masks. Make CSS_* bit masks instead and use the usual C bitwise operators to access them. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19cgroup: cgroup->dentry isn't a RCU pointerTejun Heo1-1/+1
cgroup->dentry is marked and used as a RCU pointer; however, it isn't one - the final dentry put doesn't go through call_rcu(). cgroup and dentry share the same RCU freeing rule via synchronize_rcu() in cgroup_diput() (kfree_rcu() used on cgrp is unnecessary). If cgrp is accessible under RCU read lock, so is its dentry and dereferencing cgrp->dentry doesn't need any further RCU protection or annotation. While not being accurate, before the previous patch, the RCU accessors served a purpose as memory barriers - cgroup->dentry used to be assigned after the cgroup was made visible to cgroup_path(), so the assignment and dereferencing in cgroup_path() needed the memory barrier pair. Now that list_add_tail_rcu() happens after cgroup->dentry is assigned, this no longer is necessary. Remove the now unnecessary and misleading RCU annotations from cgroup->dentry. To make up for the removal of rcu_dereference_check() in cgroup_path(), add an explicit rcu_lockdep_assert(), which asserts the dereference rule of @cgrp, not cgrp->dentry. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-09cgroup: implement generic child / descendant walk macrosTejun Heo1-0/+94
Currently, cgroup doesn't provide any generic helper for walking a given cgroup's children or descendants. This patch adds the following three macros. * cgroup_for_each_child() - walk immediate children of a cgroup. * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup in pre-order tree traversal. * cgroup_for_each_descendant_post() - visit all descendants of a cgroup in post-order tree traversal. All three only require the user to hold RCU read lock during traversal. Verifying that each iterated cgroup is online is the responsibility of the user. When used with proper synchronization, cgroup_for_each_descendant_pre() can be used to propagate state updates to descendants in reliable way. See comments for details. v2: s/config/state/ in commit message and comments per Michal. More documentation on synchronization rules. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-09cgroup: use rculist ops for cgroup->childrenTejun Heo1-0/+1
Use RCU safe list operations for cgroup->children. This will be used to implement cgroup children / descendant walking which can be used by controllers. Note that cgroup_create() now puts a new cgroup at the end of the ->children list instead of head. This isn't strictly necessary but is done so that the iteration order is more conventional. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-09cgroup: add cgroup_subsys->post_create()Tejun Heo1-0/+1
Currently, there's no way for a controller to find out whether a new cgroup finished all ->create() allocatinos successfully and is considered "live" by cgroup. This becomes a problem later when we add generic descendants walking to cgroup which can be used by controllers as controllers don't have a synchronization point where it can synchronize against new cgroups appearing in such walks. This patch adds ->post_create(). It's called after all ->create() succeeded and the cgroup is linked into the generic cgroup hierarchy. This plays the counterpart of ->pre_destroy(). When used in combination with the to-be-added generic descendant iterators, ->post_create() can be used to implement reliable state inheritance. It will be explained with the descendant iterators. v2: Added a paragraph about its future use w/ descendant iterators per Michal. v3: Forgot to add ->post_create() invocation to cgroup_load_subsys(). Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>
2012-11-05Merge branch 'cgroup-rmdir-updates' into cgroup/for-3.8Tejun Heo1-40/+1
Pull rmdir updates into for-3.8 so that further callback updates can be put on top. This pull created a trivial conflict between the following two commits. 8c7f6edbda ("cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them") ed95779340 ("cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs") The former added a field to cgroup_subsys and the latter removed one from it. They happen to be colocated causing the conflict. Keeping what's added and removing what's removed resolves the conflict. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-11-05cgroup: make ->pre_destroy() return voidTejun Heo1-1/+1
All ->pre_destory() implementations return 0 now, which is the only allowed return value. Make it return void. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com>
2012-11-05cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and ↵Tejun Heo1-21/+0
cgroup_release_and_wakeup_rmdir() CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup destruction rollback somewhat working. cgroup_rmdir() used to drain CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and helpers were used to allow the task performing rmdir to wait for the next relevant event. Unfortunately, the wait is visible to controllers too and the mechanism got exposed to memcg by 887032670d ("cgroup avoid permanent sleep at rmdir"). Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is unnecessary. Remove it and all the mechanisms supporting it. Note that memcontrol.c changes are essentially revert of 887032670d ("cgroup avoid permanent sleep at rmdir"). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com>
2012-11-05cgroup: kill CSS_REMOVEDTejun Heo1-6/+0
CSS_REMOVED is one of the several contortions which were necessary to support css reference draining on cgroup removal. All css->refcnts which need draining should be deactivated and verified to equal zero atomically w.r.t. css_tryget(). If any one isn't zero, all refcnts needed to be re-activated and css_tryget() shouldn't fail in the process. This was achieved by letting css_tryget() busy-loop until either the refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set (committing to removal). Now that css refcnt draining is no longer used, there's no need for atomic rollback mechanism. css_tryget() simply can look at the reference count and fail if it's deactivated - it's never getting re-activated. This patch removes CSS_REMOVED and updates __css_tryget() to fail if the refcnt is deactivated. As deactivation and removal are a single step now, they no longer need to be protected against css_tryget() happening from irq context. Remove local_irq_disable/enable() from cgroup_rmdir(). Note that this removes css_is_removed() whose only user is VM_BUG_ON() in memcontrol.c. We can replace it with a check on the refcnt but given that the only use case is a debug assert, I think it's better to simply unexport it. v2: Comment updated and explanation on local_irq_disable/enable() added per Michal Hocko. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
2012-11-05cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refsTejun Heo1-12/+0
2ef37d3fe4 ("memcg: Simplify mem_cgroup_force_empty_list error handling") removed the last user of __DEPRECATED_clear_css_refs. This patch removes __DEPRECATED_clear_css_refs and mechanisms to support it. * Conditionals dependent on __DEPRECATED_clear_css_refs removed. * cgroup_clear_css_refs() can no longer fail. All that needs to be done are deactivating refcnts, setting CSS_REMOVED and putting the base reference on each css. Remove cgroup_clear_css_refs() and the failure path, and open-code the loops into cgroup_rmdir(). This patch keeps the two for_each_subsys() loops separate while open coding them. They can be merged now but there are scheduled changes which need them to be separate, so keep them separate to reduce the amount of churn. local_irq_save/restore() from cgroup_clear_css_refs() are replaced with local_irq_disable/enable() for simplicity. This is safe as cgroup_rmdir() is always called with IRQ enabled. Note that this IRQ switching is necessary to ensure that css_tryget() isn't called from IRQ context on the same CPU while lower context is between CSS deactivation and setting CSS_REMOVED as css_tryget() would hang forever in such cases waiting for CSS to be re-activated or CSS_REMOVED set. This will go away soon. v2: cgroup_call_pre_destroy() removal dropped per Michal. Commit message updated to explain local_irq_disable/enable() conversion. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com>
2012-10-17cgroup: cgroup_subsys->fork() should be called after the task is added to ↵Tejun Heo1-1/+0
css_set cgroup core has a bug which violates a basic rule about event notifications - when a new entity needs to be added, you add that to the notification list first and then make the new entity conform to the current state. If done in the reverse order, an event happening inbetween will be lost. cgroup_subsys->fork() is invoked way before the new task is added to the css_set. Currently, cgroup_freezer is the only user of ->fork() and uses it to make new tasks conform to the current state of the freezer. If FROZEN state is requested while fork is in progress between cgroup_fork_callbacks() and cgroup_post_fork(), the child could escape freezing - the cgroup isn't frozen when ->fork() is called and the freezer couldn't see the new task on the css_set. This patch moves cgroup_subsys->fork() invocation to cgroup_post_fork() after the new task is added to the css_set. cgroup_fork_callbacks() is removed. Because now a task may be migrated during cgroup_subsys->fork(), freezer_fork() is updated so that it adheres to the usual RCU locking and the rather pointless comment on why locking can be different there is removed (if it doesn't make anything simpler, why even bother?). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@vger.kernel.org
2012-10-02Merge branch 'for-3.7-hierarchy' of ↵Linus Torvalds1-0/+15
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup hierarchy update from Tejun Heo: "Currently, different cgroup subsystems handle nested cgroups completely differently. There's no consistency among subsystems and the behaviors often are outright broken. People at least seem to agree that the broken hierarhcy behaviors need to be weeded out if any progress is gonna be made on this front and that the fallouts from deprecating the broken behaviors should be acceptable especially given that the current behaviors don't make much sense when nested. This patch makes cgroup emit warning messages if cgroups for subsystems with broken hierarchy behavior are nested to prepare for fixing them in the future. This was put in a separate branch because more related changes were expected (didn't make it this round) and the memory cgroup wanted to pull in this and make changes on top." * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
2012-09-14cgroup: mark subsystems with broken hierarchy support and whine if cgroups ↵Tejun Heo1-0/+15
are nested for them Currently, cgroup hierarchy support is a mess. cpu related subsystems behave correctly - configuration, accounting and control on a parent properly cover its children. blkio and freezer completely ignore hierarchy and treat all cgroups as if they're directly under the root cgroup. Others show yet different behaviors. These differing interpretations of cgroup hierarchy make using cgroup confusing and it impossible to co-mount controllers into the same hierarchy and obtain sane behavior. Eventually, we want full hierarchy support from all subsystems and probably a unified hierarchy. Users using separate hierarchies expecting completely different behaviors depending on the mounted subsystem is deterimental to making any progress on this front. This patch adds cgroup_subsys.broken_hierarchy and sets it to %true for controllers which are lacking in hierarchy support. The goal of this patch is two-fold. * Move users away from using hierarchy on currently non-hierarchical subsystems, so that implementing proper hierarchy support on those doesn't surprise them. * Keep track of which controllers are broken how and nudge the subsystems to implement proper hierarchy support. For now, start with a single warning message. We can whine louder later on. v2: Fixed a typo spotted by Michal. Warning message updated. v3: Updated memcg part so that it doesn't generate warning in the cases where .use_hierarchy=false doesn't make the behavior different from root.use_hierarchy=true. Fixed a typo spotted by Glauber. v4: Check ->broken_hierarchy after cgroup creation is complete so that ->create() can affect the result per Michal. Dropped unnecessary memcg root handling per Michal. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Graf <tgraf@suug.ch> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2012-09-14cgroup: Define CGROUP_SUBSYS_COUNT according the configurationDaniel Wagner1-7/+1
Since we know exactly how many subsystems exists at compile time we are able to define CGROUP_SUBSYS_COUNT correctly. CGROUP_SUBSYS_COUNT will be at max 12 (all controllers enabled). Depending on the architecture we safe either 32 - 12 pointers (80 bytes) or 64 - 12 pointers (416 bytes) per cgroup. With this change we can also remove the temporary placeholder to avoid compilation errors. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: netdev@vger.kernel.org Cc: cgroups@vger.kernel.org
2012-09-14cgroup: Assign subsystem IDs during compile timeDaniel Wagner1-1/+1
WARNING: With this change it is impossible to load external built controllers anymore. In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is set, corresponding subsys_id should also be a constant. Up to now, net_prio_subsys_id and net_cls_subsys_id would be of the type int and the value would be assigned during runtime. By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN to IS_ENABLED, all *_subsys_id will have constant value. That means we need to remove all the code which assumes a value can be assigned to net_prio_subsys_id and net_cls_subsys_id. A close look is necessary on the RCU part which was introduces by following patch: commit f845172531fb7410c7fb7780b1a6e51ee6df7d52 Author: Herbert Xu <herbert@gondor.apana.org.au> Mon May 24 09:12:34 2010 Committer: David S. Miller <davem@davemloft.net> Mon May 24 09:12:34 2010 cls_cgroup: Store classid in struct sock Tis code was added to init_cgroup_cls() /* We can't use rcu_assign_pointer because this is an int. */ smp_wmb(); net_cls_subsys_id = net_cls_subsys.subsys_id; respectively to exit_cgroup_cls() net_cls_subsys_id = -1; synchronize_rcu(); and in module version of task_cls_classid() rcu_read_lock(); id = rcu_dereference(net_cls_subsys_id); if (id >= 0) classid = container_of(task_subsys_state(p, id), struct cgroup_cls_state, css)->classid; rcu_read_unlock(); Without an explicit explaination why the RCU part is needed. (The rcu_deference was fixed by exchanging it to rcu_derefence_index_check() in a later commit, but that is a minor detail.) So here is my pondering why it was introduced and why it safe to remove it now. Note that this code was copied over to net_prio the reasoning holds for that subsystem too. The idea behind the RCU use for net_cls_subsys_id is to make sure we get a valid pointer back from task_subsys_state(). task_subsys_state() is just blindly accessing the subsys array and returning the pointer. Obviously, passing in -1 as id into task_subsys_state() returns an invalid value (out of lower bound). So this code makes sure that only after module is loaded and the subsystem registered, the id is assigned. Before unregistering the module all old readers must have left the critical section. This is done by assigning -1 to the id and issuing a synchronized_rcu(). Any new readers wont call task_subsys_state() anymore and therefore it is safe to unregister the subsystem. The new code relies on the same trick, but it looks at the subsys pointer return by task_subsys_state() (remember the id is constant and therefore we allways have a valid index into the subsys array). No precautions need to be taken during module loading module. Eventually, all CPUs will get a valid pointer back from task_subsys_state() because rebind_subsystem() which is called after the module init() function will assigned subsys[net_cls_subsys_id] the newly loaded module subsystem pointer. When the subsystem is about to be removed, rebind_subsystem() will called before the module exit() function. In this case, rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer and then it calls synchronize_rcu(). All old readers have left by then the critical section. Any new reader wont access the subsystem anymore. At this point we are safe to unregister the subsystem. No synchronize_rcu() call is needed. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: netdev@vger.kernel.org Cc: cgroups@vger.kernel.org
2012-09-14cgroup: Wrap subsystem selection macroDaniel Wagner1-0/+4
Before we are able to define all subsystem ids at compile time we need a more fine grained control what gets defined when we include cgroup_subsys.h. For example we define the enums for the subsystems or to declare for struct cgroup_subsys (builtin subsystem) by including cgroup_subsys.h and defining SUBSYS accordingly. Currently, the decision if a subsys is used is defined inside the header by testing if CONFIG_*=y is true. By moving this test outside of cgroup_subsys.h we are able to control it on the include level. This is done by introducing IS_SUBSYS_ENABLED which then is defined according the task, e.g. is CONFIG_*=y or CONFIG_*=m. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: netdev@vger.kernel.org Cc: cgroups@vger.kernel.org
2012-09-14cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNTDaniel Wagner1-1/+1
CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when looping over the subsys array looking either at the builtin or the module subsystems. Since all the builtin subsystems have an id which is lower then CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids are sorted. We are about to change id assignment to happen only at compile time later in this series. That means we can't rely on the above trick since all ids will always be defined at compile time. Furthermore, ordering the builtin subsystems and the module subsystems is not really necessary. So we need a different way to know which subsystem is a builtin or a module one. We can use the subsys[]->module pointer for this. Any place where we need to know if a subsys is module we just check for the pointer. If it is NULL then the subsystem is a builtin one. With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT enum. Though we need to introduce a temporary placeholder so that we don't get a compilation error when only CONFIG_CGROUP is selected and no single controller. An empty enum definition is not valid. Later in this series we are able to remove the placeholder again. And with this change we get a fix for this: kernel/cgroup.c: In function ‘cgroup_load_subsys’: kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds] when CONFIG_CGROUP=y and no built in controller was enabled. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: netdev@vger.kernel.org Cc: cgroups@vger.kernel.org