From e7e79d147cd7fe21fa78098fa57d80d09f00d8c1 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 16 Jul 2017 21:32:38 -0400 Subject: cgroup: update outdated cgroup.procs documentation 576dd464505f ("cgroup: drop the matching uid requirement on migration for cgroup v2") dropped the uid match requirement from "cgroup.procs" perm check but forgot to update the matching entry in the documentation. Update it. Signed-off-by: Tejun Heo --- Documentation/cgroup-v2.txt | 3 --- 1 file changed, 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index bde177103567..f01f831a3b11 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -658,9 +658,6 @@ All cgroup core files are prefixed with "cgroup." the PID to the cgroup. The writer should match all of the following conditions. - - Its euid is either root or must match either uid or suid of - the target process. - - It must have write access to the "cgroup.procs" file. - It must have write access to the "cgroup.procs" file of the -- cgit v1.2.3 From 8cfd8147df67e741d93b8783a3ea8f3c74f93a0e Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 21 Jul 2017 11:14:51 -0400 Subject: cgroup: implement cgroup v2 thread support This patch implements cgroup v2 thread support. The goal of the thread mode is supporting hierarchical accounting and control at thread granularity while staying inside the resource domain model which allows coordination across different resource controllers and handling of anonymous resource consumptions. A cgroup is always created as a domain and can be made threaded by writing to the "cgroup.type" file. When a cgroup becomes threaded, it becomes a member of a threaded subtree which is anchored at the closest ancestor which isn't threaded. The threads of the processes which are in a threaded subtree can be placed anywhere without being restricted by process granularity or no-internal-process constraint. Note that the threads aren't allowed to escape to a different threaded subtree. To be used inside a threaded subtree, a controller should explicitly support threaded mode and be able to handle internal competition in the way which is appropriate for the resource. The root of a threaded subtree, the nearest ancestor which isn't threaded, is called the threaded domain and serves as the resource domain for the whole subtree. This is the last cgroup where domain controllers are operational and where all the domain-level resource consumptions in the subtree are accounted. This allows threaded controllers to operate at thread granularity when requested while staying inside the scope of system-level resource distribution. As the root cgroup is exempt from the no-internal-process constraint, it can serve as both a threaded domain and a parent to normal cgroups, so, unlike non-root cgroups, the root cgroup can have both domain and threaded children. Internally, in a threaded subtree, each css_set has its ->dom_cset pointing to a matching css_set which belongs to the threaded domain. This ensures that thread root level cgroup_subsys_state for all threaded controllers are readily accessible for domain-level operations. This patch enables threaded mode for the pids and perf_events controllers. Neither has to worry about domain-level resource consumptions and it's enough to simply set the flag. For more details on the interface and behavior of the thread mode, please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added by this patch. v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create(). Spotted by Waiman. - Documentation updated as suggested by Waiman. - cgroup.type content slightly reformatted. - Mark the debug controller threaded. v4: - Updated to the general idea of marking specific cgroups domain/threaded as suggested by PeterZ. v3: - Dropped "join" and always make mixed children join the parent's threaded subtree. v2: - After discussions with Waiman, support for mixed thread mode is added. This should address the issue that Peter pointed out where any nesting should be avoided for thread subtrees while coexisting with other domain cgroups. - Enabling / disabling thread mode now piggy backs on the existing control mask update mechanism. - Bug fixes and cleanup. Signed-off-by: Tejun Heo Cc: Waiman Long Cc: Peter Zijlstra --- Documentation/cgroup-v2.txt | 185 +++++++++++++++++++-- include/linux/cgroup-defs.h | 12 ++ kernel/cgroup/cgroup-internal.h | 2 +- kernel/cgroup/cgroup-v1.c | 5 +- kernel/cgroup/cgroup.c | 355 +++++++++++++++++++++++++++++++++++++--- kernel/cgroup/debug.c | 1 + kernel/cgroup/pids.c | 1 + kernel/events/core.c | 1 + 8 files changed, 522 insertions(+), 40 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index f01f831a3b11..cb9ea281ab72 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -18,7 +18,9 @@ v1 is available under Documentation/cgroup-v1/. 1-2. What is cgroup? 2. Basic Operations 2-1. Mounting - 2-2. Organizing Processes + 2-2. Organizing Processes and Threads + 2-2-1. Processes + 2-2-2. Threads 2-3. [Un]populated Notification 2-4. Controlling Controllers 2-4-1. Enabling and Disabling @@ -167,8 +169,11 @@ cgroup v2 currently supports the following mount options. Delegation section for details. -Organizing Processes --------------------- +Organizing Processes and Threads +-------------------------------- + +Processes +~~~~~~~~~ Initially, only the root cgroup exists to which all processes belong. A child cgroup can be created by creating a sub-directory:: @@ -219,6 +224,104 @@ is removed subsequently, " (deleted)" is appended to the path:: 0::/test-cgroup/test-cgroup-nested (deleted) +Threads +~~~~~~~ + +cgroup v2 supports thread granularity for a subset of controllers to +support use cases requiring hierarchical resource distribution across +the threads of a group of processes. By default, all threads of a +process belong to the same cgroup, which also serves as the resource +domain to host resource consumptions which are not specific to a +process or thread. The thread mode allows threads to be spread across +a subtree while still maintaining the common resource domain for them. + +Controllers which support thread mode are called threaded controllers. +The ones which don't are called domain controllers. + +Marking a cgroup threaded makes it join the resource domain of its +parent as a threaded cgroup. The parent may be another threaded +cgroup whose resource domain is further up in the hierarchy. The root +of a threaded subtree, that is, the nearest ancestor which is not +threaded, is called threaded domain or thread root interchangeably and +serves as the resource domain for the entire subtree. + +Inside a threaded subtree, threads of a process can be put in +different cgroups and are not subject to the no internal process +constraint - threaded controllers can be enabled on non-leaf cgroups +whether they have threads in them or not. + +As the threaded domain cgroup hosts all the domain resource +consumptions of the subtree, it is considered to have internal +resource consumptions whether there are processes in it or not and +can't have populated child cgroups which aren't threaded. Because the +root cgroup is not subject to no internal process constraint, it can +serve both as a threaded domain and a parent to domain cgroups. + +The current operation mode or type of the cgroup is shown in the +"cgroup.type" file which indicates whether the cgroup is a normal +domain, a domain which is serving as the domain of a threaded subtree, +or a threaded cgroup. + +On creation, a cgroup is always a domain cgroup and can be made +threaded by writing "threaded" to the "cgroup.type" file. The +operation is single direction:: + + # echo threaded > cgroup.type + +Once threaded, the cgroup can't be made a domain again. To enable the +thread mode, the following conditions must be met. + +- As the cgroup will join the parent's resource domain. The parent + must either be a valid (threaded) domain or a threaded cgroup. + +- The cgroup must be empty. No enabled controllers, child cgroups or + processes. + +Topology-wise, a cgroup can be in an invalid state. Please consider +the following toplogy:: + + A (threaded domain) - B (threaded) - C (domain, just created) + +C is created as a domain but isn't connected to a parent which can +host child domains. C can't be used until it is turned into a +threaded cgroup. "cgroup.type" file will report "domain (invalid)" in +these cases. Operations which fail due to invalid topology use +EOPNOTSUPP as the errno. + +A domain cgroup is turned into a threaded domain when one of its child +cgroup becomes threaded or threaded controllers are enabled in the +"cgroup.subtree_control" file while there are processes in the cgroup. +A threaded domain reverts to a normal domain when the conditions +clear. + +When read, "cgroup.threads" contains the list of the thread IDs of all +threads in the cgroup. Except that the operations are per-thread +instead of per-process, "cgroup.threads" has the same format and +behaves the same way as "cgroup.procs". While "cgroup.threads" can be +written to in any cgroup, as it can only move threads inside the same +threaded domain, its operations are confined inside each threaded +subtree. + +The threaded domain cgroup serves as the resource domain for the whole +subtree, and, while the threads can be scattered across the subtree, +all the processes are considered to be in the threaded domain cgroup. +"cgroup.procs" in a threaded domain cgroup contains the PIDs of all +processes in the subtree and is not readable in the subtree proper. +However, "cgroup.procs" can be written to from anywhere in the subtree +to migrate all threads of the matching process to the cgroup. + +Only threaded controllers can be enabled in a threaded subtree. When +a threaded controller is enabled inside a threaded subtree, it only +accounts for and controls resource consumptions associated with the +threads in the cgroup and its descendants. All consumptions which +aren't tied to a specific thread belong to the threaded domain cgroup. + +Because a threaded subtree is exempt from no internal process +constraint, a threaded controller must be able to handle competition +between threads in a non-leaf cgroup and its child cgroups. Each +threaded controller defines how such competitions are handled. + + [Un]populated Notification -------------------------- @@ -302,15 +405,15 @@ disabled if one or more children have it enabled. No Internal Process Constraint ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Non-root cgroups can only distribute resources to their children when -they don't have any processes of their own. In other words, only -cgroups which don't contain any processes can have controllers enabled -in their "cgroup.subtree_control" files. +Non-root cgroups can distribute domain resources to their children +only when they don't have any processes of their own. In other words, +only domain cgroups which don't contain any processes can have domain +controllers enabled in their "cgroup.subtree_control" files. -This guarantees that, when a controller is looking at the part of the -hierarchy which has it enabled, processes are always only on the -leaves. This rules out situations where child cgroups compete against -internal processes of the parent. +This guarantees that, when a domain controller is looking at the part +of the hierarchy which has it enabled, processes are always only on +the leaves. This rules out situations where child cgroups compete +against internal processes of the parent. The root cgroup is exempt from this restriction. Root contains processes and anonymous resource consumption which can't be associated @@ -334,10 +437,10 @@ Model of Delegation ~~~~~~~~~~~~~~~~~~~ A cgroup can be delegated in two ways. First, to a less privileged -user by granting write access of the directory and its "cgroup.procs" -and "cgroup.subtree_control" files to the user. Second, if the -"nsdelegate" mount option is set, automatically to a cgroup namespace -on namespace creation. +user by granting write access of the directory and its "cgroup.procs", +"cgroup.threads" and "cgroup.subtree_control" files to the user. +Second, if the "nsdelegate" mount option is set, automatically to a +cgroup namespace on namespace creation. Because the resource control interface files in a given directory control the distribution of the parent's resources, the delegatee @@ -644,6 +747,29 @@ Core Interface Files All cgroup core files are prefixed with "cgroup." + cgroup.type + + A read-write single value file which exists on non-root + cgroups. + + When read, it indicates the current type of the cgroup, which + can be one of the following values. + + - "domain" : A normal valid domain cgroup. + + - "domain threaded" : A threaded domain cgroup which is + serving as the root of a threaded subtree. + + - "domain invalid" : A cgroup which is in an invalid state. + It can't be populated or have controllers enabled. It may + be allowed to become a threaded cgroup. + + - "threaded" : A threaded cgroup which is a member of a + threaded subtree. + + A cgroup can be turned into a threaded cgroup by writing + "threaded" to this file. + cgroup.procs A read-write new-line separated values file which exists on all cgroups. @@ -666,6 +792,35 @@ All cgroup core files are prefixed with "cgroup." When delegating a sub-hierarchy, write access to this file should be granted along with the containing directory. + In a threaded cgroup, reading this file fails with EOPNOTSUPP + as all the processes belong to the thread root. Writing is + supported and moves every thread of the process to the cgroup. + + cgroup.threads + A read-write new-line separated values file which exists on + all cgroups. + + When read, it lists the TIDs of all threads which belong to + the cgroup one-per-line. The TIDs are not ordered and the + same TID may show up more than once if the thread got moved to + another cgroup and then back or the TID got recycled while + reading. + + A TID can be written to migrate the thread associated with the + TID to the cgroup. The writer should match all of the + following conditions. + + - It must have write access to the "cgroup.threads" file. + + - The cgroup that the thread is currently in must be in the + same resource domain as the destination cgroup. + + - It must have write access to the "cgroup.procs" file of the + common ancestor of the source and destination cgroups. + + When delegating a sub-hierarchy, write access to this file + should be granted along with the containing directory. + cgroup.controllers A read-only space separated values file which exists on all cgroups. diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 651c4363c85e..9d741959f218 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -521,6 +521,18 @@ struct cgroup_subsys { */ bool implicit_on_dfl:1; + /* + * If %true, the controller, supports threaded mode on the default + * hierarchy. In a threaded subtree, both process granularity and + * no-internal-process constraint are ignored and a threaded + * controllers should be able to handle that. + * + * Note that as an implicit controller is automatically enabled on + * all cgroups on the default hierarchy, it should also be + * threaded. implicit && !threaded is not supported. + */ + bool threaded:1; + /* * If %false, this subsystem is properly hierarchical - * configuration, resource accounting and restriction on a parent diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 0e81c6109e91..f10eb19ddf04 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -170,7 +170,7 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags, struct cgroup_root *root, unsigned long magic, struct cgroup_namespace *ns); -bool cgroup_may_migrate_to(struct cgroup *dst_cgrp); +int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp); void cgroup_migrate_finish(struct cgroup_mgctx *mgctx); void cgroup_migrate_add_src(struct css_set *src_cset, struct cgroup *dst_cgrp, struct cgroup_mgctx *mgctx); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 167aaab04bf9..f0e8601b13cb 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -99,8 +99,9 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) if (cgroup_on_dfl(to)) return -EINVAL; - if (!cgroup_may_migrate_to(to)) - return -EBUSY; + ret = cgroup_migrate_vet_dst(to); + if (ret) + return ret; mutex_lock(&cgroup_mutex); diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index a1d59af274a9..c396e701c206 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -162,6 +162,9 @@ static u16 cgrp_dfl_inhibit_ss_mask; /* some controllers are implicitly enabled on the default hierarchy */ static u16 cgrp_dfl_implicit_ss_mask; +/* some controllers can be threaded on the default hierarchy */ +static u16 cgrp_dfl_threaded_ss_mask; + /* The list of hierarchy roots */ LIST_HEAD(cgroup_roots); static int cgroup_root_count; @@ -335,14 +338,93 @@ static bool cgroup_is_threaded(struct cgroup *cgrp) return cgrp->dom_cgrp != cgrp; } +/* can @cgrp host both domain and threaded children? */ +static bool cgroup_is_mixable(struct cgroup *cgrp) +{ + /* + * Root isn't under domain level resource control exempting it from + * the no-internal-process constraint, so it can serve as a thread + * root and a parent of resource domains at the same time. + */ + return !cgroup_parent(cgrp); +} + +/* can @cgrp become a thread root? should always be true for a thread root */ +static bool cgroup_can_be_thread_root(struct cgroup *cgrp) +{ + /* mixables don't care */ + if (cgroup_is_mixable(cgrp)) + return true; + + /* domain roots can't be nested under threaded */ + if (cgroup_is_threaded(cgrp)) + return false; + + /* can only have either domain or threaded children */ + if (cgrp->nr_populated_domain_children) + return false; + + /* and no domain controllers can be enabled */ + if (cgrp->subtree_control & ~cgrp_dfl_threaded_ss_mask) + return false; + + return true; +} + +/* is @cgrp root of a threaded subtree? */ +static bool cgroup_is_thread_root(struct cgroup *cgrp) +{ + /* thread root should be a domain */ + if (cgroup_is_threaded(cgrp)) + return false; + + /* a domain w/ threaded children is a thread root */ + if (cgrp->nr_threaded_children) + return true; + + /* + * A domain which has tasks and explicit threaded controllers + * enabled is a thread root. + */ + if (cgroup_has_tasks(cgrp) && + (cgrp->subtree_control & cgrp_dfl_threaded_ss_mask)) + return true; + + return false; +} + +/* a domain which isn't connected to the root w/o brekage can't be used */ +static bool cgroup_is_valid_domain(struct cgroup *cgrp) +{ + /* the cgroup itself can be a thread root */ + if (cgroup_is_threaded(cgrp)) + return false; + + /* but the ancestors can't be unless mixable */ + while ((cgrp = cgroup_parent(cgrp))) { + if (!cgroup_is_mixable(cgrp) && cgroup_is_thread_root(cgrp)) + return false; + if (cgroup_is_threaded(cgrp)) + return false; + } + + return true; +} + /* subsystems visibly enabled on a cgroup */ static u16 cgroup_control(struct cgroup *cgrp) { struct cgroup *parent = cgroup_parent(cgrp); u16 root_ss_mask = cgrp->root->subsys_mask; - if (parent) - return parent->subtree_control; + if (parent) { + u16 ss_mask = parent->subtree_control; + + /* threaded cgroups can only have threaded controllers */ + if (cgroup_is_threaded(cgrp)) + ss_mask &= cgrp_dfl_threaded_ss_mask; + return ss_mask; + } if (cgroup_on_dfl(cgrp)) root_ss_mask &= ~(cgrp_dfl_inhibit_ss_mask | @@ -355,8 +437,14 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp) { struct cgroup *parent = cgroup_parent(cgrp); - if (parent) - return parent->subtree_ss_mask; + if (parent) { + u16 ss_mask = parent->subtree_ss_mask; + + /* threaded cgroups can only have threaded controllers */ + if (cgroup_is_threaded(cgrp)) + ss_mask &= cgrp_dfl_threaded_ss_mask; + return ss_mask; + } return cgrp->root->subsys_mask; } @@ -2237,17 +2325,40 @@ out_release_tset: } /** - * cgroup_may_migrate_to - verify whether a cgroup can be migration destination + * cgroup_migrate_vet_dst - verify whether a cgroup can be migration destination * @dst_cgrp: destination cgroup to test * - * On the default hierarchy, except for the root, subtree_control must be - * zero for migration destination cgroups with tasks so that child cgroups - * don't compete against tasks. + * On the default hierarchy, except for the mixable, (possible) thread root + * and threaded cgroups, subtree_control must be zero for migration + * destination cgroups with tasks so that child cgroups don't compete + * against tasks. */ -bool cgroup_may_migrate_to(struct cgroup *dst_cgrp) +int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp) { - return !cgroup_on_dfl(dst_cgrp) || !cgroup_parent(dst_cgrp) || - !dst_cgrp->subtree_control; + /* v1 doesn't have any restriction */ + if (!cgroup_on_dfl(dst_cgrp)) + return 0; + + /* verify @dst_cgrp can host resources */ + if (!cgroup_is_valid_domain(dst_cgrp->dom_cgrp)) + return -EOPNOTSUPP; + + /* mixables don't care */ + if (cgroup_is_mixable(dst_cgrp)) + return 0; + + /* + * If @dst_cgrp is already or can become a thread root or is + * threaded, it doesn't matter. + */ + if (cgroup_can_be_thread_root(dst_cgrp) || cgroup_is_threaded(dst_cgrp)) + return 0; + + /* apply no-internal-process constraint */ + if (dst_cgrp->subtree_control) + return -EBUSY; + + return 0; } /** @@ -2452,8 +2563,9 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, struct task_struct *task; int ret; - if (!cgroup_may_migrate_to(dst_cgrp)) - return -EBUSY; + ret = cgroup_migrate_vet_dst(dst_cgrp); + if (ret) + return ret; /* look up all src csets */ spin_lock_irq(&css_set_lock); @@ -2881,6 +2993,46 @@ static void cgroup_finalize_control(struct cgroup *cgrp, int ret) cgroup_apply_control_disable(cgrp); } +static int cgroup_vet_subtree_control_enable(struct cgroup *cgrp, u16 enable) +{ + u16 domain_enable = enable & ~cgrp_dfl_threaded_ss_mask; + + /* if nothing is getting enabled, nothing to worry about */ + if (!enable) + return 0; + + /* can @cgrp host any resources? */ + if (!cgroup_is_valid_domain(cgrp->dom_cgrp)) + return -EOPNOTSUPP; + + /* mixables don't care */ + if (cgroup_is_mixable(cgrp)) + return 0; + + if (domain_enable) { + /* can't enable domain controllers inside a thread subtree */ + if (cgroup_is_thread_root(cgrp) || cgroup_is_threaded(cgrp)) + return -EOPNOTSUPP; + } else { + /* + * Threaded controllers can handle internal competitions + * and are always allowed inside a (prospective) thread + * subtree. + */ + if (cgroup_can_be_thread_root(cgrp) || cgroup_is_threaded(cgrp)) + return 0; + } + + /* + * Controllers can't be enabled for a cgroup with tasks to avoid + * child cgroups competing against tasks. + */ + if (cgroup_has_tasks(cgrp)) + return -EBUSY; + + return 0; +} + /* change the enabled child controllers for a cgroup in the default hierarchy */ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of, char *buf, size_t nbytes, @@ -2956,14 +3108,9 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of, goto out_unlock; } - /* - * Except for the root, subtree_control must be zero for a cgroup - * with tasks so that child cgroups don't compete against tasks. - */ - if (enable && cgroup_parent(cgrp) && cgroup_has_tasks(cgrp)) { - ret = -EBUSY; + ret = cgroup_vet_subtree_control_enable(cgrp, enable); + if (ret) goto out_unlock; - } /* save and update control masks and prepare csses */ cgroup_save_control(cgrp); @@ -2982,6 +3129,84 @@ out_unlock: return ret ?: nbytes; } +static int cgroup_enable_threaded(struct cgroup *cgrp) +{ + struct cgroup *parent = cgroup_parent(cgrp); + struct cgroup *dom_cgrp = parent->dom_cgrp; + int ret; + + lockdep_assert_held(&cgroup_mutex); + + /* noop if already threaded */ + if (cgroup_is_threaded(cgrp)) + return 0; + + /* we're joining the parent's domain, ensure its validity */ + if (!cgroup_is_valid_domain(dom_cgrp) || + !cgroup_can_be_thread_root(dom_cgrp)) + return -EOPNOTSUPP; + + /* + * Allow enabling thread mode only on empty cgroups to avoid + * implicit migrations and recursive operations. + */ + if (cgroup_has_tasks(cgrp) || css_has_online_children(&cgrp->self)) + return -EBUSY; + + /* + * The following shouldn't cause actual migrations and should + * always succeed. + */ + cgroup_save_control(cgrp); + + cgrp->dom_cgrp = dom_cgrp; + ret = cgroup_apply_control(cgrp); + if (!ret) + parent->nr_threaded_children++; + else + cgrp->dom_cgrp = cgrp; + + cgroup_finalize_control(cgrp, ret); + return ret; +} + +static int cgroup_type_show(struct seq_file *seq, void *v) +{ + struct cgroup *cgrp = seq_css(seq)->cgroup; + + if (cgroup_is_threaded(cgrp)) + seq_puts(seq, "threaded\n"); + else if (!cgroup_is_valid_domain(cgrp)) + seq_puts(seq, "domain invalid\n"); + else if (cgroup_is_thread_root(cgrp)) + seq_puts(seq, "domain threaded\n"); + else + seq_puts(seq, "domain\n"); + + return 0; +} + +static ssize_t cgroup_type_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct cgroup *cgrp; + int ret; + + /* only switching to threaded mode is supported */ + if (strcmp(strstrip(buf), "threaded")) + return -EINVAL; + + cgrp = cgroup_kn_lock_live(of->kn, false); + if (!cgrp) + return -ENOENT; + + /* threaded can only be enabled */ + ret = cgroup_enable_threaded(cgrp); + + cgroup_kn_unlock(of->kn); + return ret ?: nbytes; +} + static int cgroup_events_show(struct seq_file *seq, void *v) { seq_printf(seq, "populated %d\n", @@ -3867,12 +4092,12 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos) return css_task_iter_next(it); } -static void *cgroup_procs_start(struct seq_file *s, loff_t *pos) +static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos, + unsigned int iter_flags) { struct kernfs_open_file *of = s->private; struct cgroup *cgrp = seq_css(s)->cgroup; struct css_task_iter *it = of->priv; - unsigned iter_flags = CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED; /* * When a seq_file is seeked, it's always traversed sequentially @@ -3895,6 +4120,23 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos) return cgroup_procs_next(s, NULL, NULL); } +static void *cgroup_procs_start(struct seq_file *s, loff_t *pos) +{ + struct cgroup *cgrp = seq_css(s)->cgroup; + + /* + * All processes of a threaded subtree belong to the domain cgroup + * of the subtree. Only threads can be distributed across the + * subtree. Reject reads on cgroup.procs in the subtree proper. + * They're always empty anyway. + */ + if (cgroup_is_threaded(cgrp)) + return ERR_PTR(-EOPNOTSUPP); + + return __cgroup_procs_start(s, pos, CSS_TASK_ITER_PROCS | + CSS_TASK_ITER_THREADED); +} + static int cgroup_procs_show(struct seq_file *s, void *v) { seq_printf(s, "%d\n", task_pid_vnr(v)); @@ -3974,8 +4216,63 @@ out_unlock: return ret ?: nbytes; } +static void *cgroup_threads_start(struct seq_file *s, loff_t *pos) +{ + return __cgroup_procs_start(s, pos, 0); +} + +static ssize_t cgroup_threads_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct cgroup *src_cgrp, *dst_cgrp; + struct task_struct *task; + ssize_t ret; + + buf = strstrip(buf); + + dst_cgrp = cgroup_kn_lock_live(of->kn, false); + if (!dst_cgrp) + return -ENODEV; + + task = cgroup_procs_write_start(buf, false); + ret = PTR_ERR_OR_ZERO(task); + if (ret) + goto out_unlock; + + /* find the source cgroup */ + spin_lock_irq(&css_set_lock); + src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root); + spin_unlock_irq(&css_set_lock); + + /* thread migrations follow the cgroup.procs delegation rule */ + ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, + of->file->f_path.dentry->d_sb); + if (ret) + goto out_finish; + + /* and must be contained in the same domain */ + ret = -EOPNOTSUPP; + if (src_cgrp->dom_cgrp != dst_cgrp->dom_cgrp) + goto out_finish; + + ret = cgroup_attach_task(dst_cgrp, task, false); + +out_finish: + cgroup_procs_write_finish(task); +out_unlock: + cgroup_kn_unlock(of->kn); + + return ret ?: nbytes; +} + /* cgroup core interface files for the default hierarchy */ static struct cftype cgroup_base_files[] = { + { + .name = "cgroup.type", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = cgroup_type_show, + .write = cgroup_type_write, + }, { .name = "cgroup.procs", .flags = CFTYPE_NS_DELEGATABLE, @@ -3986,6 +4283,14 @@ static struct cftype cgroup_base_files[] = { .seq_show = cgroup_procs_show, .write = cgroup_procs_write, }, + { + .name = "cgroup.threads", + .release = cgroup_procs_release, + .seq_start = cgroup_threads_start, + .seq_next = cgroup_procs_next, + .seq_show = cgroup_procs_show, + .write = cgroup_threads_write, + }, { .name = "cgroup.controllers", .seq_show = cgroup_controllers_show, @@ -4753,11 +5058,17 @@ int __init cgroup_init(void) cgrp_dfl_root.subsys_mask |= 1 << ss->id; + /* implicit controllers must be threaded too */ + WARN_ON(ss->implicit_on_dfl && !ss->threaded); + if (ss->implicit_on_dfl) cgrp_dfl_implicit_ss_mask |= 1 << ss->id; else if (!ss->dfl_cftypes) cgrp_dfl_inhibit_ss_mask |= 1 << ss->id; + if (ss->threaded) + cgrp_dfl_threaded_ss_mask |= 1 << ss->id; + if (ss->dfl_cftypes == ss->legacy_cftypes) { WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes)); } else { diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c index dac46af22782..787a242fa69d 100644 --- a/kernel/cgroup/debug.c +++ b/kernel/cgroup/debug.c @@ -352,6 +352,7 @@ static int __init enable_cgroup_debug(char *str) { debug_cgrp_subsys.dfl_cftypes = debug_files; debug_cgrp_subsys.implicit_on_dfl = true; + debug_cgrp_subsys.threaded = true; return 1; } __setup("cgroup_debug", enable_cgroup_debug); diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c index 2237201d66d5..9829c67ebc0a 100644 --- a/kernel/cgroup/pids.c +++ b/kernel/cgroup/pids.c @@ -345,4 +345,5 @@ struct cgroup_subsys pids_cgrp_subsys = { .free = pids_free, .legacy_cftypes = pids_files, .dfl_cftypes = pids_files, + .threaded = true, }; diff --git a/kernel/events/core.c b/kernel/events/core.c index 1538df9b2b65..ec78247da310 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -11210,5 +11210,6 @@ struct cgroup_subsys perf_event_cgrp_subsys = { * controller is not mounted on a legacy hierarchy. */ .implicit_on_dfl = true, + .threaded = true, }; #endif /* CONFIG_CGROUP_PERF */ -- cgit v1.2.3 From 918a8c2c4ea4fab8b7855b8da48bbaf0a733ebb0 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 23 Jul 2017 08:18:26 -0400 Subject: cgroup: remove unnecessary empty check when enabling threaded mode cgroup_enable_threaded() checks that the cgroup doesn't have any tasks or children and fails the operation if so. This test is unnecessary because the first part is already checked by cgroup_can_be_thread_root() and the latter is unnecessary. The latter actually cause a behavioral oddity. Please consider the following hierarchy. All cgroups are domains. A / \ B C \ D If B is made threaded, C and D becomes invalid domains. Due to the no children restriction, threaded mode can't be enabled on C. For C and D, the only thing the user can do is removal. There is no reason for this restriction. Remove it. Acked-by: Waiman Long Signed-off-by: Tejun Heo --- Documentation/cgroup-v2.txt | 5 +++-- kernel/cgroup/cgroup.c | 7 ------- 2 files changed, 3 insertions(+), 9 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index cb9ea281ab72..dec5afdaa36d 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -274,8 +274,9 @@ thread mode, the following conditions must be met. - As the cgroup will join the parent's resource domain. The parent must either be a valid (threaded) domain or a threaded cgroup. -- The cgroup must be empty. No enabled controllers, child cgroups or - processes. +- When the parent is an unthreaded domain, it must not have any domain + controllers enabled or populated domain children. The root is + exempt from this requirement. Topology-wise, a cgroup can be in an invalid state. Please consider the following toplogy:: diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index e9a377dc5bdb..e0a558c4d358 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3146,13 +3146,6 @@ static int cgroup_enable_threaded(struct cgroup *cgrp) !cgroup_can_be_thread_root(dom_cgrp)) return -EOPNOTSUPP; - /* - * Allow enabling thread mode only on empty cgroups to avoid - * implicit migrations and recursive operations. - */ - if (cgroup_has_tasks(cgrp) || css_has_online_children(&cgrp->self)) - return -EBUSY; - /* * The following shouldn't cause actual migrations and should * always succeed. -- cgit v1.2.3 From 1a926e0bbab83bae8207d05a533173425e0496d1 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Fri, 28 Jul 2017 18:28:44 +0100 Subject: cgroup: implement hierarchy limits Creating cgroup hierearchies of unreasonable size can affect overall system performance. A user might want to limit the size of cgroup hierarchy. This is especially important if a user is delegating some cgroup sub-tree. To address this issue, introduce an ability to control the size of cgroup hierarchy. The cgroup.max.descendants control file allows to set the maximum allowed number of descendant cgroups. The cgroup.max.depth file controls the maximum depth of the cgroup tree. Both are single value r/w files, with "max" default value. The control files exist on each hierarchy level (including root). When a new cgroup is created, we check the total descendants and depth limits on each level, and if none of them are exceeded, a new cgroup is created. Only alive cgroups are counted, removed (dying) cgroups are ignored. Signed-off-by: Roman Gushchin Suggested-by: Tejun Heo Signed-off-by: Tejun Heo Cc: Zefan Li Cc: Waiman Long Cc: Johannes Weiner Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- Documentation/cgroup-v2.txt | 14 +++++ include/linux/cgroup-defs.h | 5 ++ kernel/cgroup/cgroup.c | 126 ++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 145 insertions(+) (limited to 'Documentation') diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index dec5afdaa36d..46ec3f76211c 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -854,6 +854,20 @@ All cgroup core files are prefixed with "cgroup." 1 if the cgroup or its descendants contains any live processes; otherwise, 0. + cgroup.max.descendants + A read-write single value files. The default is "max". + + Maximum allowed number of descent cgroups. + If the actual number of descendants is equal or larger, + an attempt to create a new cgroup in the hierarchy will fail. + + cgroup.max.depth + A read-write single value files. The default is "max". + + Maximum allowed descent depth below the current cgroup. + If the actual descent depth is equal or larger, + an attempt to create a new child cgroup will fail. + Controllers =========== diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 58b4c425a155..59e4ad9e7bac 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -273,13 +273,18 @@ struct cgroup { */ int level; + /* Maximum allowed descent tree depth */ + int max_depth; + /* * Keep track of total numbers of visible and dying descent cgroups. * Dying cgroups are cgroups which were deleted by a user, * but are still existing because someone else is holding a reference. + * max_descendants is a maximum allowed number of descent cgroups. */ int nr_descendants; int nr_dying_descendants; + int max_descendants; /* * Each non-empty css_set associated with this cgroup contributes diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index cfdbb1e780de..0fd9134e1720 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1827,6 +1827,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp) cgrp->self.cgroup = cgrp; cgrp->self.flags |= CSS_ONLINE; cgrp->dom_cgrp = cgrp; + cgrp->max_descendants = INT_MAX; + cgrp->max_depth = INT_MAX; for_each_subsys(ss, ssid) INIT_LIST_HEAD(&cgrp->e_csets[ssid]); @@ -3209,6 +3211,92 @@ static ssize_t cgroup_type_write(struct kernfs_open_file *of, char *buf, return ret ?: nbytes; } +static int cgroup_max_descendants_show(struct seq_file *seq, void *v) +{ + struct cgroup *cgrp = seq_css(seq)->cgroup; + int descendants = READ_ONCE(cgrp->max_descendants); + + if (descendants == INT_MAX) + seq_puts(seq, "max\n"); + else + seq_printf(seq, "%d\n", descendants); + + return 0; +} + +static ssize_t cgroup_max_descendants_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct cgroup *cgrp; + int descendants; + ssize_t ret; + + buf = strstrip(buf); + if (!strcmp(buf, "max")) { + descendants = INT_MAX; + } else { + ret = kstrtoint(buf, 0, &descendants); + if (ret) + return ret; + } + + if (descendants < 0 || descendants > INT_MAX) + return -ERANGE; + + cgrp = cgroup_kn_lock_live(of->kn, false); + if (!cgrp) + return -ENOENT; + + cgrp->max_descendants = descendants; + + cgroup_kn_unlock(of->kn); + + return nbytes; +} + +static int cgroup_max_depth_show(struct seq_file *seq, void *v) +{ + struct cgroup *cgrp = seq_css(seq)->cgroup; + int depth = READ_ONCE(cgrp->max_depth); + + if (depth == INT_MAX) + seq_puts(seq, "max\n"); + else + seq_printf(seq, "%d\n", depth); + + return 0; +} + +static ssize_t cgroup_max_depth_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct cgroup *cgrp; + ssize_t ret; + int depth; + + buf = strstrip(buf); + if (!strcmp(buf, "max")) { + depth = INT_MAX; + } else { + ret = kstrtoint(buf, 0, &depth); + if (ret) + return ret; + } + + if (depth < 0 || depth > INT_MAX) + return -ERANGE; + + cgrp = cgroup_kn_lock_live(of->kn, false); + if (!cgrp) + return -ENOENT; + + cgrp->max_depth = depth; + + cgroup_kn_unlock(of->kn); + + return nbytes; +} + static int cgroup_events_show(struct seq_file *seq, void *v) { seq_printf(seq, "populated %d\n", @@ -4309,6 +4397,16 @@ static struct cftype cgroup_base_files[] = { .file_offset = offsetof(struct cgroup, events_file), .seq_show = cgroup_events_show, }, + { + .name = "cgroup.max.descendants", + .seq_show = cgroup_max_descendants_show, + .write = cgroup_max_descendants_write, + }, + { + .name = "cgroup.max.depth", + .seq_show = cgroup_max_depth_show, + .write = cgroup_max_depth_write, + }, { } /* terminate */ }; @@ -4662,6 +4760,29 @@ out_free_cgrp: return ERR_PTR(ret); } +static bool cgroup_check_hierarchy_limits(struct cgroup *parent) +{ + struct cgroup *cgroup; + int ret = false; + int level = 1; + + lockdep_assert_held(&cgroup_mutex); + + for (cgroup = parent; cgroup; cgroup = cgroup_parent(cgroup)) { + if (cgroup->nr_descendants >= cgroup->max_descendants) + goto fail; + + if (level > cgroup->max_depth) + goto fail; + + level++; + } + + ret = true; +fail: + return ret; +} + int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode) { struct cgroup *parent, *cgrp; @@ -4676,6 +4797,11 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode) if (!parent) return -ENODEV; + if (!cgroup_check_hierarchy_limits(parent)) { + ret = -EAGAIN; + goto out_unlock; + } + cgrp = cgroup_create(parent); if (IS_ERR(cgrp)) { ret = PTR_ERR(cgrp); -- cgit v1.2.3 From ec39225cca42c05ac36853d11d28f877fde5c42e Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Wed, 2 Aug 2017 17:55:31 +0100 Subject: cgroup: add cgroup.stat interface with basic hierarchy stats A cgroup can consume resources even after being deleted by a user. For example, writing back dirty pages should be accounted and limited, despite the corresponding cgroup might contain no processes and being deleted by a user. In the current implementation a cgroup can remain in such "dying" state for an undefined amount of time. For instance, if a memory cgroup contains a pge, mlocked by a process belonging to an other cgroup. Although the lifecycle of a dying cgroup is out of user's control, it's important to have some insight of what's going on under the hood. In particular, it's handy to have a counter which will allow to detect css leaks. To solve this problem, add a cgroup.stat interface to the base cgroup control files with the following metrics: nr_descendants total number of visible descendant cgroups nr_dying_descendants total number of dying descendant cgroups Signed-off-by: Roman Gushchin Suggested-by: Tejun Heo Signed-off-by: Tejun Heo Cc: Zefan Li Cc: Waiman Long Cc: Johannes Weiner Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- Documentation/cgroup-v2.txt | 18 ++++++++++++++++++ kernel/cgroup/cgroup.c | 16 ++++++++++++++++ 2 files changed, 34 insertions(+) (limited to 'Documentation') diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index 46ec3f76211c..dc44785dc0fa 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -868,6 +868,24 @@ All cgroup core files are prefixed with "cgroup." If the actual descent depth is equal or larger, an attempt to create a new child cgroup will fail. + cgroup.stat + A read-only flat-keyed file with the following entries: + + nr_descendants + Total number of visible descendant cgroups. + + nr_dying_descendants + Total number of dying descendant cgroups. A cgroup becomes + dying after being deleted by a user. The cgroup will remain + in dying state for some time undefined time (which can depend + on system load) before being completely destroyed. + + A process can't enter a dying cgroup under any circumstances, + a dying cgroup can't revive. + + A dying cgroup can consume system resources not exceeding + limits, which were active at the moment of cgroup deletion. + Controllers =========== diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 0fd9134e1720..a06755a610e1 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3304,6 +3304,18 @@ static int cgroup_events_show(struct seq_file *seq, void *v) return 0; } +static int cgroup_stats_show(struct seq_file *seq, void *v) +{ + struct cgroup *cgroup = seq_css(seq)->cgroup; + + seq_printf(seq, "nr_descendants %d\n", + cgroup->nr_descendants); + seq_printf(seq, "nr_dying_descendants %d\n", + cgroup->nr_dying_descendants); + + return 0; +} + static int cgroup_file_open(struct kernfs_open_file *of) { struct cftype *cft = of->kn->priv; @@ -4407,6 +4419,10 @@ static struct cftype cgroup_base_files[] = { .seq_show = cgroup_max_depth_show, .write = cgroup_max_depth_write, }, + { + .name = "cgroup.stat", + .seq_show = cgroup_stats_show, + }, { } /* terminate */ }; -- cgit v1.2.3