From e7186af7fb2609584a8bfb3da3c6ae09da5a5224 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Google)" Date: Thu, 13 Jul 2023 09:26:05 -0400 Subject: tracing: Add back FORTIFY_SOURCE logic to kernel_stack event structure For backward compatibility, older tooling expects to see the kernel_stack event with a "caller" field that is a fixed size array of 8 addresses. The code now supports more than 8 with an added "size" field that states the real number of entries. But the "caller" field still just looks like a fixed size to user space. Since the tracing macros that create the user space format files also creates the structures that those files represent, the kernel_stack event structure had its "caller" field a fixed size of 8, but in reality, when it is allocated on the ring buffer, it can hold more if the stack trace is bigger that 8 functions. The copying of these entries was simply done with a memcpy(): size = nr_entries * sizeof(unsigned long); memcpy(entry->caller, fstack->calls, size); The FORTIFY_SOURCE logic noticed at runtime that when the nr_entries was larger than 8, that the memcpy() was writing more than what the structure stated it can hold and it complained about it. This is because the FORTIFY_SOURCE code is unaware that the amount allocated is actually enough to hold the size. It does not expect that a fixed size field will hold more than the fixed size. This was originally solved by hiding the caller assignment with some pointer arithmetic. ptr = ring_buffer_data(); entry = ptr; ptr += offsetof(typeof(*entry), caller); memcpy(ptr, fstack->calls, size); But it is considered bad form to hide from kernel hardening. Instead, make it work nicely with FORTIFY_SOURCE by adding a new __stack_array() macro that is specific for this one special use case. The macro will take 4 arguments: type, item, len, field (whereas the __array() macro takes just the first three). This macro will act just like the __array() macro when creating the code to deal with the format file that is exposed to user space. But for the kernel, it will turn the caller field into: type item[] __counted_by(field); or for this instance: unsigned long caller[] __counted_by(size); Now the kernel code can expose the assignment of the caller to the FORTIFY_SOURCE and everyone is happy! Link: https://lore.kernel.org/linux-trace-kernel/20230712105235.5fc441aa@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20230713092605.2ddb9788@rorschach.local.home Cc: Masami Hiramatsu Cc: Mark Rutland Cc: Sven Schnelle Suggested-by: Kees Cook Signed-off-by: Steven Rostedt (Google) Reviewed-by: Kees Cook --- kernel/trace/trace.h | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'kernel/trace/trace.h') diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index e1edc2197fc8..ba7ababb8308 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -77,6 +77,16 @@ enum trace_type { #undef __array #define __array(type, item, size) type item[size]; +/* + * For backward compatibility, older user space expects to see the + * kernel_stack event with a fixed size caller field. But today the fix + * size is ignored by the kernel, and the real structure is dynamic. + * Expose to user space: "unsigned long caller[8];" but the real structure + * will be "unsigned long caller[] __counted_by(size)" + */ +#undef __stack_array +#define __stack_array(type, item, size, field) type item[] __counted_by(field); + #undef __array_desc #define __array_desc(type, container, item, size) -- cgit v1.2.3 From 27152bceea1df27ffebb12ac9cd9adbf2c4c3f35 Mon Sep 17 00:00:00 2001 From: Ajay Kaher Date: Fri, 28 Jul 2023 23:50:51 +0530 Subject: eventfs: Move tracing/events to eventfs Up until now, /sys/kernel/tracing/events was no different than any other part of tracefs. The files and directories within the events directory was created when the tracefs was mounted, and also created for the instances in /sys/kernel/tracing/instances//events. Most of these files and directories will never be referenced. Since there are thousands of these files and directories they spend their time wasting precious memory resources. Move the "events" directory to the new eventfs. The eventfs will take the meta data of the events that they represent and store that. When the files in the events directory are referenced, the dentry and inodes to represent them are then created. When the files are no longer referenced, they are freed. This saves the precious memory resources that were wasted on these seldom referenced dentries and inodes. Running the following: ~# cat /proc/meminfo /proc/slabinfo > before.out ~# mkdir /sys/kernel/tracing/instances/foo ~# cat /proc/meminfo /proc/slabinfo > after.out to test the changes produces the following deltas: Before this change: Before after deltas for meminfo: MemFree: -32260 MemAvailable: -21496 KReclaimable: 21528 Slab: 22440 SReclaimable: 21528 SUnreclaim: 912 VmallocUsed: 16 Before after deltas for slabinfo: : [ * = ] tracefs_inode_cache: 14472 [* 1184 = 17134848] buffer_head: 24 [* 168 = 4032] hmem_inode_cache: 28 [* 1480 = 41440] dentry: 14450 [* 312 = 4508400] lsm_inode_cache: 14453 [* 32 = 462496] vma_lock: 11 [* 152 = 1672] vm_area_struct: 2 [* 184 = 368] trace_event_file: 1748 [* 88 = 153824] kmalloc-256: 1072 [* 256 = 274432] kmalloc-64: 2842 [* 64 = 181888] Total slab additions in size: 22,763,400 bytes With this change: Before after deltas for meminfo: MemFree: -12600 MemAvailable: -12580 Cached: 24 Active: 12 Inactive: 68 Inactive(anon): 48 Active(file): 12 Inactive(file): 20 Dirty: -4 AnonPages: 68 KReclaimable: 12 Slab: 1856 SReclaimable: 12 SUnreclaim: 1844 KernelStack: 16 PageTables: 36 VmallocUsed: 16 Before after deltas for slabinfo: : [ * = ] tracefs_inode_cache: 108 [* 1184 = 127872] buffer_head: 24 [* 168 = 4032] hmem_inode_cache: 18 [* 1480 = 26640] dentry: 127 [* 312 = 39624] lsm_inode_cache: 152 [* 32 = 4864] vma_lock: 67 [* 152 = 10184] vm_area_struct: -12 [* 184 = -2208] trace_event_file: 1764 [* 96 = 169344] kmalloc-96: 14322 [* 96 = 1374912] kmalloc-64: 2814 [* 64 = 180096] kmalloc-32: 1103 [* 32 = 35296] kmalloc-16: 2308 [* 16 = 36928] kmalloc-8: 12800 [* 8 = 102400] Total slab additions in size: 2,109,984 bytes Which is a savings of 20,653,416 bytes (20 MB) per tracing instance. Link: https://lkml.kernel.org/r/1690568452-46553-10-git-send-email-akaher@vmware.com Signed-off-by: Ajay Kaher Co-developed-by: Steven Rostedt (VMware) Signed-off-by: Steven Rostedt (VMware) Tested-by: Ching-lin Yu Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/inode.c | 18 ++++++++++++ include/linux/trace_events.h | 1 + kernel/trace/trace.h | 2 +- kernel/trace/trace_events.c | 65 ++++++++++++++++++++++---------------------- 4 files changed, 53 insertions(+), 33 deletions(-) (limited to 'kernel/trace/trace.h') diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index d9273066f25f..bb6de89eb446 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -374,6 +374,23 @@ static const struct super_operations tracefs_super_operations = { .show_options = tracefs_show_options, }; +static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode) +{ + struct tracefs_inode *ti; + + if (!dentry || !inode) + return; + + ti = get_tracefs(inode); + if (ti && ti->flags & TRACEFS_EVENT_INODE) + eventfs_set_ef_status_free(dentry); + iput(inode); +} + +static const struct dentry_operations tracefs_dentry_operations = { + .d_iput = tracefs_dentry_iput, +}; + static int trace_fill_super(struct super_block *sb, void *data, int silent) { static const struct tree_descr trace_files[] = {{""}}; @@ -396,6 +413,7 @@ static int trace_fill_super(struct super_block *sb, void *data, int silent) goto fail; sb->s_op = &tracefs_super_operations; + sb->s_d_op = &tracefs_dentry_operations; tracefs_apply_options(sb, false); diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index 3930e676436c..c17623c78029 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -638,6 +638,7 @@ struct trace_event_file { struct list_head list; struct trace_event_call *event_call; struct event_filter __rcu *filter; + struct eventfs_file *ef; struct dentry *dir; struct trace_array *tr; struct trace_subsystem_dir *system; diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index ba7ababb8308..5b1f9e24764a 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -1334,7 +1334,7 @@ struct trace_subsystem_dir { struct list_head list; struct event_subsystem *subsystem; struct trace_array *tr; - struct dentry *entry; + struct eventfs_file *ef; int ref_count; int nr_events; }; diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index 3ecc41f6acd9..ed367d713be0 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -984,7 +984,7 @@ static void remove_subsystem(struct trace_subsystem_dir *dir) return; if (!--dir->nr_events) { - tracefs_remove(dir->entry); + eventfs_remove(dir->ef); list_del(&dir->list); __put_system_dir(dir); } @@ -1005,7 +1005,7 @@ static void remove_event_file_dir(struct trace_event_file *file) tracefs_remove(dir); } - + eventfs_remove(file->ef); list_del(&file->list); remove_subsystem(file->system); free_event_filter(file->filter); @@ -2291,13 +2291,13 @@ create_new_subsystem(const char *name) return NULL; } -static struct dentry * +static struct eventfs_file * event_subsystem_dir(struct trace_array *tr, const char *name, struct trace_event_file *file, struct dentry *parent) { struct event_subsystem *system, *iter; struct trace_subsystem_dir *dir; - struct dentry *entry; + int res; /* First see if we did not already create this dir */ list_for_each_entry(dir, &tr->systems, list) { @@ -2305,7 +2305,7 @@ event_subsystem_dir(struct trace_array *tr, const char *name, if (strcmp(system->name, name) == 0) { dir->nr_events++; file->system = dir; - return dir->entry; + return dir->ef; } } @@ -2329,8 +2329,8 @@ event_subsystem_dir(struct trace_array *tr, const char *name, } else __get_system(system); - dir->entry = tracefs_create_dir(name, parent); - if (!dir->entry) { + dir->ef = eventfs_add_subsystem_dir(name, parent); + if (IS_ERR(dir->ef)) { pr_warn("Failed to create system directory %s\n", name); __put_system(system); goto out_free; @@ -2345,22 +2345,22 @@ event_subsystem_dir(struct trace_array *tr, const char *name, /* the ftrace system is special, do not create enable or filter files */ if (strcmp(name, "ftrace") != 0) { - entry = tracefs_create_file("filter", TRACE_MODE_WRITE, - dir->entry, dir, + res = eventfs_add_file("filter", TRACE_MODE_WRITE, + dir->ef, dir, &ftrace_subsystem_filter_fops); - if (!entry) { + if (res) { kfree(system->filter); system->filter = NULL; pr_warn("Could not create tracefs '%s/filter' entry\n", name); } - trace_create_file("enable", TRACE_MODE_WRITE, dir->entry, dir, + eventfs_add_file("enable", TRACE_MODE_WRITE, dir->ef, dir, &ftrace_system_enable_fops); } list_add(&dir->list, &tr->systems); - return dir->entry; + return dir->ef; out_free: kfree(dir); @@ -2413,8 +2413,8 @@ static int event_create_dir(struct dentry *parent, struct trace_event_file *file) { struct trace_event_call *call = file->event_call; + struct eventfs_file *ef_subsystem = NULL; struct trace_array *tr = file->tr; - struct dentry *d_events; const char *name; int ret; @@ -2426,24 +2426,24 @@ event_create_dir(struct dentry *parent, struct trace_event_file *file) if (WARN_ON_ONCE(strcmp(call->class->system, TRACE_SYSTEM) == 0)) return -ENODEV; - d_events = event_subsystem_dir(tr, call->class->system, file, parent); - if (!d_events) + ef_subsystem = event_subsystem_dir(tr, call->class->system, file, parent); + if (!ef_subsystem) return -ENOMEM; name = trace_event_name(call); - file->dir = tracefs_create_dir(name, d_events); - if (!file->dir) { + file->ef = eventfs_add_dir(name, ef_subsystem); + if (IS_ERR(file->ef)) { pr_warn("Could not create tracefs '%s' directory\n", name); return -1; } if (call->class->reg && !(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)) - trace_create_file("enable", TRACE_MODE_WRITE, file->dir, file, + eventfs_add_file("enable", TRACE_MODE_WRITE, file->ef, file, &ftrace_enable_fops); #ifdef CONFIG_PERF_EVENTS if (call->event.type && call->class->reg) - trace_create_file("id", TRACE_MODE_READ, file->dir, + eventfs_add_file("id", TRACE_MODE_READ, file->ef, (void *)(long)call->event.type, &ftrace_event_id_fops); #endif @@ -2459,27 +2459,27 @@ event_create_dir(struct dentry *parent, struct trace_event_file *file) * triggers or filters. */ if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)) { - trace_create_file("filter", TRACE_MODE_WRITE, file->dir, + eventfs_add_file("filter", TRACE_MODE_WRITE, file->ef, file, &ftrace_event_filter_fops); - trace_create_file("trigger", TRACE_MODE_WRITE, file->dir, + eventfs_add_file("trigger", TRACE_MODE_WRITE, file->ef, file, &event_trigger_fops); } #ifdef CONFIG_HIST_TRIGGERS - trace_create_file("hist", TRACE_MODE_READ, file->dir, file, + eventfs_add_file("hist", TRACE_MODE_READ, file->ef, file, &event_hist_fops); #endif #ifdef CONFIG_HIST_TRIGGERS_DEBUG - trace_create_file("hist_debug", TRACE_MODE_READ, file->dir, file, + eventfs_add_file("hist_debug", TRACE_MODE_READ, file->ef, file, &event_hist_debug_fops); #endif - trace_create_file("format", TRACE_MODE_READ, file->dir, call, + eventfs_add_file("format", TRACE_MODE_READ, file->ef, call, &ftrace_event_format_fops); #ifdef CONFIG_TRACE_EVENT_INJECT if (call->event.type && call->class->reg) - trace_create_file("inject", 0200, file->dir, file, + eventfs_add_file("inject", 0200, file->ef, file, &event_inject_fops); #endif @@ -3632,21 +3632,22 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr) { struct dentry *d_events; struct dentry *entry; + int error = 0; entry = trace_create_file("set_event", TRACE_MODE_WRITE, parent, tr, &ftrace_set_event_fops); if (!entry) return -ENOMEM; - d_events = tracefs_create_dir("events", parent); - if (!d_events) { + d_events = eventfs_create_events_dir("events", parent); + if (IS_ERR(d_events)) { pr_warn("Could not create tracefs 'events' directory\n"); return -ENOMEM; } - entry = trace_create_file("enable", TRACE_MODE_WRITE, d_events, + error = eventfs_add_events_file("enable", TRACE_MODE_WRITE, d_events, tr, &ftrace_tr_enable_fops); - if (!entry) + if (error) return -ENOMEM; /* There are not as crucial, just warn if they are not created */ @@ -3659,11 +3660,11 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr) &ftrace_set_event_notrace_pid_fops); /* ring buffer internal formats */ - trace_create_file("header_page", TRACE_MODE_READ, d_events, + eventfs_add_events_file("header_page", TRACE_MODE_READ, d_events, ring_buffer_print_page_header, &ftrace_show_header_fops); - trace_create_file("header_event", TRACE_MODE_READ, d_events, + eventfs_add_events_file("header_event", TRACE_MODE_READ, d_events, ring_buffer_print_entry_header, &ftrace_show_header_fops); @@ -3751,7 +3752,7 @@ int event_trace_del_tracer(struct trace_array *tr) down_write(&trace_event_sem); __trace_remove_event_dirs(tr); - tracefs_remove(tr->event_dir); + eventfs_remove_events_dir(tr->event_dir); up_write(&trace_event_sem); tr->event_dir = NULL; -- cgit v1.2.3 From efde97a175e89fceb36cd2c352700661183992bc Mon Sep 17 00:00:00 2001 From: Yue Haibing Date: Thu, 3 Aug 2023 22:40:28 +0800 Subject: tracing: Remove unused function declarations Commit 9457158bbc0e ("tracing: Fix reset of time stamps during trace_clock changes") left behind tracing_reset_current() declaration. Also commit 6954e415264e ("tracing: Place trace_pid_list logic into abstract functions") removed trace_free_pid_list() implementation but leave declaration. Link: https://lore.kernel.org/linux-trace-kernel/20230803144028.25492-1-yuehaibing@huawei.com Cc: Signed-off-by: Yue Haibing Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel/trace/trace.h') diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 5b1f9e24764a..b6e44a39b4ce 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -604,7 +604,6 @@ trace_buffer_iter(struct trace_iterator *iter, int cpu) int tracer_init(struct tracer *t, struct trace_array *tr); int tracing_is_enabled(void); void tracing_reset_online_cpus(struct array_buffer *buf); -void tracing_reset_current(int cpu); void tracing_reset_all_online_cpus(void); void tracing_reset_all_online_cpus_unlocked(void); int tracing_open_generic(struct inode *inode, struct file *filp); @@ -705,7 +704,6 @@ void trace_filter_add_remove_task(struct trace_pid_list *pid_list, void *trace_pid_next(struct trace_pid_list *pid_list, void *v, loff_t *pos); void *trace_pid_start(struct trace_pid_list *pid_list, loff_t *pos); int trace_pid_show(struct seq_file *m, void *v); -void trace_free_pid_list(struct trace_pid_list *pid_list); int trace_pid_write(struct trace_pid_list *filtered_pids, struct trace_pid_list **new_pid_list, const char __user *ubuf, size_t cnt); -- cgit v1.2.3