summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2016-10-06 21:48:41 +0300
committerLinus Torvalds <torvalds@linux-foundation.org>2016-10-06 21:48:41 +0300
commit95107b30be68953e3a4f1c3994c2233500502ccf (patch)
treeed1824ce842d9813f2ee8a5fa8683e0d6b13f662
parent541efb7632642cab55361178d73d544f025b593c (diff)
parenta0d0c6216afad4b2b1704a72bd76bea259e07655 (diff)
downloadlinux-95107b30be68953e3a4f1c3994c2233500502ccf.tar.xz
Merge tag 'trace-v4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt: "This release cycle is rather small. Just a few fixes to tracing. The big change is the addition of the hwlat tracer. It not only detects SMIs, but also other latency that's caused by the hardware. I have detected some latency from large boxes having bus contention" * tag 'trace-v4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Call traceoff trigger after event is recorded ftrace/scripts: Add helper script to bisect function tracing problem functions tracing: Have max_latency be defined for HWLAT_TRACER as well tracing: Add NMI tracing in hwlat detector tracing: Have hwlat trace migrate across tracing_cpumask CPUs tracing: Add documentation for hwlat_detector tracer tracing: Added hardware latency tracer ftrace: Access ret_stack->subtime only in the function profiler function_graph: Handle TRACE_BPUTS in print_graph_comment tracing/uprobe: Drop isdigit() check in create_trace_uprobe
-rw-r--r--Documentation/trace/ftrace.txt10
-rw-r--r--Documentation/trace/hwlat_detector.txt79
-rw-r--r--arch/sh/kernel/ftrace.c4
-rw-r--r--include/linux/ftrace.h2
-rw-r--r--include/linux/ftrace_irq.h31
-rw-r--r--kernel/trace/Kconfig35
-rw-r--r--kernel/trace/Makefile1
-rw-r--r--kernel/trace/ftrace.c6
-rw-r--r--kernel/trace/trace.c8
-rw-r--r--kernel/trace/trace.h5
-rw-r--r--kernel/trace/trace_entries.h27
-rw-r--r--kernel/trace/trace_events_trigger.c1
-rw-r--r--kernel/trace/trace_functions_graph.c6
-rw-r--r--kernel/trace/trace_hwlat.c633
-rw-r--r--kernel/trace/trace_output.c66
-rw-r--r--kernel/trace/trace_uprobe.c4
-rwxr-xr-xscripts/tracing/ftrace-bisect.sh115
17 files changed, 1013 insertions, 20 deletions
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index a6b3705e62a6..185c39fea2a0 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -858,11 +858,11 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
When enabled, it will account time the task has been
scheduled out as part of the function call.
- graph-time - When running function graph tracer, to include the
- time to call nested functions. When this is not set,
- the time reported for the function will only include
- the time the function itself executed for, not the time
- for functions that it called.
+ graph-time - When running function profiler with function graph tracer,
+ to include the time to call nested functions. When this is
+ not set, the time reported for the function will only
+ include the time the function itself executed for, not the
+ time for functions that it called.
record-cmd - When any event or tracer is enabled, a hook is enabled
in the sched_switch trace point to fill comm cache
diff --git a/Documentation/trace/hwlat_detector.txt b/Documentation/trace/hwlat_detector.txt
new file mode 100644
index 000000000000..3207717a0d1a
--- /dev/null
+++ b/Documentation/trace/hwlat_detector.txt
@@ -0,0 +1,79 @@
+Introduction:
+-------------
+
+The tracer hwlat_detector is a special purpose tracer that is used to
+detect large system latencies induced by the behavior of certain underlying
+hardware or firmware, independent of Linux itself. The code was developed
+originally to detect SMIs (System Management Interrupts) on x86 systems,
+however there is nothing x86 specific about this patchset. It was
+originally written for use by the "RT" patch since the Real Time
+kernel is highly latency sensitive.
+
+SMIs are not serviced by the Linux kernel, which means that it does not
+even know that they are occuring. SMIs are instead set up by BIOS code
+and are serviced by BIOS code, usually for "critical" events such as
+management of thermal sensors and fans. Sometimes though, SMIs are used for
+other tasks and those tasks can spend an inordinate amount of time in the
+handler (sometimes measured in milliseconds). Obviously this is a problem if
+you are trying to keep event service latencies down in the microsecond range.
+
+The hardware latency detector works by hogging one of the cpus for configurable
+amounts of time (with interrupts disabled), polling the CPU Time Stamp Counter
+for some period, then looking for gaps in the TSC data. Any gap indicates a
+time when the polling was interrupted and since the interrupts are disabled,
+the only thing that could do that would be an SMI or other hardware hiccup
+(or an NMI, but those can be tracked).
+
+Note that the hwlat detector should *NEVER* be used in a production environment.
+It is intended to be run manually to determine if the hardware platform has a
+problem with long system firmware service routines.
+
+Usage:
+------
+
+Write the ASCII text "hwlat" into the current_tracer file of the tracing system
+(mounted at /sys/kernel/tracing or /sys/kernel/tracing). It is possible to
+redefine the threshold in microseconds (us) above which latency spikes will
+be taken into account.
+
+Example:
+
+ # echo hwlat > /sys/kernel/tracing/current_tracer
+ # echo 100 > /sys/kernel/tracing/tracing_thresh
+
+The /sys/kernel/tracing/hwlat_detector interface contains the following files:
+
+width - time period to sample with CPUs held (usecs)
+ must be less than the total window size (enforced)
+window - total period of sampling, width being inside (usecs)
+
+By default the width is set to 500,000 and window to 1,000,000, meaning that
+for every 1,000,000 usecs (1s) the hwlat detector will spin for 500,000 usecs
+(0.5s). If tracing_thresh contains zero when hwlat tracer is enabled, it will
+change to a default of 10 usecs. If any latencies that exceed the threshold is
+observed then the data will be written to the tracing ring buffer.
+
+The minimum sleep time between periods is 1 millisecond. Even if width
+is less than 1 millisecond apart from window, to allow the system to not
+be totally starved.
+
+If tracing_thresh was zero when hwlat detector was started, it will be set
+back to zero if another tracer is loaded. Note, the last value in
+tracing_thresh that hwlat detector had will be saved and this value will
+be restored in tracing_thresh if it is still zero when hwlat detector is
+started again.
+
+The following tracing directory files are used by the hwlat_detector:
+
+in /sys/kernel/tracing:
+
+ tracing_threshold - minimum latency value to be considered (usecs)
+ tracing_max_latency - maximum hardware latency actually observed (usecs)
+ tracing_cpumask - the CPUs to move the hwlat thread across
+ hwlat_detector/width - specified amount of time to spin within window (usecs)
+ hwlat_detector/window - amount of time between (width) runs (usecs)
+
+The hwlat detector's kernel thread will migrate across each CPU specified in
+tracing_cpumask between each window. To limit the migration, either modify
+tracing_cpumask, or modify the hwlat kernel thread (named [hwlatd]) CPU
+affinity directly, and the migration will stop.
diff --git a/arch/sh/kernel/ftrace.c b/arch/sh/kernel/ftrace.c
index 95eccd49672f..53783978162e 100644
--- a/arch/sh/kernel/ftrace.c
+++ b/arch/sh/kernel/ftrace.c
@@ -139,7 +139,7 @@ static void ftrace_mod_code(void)
clear_mod_flag();
}
-void ftrace_nmi_enter(void)
+void arch_ftrace_nmi_enter(void)
{
if (atomic_inc_return(&nmi_running) & MOD_CODE_WRITE_FLAG) {
smp_rmb();
@@ -150,7 +150,7 @@ void ftrace_nmi_enter(void)
smp_mb();
}
-void ftrace_nmi_exit(void)
+void arch_ftrace_nmi_exit(void)
{
/* Finish all executions before clearing nmi_running */
smp_mb();
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 6f93ac46e7f0..b3d34d3e0e7e 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -794,7 +794,9 @@ struct ftrace_ret_stack {
unsigned long ret;
unsigned long func;
unsigned long long calltime;
+#ifdef CONFIG_FUNCTION_PROFILER
unsigned long long subtime;
+#endif
#ifdef HAVE_FUNCTION_GRAPH_FP_TEST
unsigned long fp;
#endif
diff --git a/include/linux/ftrace_irq.h b/include/linux/ftrace_irq.h
index dca7bf8cffe2..4ec2c9b205f2 100644
--- a/include/linux/ftrace_irq.h
+++ b/include/linux/ftrace_irq.h
@@ -3,11 +3,34 @@
#ifdef CONFIG_FTRACE_NMI_ENTER
-extern void ftrace_nmi_enter(void);
-extern void ftrace_nmi_exit(void);
+extern void arch_ftrace_nmi_enter(void);
+extern void arch_ftrace_nmi_exit(void);
#else
-static inline void ftrace_nmi_enter(void) { }
-static inline void ftrace_nmi_exit(void) { }
+static inline void arch_ftrace_nmi_enter(void) { }
+static inline void arch_ftrace_nmi_exit(void) { }
#endif
+#ifdef CONFIG_HWLAT_TRACER
+extern bool trace_hwlat_callback_enabled;
+extern void trace_hwlat_callback(bool enter);
+#endif
+
+static inline void ftrace_nmi_enter(void)
+{
+#ifdef CONFIG_HWLAT_TRACER
+ if (trace_hwlat_callback_enabled)
+ trace_hwlat_callback(true);
+#endif
+ arch_ftrace_nmi_enter();
+}
+
+static inline void ftrace_nmi_exit(void)
+{
+ arch_ftrace_nmi_exit();
+#ifdef CONFIG_HWLAT_TRACER
+ if (trace_hwlat_callback_enabled)
+ trace_hwlat_callback(false);
+#endif
+}
+
#endif /* _LINUX_FTRACE_IRQ_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index ba3326785ca4..2a96b063d659 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -216,6 +216,41 @@ config SCHED_TRACER
This tracer tracks the latency of the highest priority task
to be scheduled in, starting from the point it has woken up.
+config HWLAT_TRACER
+ bool "Tracer to detect hardware latencies (like SMIs)"
+ select GENERIC_TRACER
+ help
+ This tracer, when enabled will create one or more kernel threads,
+ depening on what the cpumask file is set to, which each thread
+ spinning in a loop looking for interruptions caused by
+ something other than the kernel. For example, if a
+ System Management Interrupt (SMI) takes a noticeable amount of
+ time, this tracer will detect it. This is useful for testing
+ if a system is reliable for Real Time tasks.
+
+ Some files are created in the tracing directory when this
+ is enabled:
+
+ hwlat_detector/width - time in usecs for how long to spin for
+ hwlat_detector/window - time in usecs between the start of each
+ iteration
+
+ A kernel thread is created that will spin with interrupts disabled
+ for "width" microseconds in every "widow" cycle. It will not spin
+ for "window - width" microseconds, where the system can
+ continue to operate.
+
+ The output will appear in the trace and trace_pipe files.
+
+ When the tracer is not running, it has no affect on the system,
+ but when it is running, it can cause the system to be
+ periodically non responsive. Do not run this tracer on a
+ production system.
+
+ To enable this tracer, echo in "hwlat" into the current_tracer
+ file. Every time a latency is greater than tracing_thresh, it will
+ be recorded into the ring buffer.
+
config ENABLE_DEFAULT_TRACERS
bool "Trace process context switches and events"
depends on !GENERIC_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index d0a1617b52b4..992ab9d99f35 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -41,6 +41,7 @@ obj-$(CONFIG_FUNCTION_TRACER) += trace_functions.o
obj-$(CONFIG_IRQSOFF_TRACER) += trace_irqsoff.o
obj-$(CONFIG_PREEMPT_TRACER) += trace_irqsoff.o
obj-$(CONFIG_SCHED_TRACER) += trace_sched_wakeup.o
+obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 84752c8e28b5..2050a7652a86 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -872,7 +872,13 @@ function_profile_call(unsigned long ip, unsigned long parent_ip,
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
static int profile_graph_entry(struct ftrace_graph_ent *trace)
{
+ int index = trace->depth;
+
function_profile_call(trace->func, 0, NULL, NULL);
+
+ if (index >= 0 && index < FTRACE_RETFUNC_DEPTH)
+ current->ret_stack[index].subtime = 0;
+
return 1;
}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 37824d98ae71..8696ce6bf2f6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1047,7 +1047,7 @@ void disable_trace_on_warning(void)
*
* Shows real state of the ring buffer if it is enabled or not.
*/
-static int tracer_tracing_is_on(struct trace_array *tr)
+int tracer_tracing_is_on(struct trace_array *tr)
{
if (tr->trace_buffer.buffer)
return ring_buffer_record_is_on(tr->trace_buffer.buffer);
@@ -4969,7 +4969,7 @@ out:
return ret;
}
-#ifdef CONFIG_TRACER_MAX_TRACE
+#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
static ssize_t
tracing_max_lat_read(struct file *filp, char __user *ubuf,
@@ -5892,7 +5892,7 @@ static const struct file_operations tracing_thresh_fops = {
.llseek = generic_file_llseek,
};
-#ifdef CONFIG_TRACER_MAX_TRACE
+#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
static const struct file_operations tracing_max_lat_fops = {
.open = tracing_open_generic,
.read = tracing_max_lat_read,
@@ -7222,7 +7222,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
create_trace_options_dir(tr);
-#ifdef CONFIG_TRACER_MAX_TRACE
+#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
trace_create_file("tracing_max_latency", 0644, d_tracer,
&tr->max_latency, &tracing_max_lat_fops);
#endif
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index f783df416726..fd24b1f9ac43 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -38,6 +38,7 @@ enum trace_type {
TRACE_USER_STACK,
TRACE_BLK,
TRACE_BPUTS,
+ TRACE_HWLAT,
__TRACE_LAST_TYPE,
};
@@ -213,6 +214,8 @@ struct trace_array {
*/
struct trace_buffer max_buffer;
bool allocated_snapshot;
+#endif
+#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
unsigned long max_latency;
#endif
struct trace_pid_list __rcu *filtered_pids;
@@ -326,6 +329,7 @@ extern void __ftrace_bad_type(void);
IF_ASSIGN(var, ent, struct print_entry, TRACE_PRINT); \
IF_ASSIGN(var, ent, struct bprint_entry, TRACE_BPRINT); \
IF_ASSIGN(var, ent, struct bputs_entry, TRACE_BPUTS); \
+ IF_ASSIGN(var, ent, struct hwlat_entry, TRACE_HWLAT); \
IF_ASSIGN(var, ent, struct trace_mmiotrace_rw, \
TRACE_MMIO_RW); \
IF_ASSIGN(var, ent, struct trace_mmiotrace_map, \
@@ -571,6 +575,7 @@ void tracing_reset_current(int cpu);
void tracing_reset_all_online_cpus(void);
int tracing_open_generic(struct inode *inode, struct file *filp);
bool tracing_is_disabled(void);
+int tracer_tracing_is_on(struct trace_array *tr);
struct dentry *trace_create_file(const char *name,
umode_t mode,
struct dentry *parent,
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 5c30efcda5e6..d1cc37e78f99 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -322,3 +322,30 @@ FTRACE_ENTRY(branch, trace_branch,
FILTER_OTHER
);
+
+FTRACE_ENTRY(hwlat, hwlat_entry,
+
+ TRACE_HWLAT,
+
+ F_STRUCT(
+ __field( u64, duration )
+ __field( u64, outer_duration )
+ __field( u64, nmi_total_ts )
+ __field_struct( struct timespec, timestamp )
+ __field_desc( long, timestamp, tv_sec )
+ __field_desc( long, timestamp, tv_nsec )
+ __field( unsigned int, nmi_count )
+ __field( unsigned int, seqnum )
+ ),
+
+ F_printk("cnt:%u\tts:%010lu.%010lu\tinner:%llu\touter:%llunmi-ts:%llu\tnmi-count:%u\n",
+ __entry->seqnum,
+ __entry->tv_sec,
+ __entry->tv_nsec,
+ __entry->duration,
+ __entry->outer_duration,
+ __entry->nmi_total_ts,
+ __entry->nmi_count),
+
+ FILTER_OTHER
+);
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index a975571cde24..6721a1e89f39 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -1028,6 +1028,7 @@ static struct event_command trigger_traceon_cmd = {
static struct event_command trigger_traceoff_cmd = {
.name = "traceoff",
.trigger_type = ETT_TRACE_ONOFF,
+ .flags = EVENT_CMD_FL_POST_TRIGGER,
.func = event_trigger_callback,
.reg = register_trigger,
.unreg = unregister_trigger,
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 0cbe38a844fa..4e480e870474 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -170,7 +170,6 @@ ftrace_push_return_trace(unsigned long ret, unsigned long func, int *depth,
current->ret_stack[index].ret = ret;
current->ret_stack[index].func = func;
current->ret_stack[index].calltime = calltime;
- current->ret_stack[index].subtime = 0;
#ifdef HAVE_FUNCTION_GRAPH_FP_TEST
current->ret_stack[index].fp = frame_pointer;
#endif
@@ -1183,6 +1182,11 @@ print_graph_comment(struct trace_seq *s, struct trace_entry *ent,
trace_seq_puts(s, "/* ");
switch (iter->ent->type) {
+ case TRACE_BPUTS:
+ ret = trace_print_bputs_msg_only(iter);
+ if (ret != TRACE_TYPE_HANDLED)
+ return ret;
+ break;
case TRACE_BPRINT:
ret = trace_print_bprintk_msg_only(iter);
if (ret != TRACE_TYPE_HANDLED)
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
new file mode 100644
index 000000000000..b97286c48735
--- /dev/null
+++ b/kernel/trace/trace_hwlat.c
@@ -0,0 +1,633 @@
+/*
+ * trace_hwlatdetect.c - A simple Hardware Latency detector.
+ *
+ * Use this tracer to detect large system latencies induced by the behavior of
+ * certain underlying system hardware or firmware, independent of Linux itself.
+ * The code was developed originally to detect the presence of SMIs on Intel
+ * and AMD systems, although there is no dependency upon x86 herein.
+ *
+ * The classical example usage of this tracer is in detecting the presence of
+ * SMIs or System Management Interrupts on Intel and AMD systems. An SMI is a
+ * somewhat special form of hardware interrupt spawned from earlier CPU debug
+ * modes in which the (BIOS/EFI/etc.) firmware arranges for the South Bridge
+ * LPC (or other device) to generate a special interrupt under certain
+ * circumstances, for example, upon expiration of a special SMI timer device,
+ * due to certain external thermal readings, on certain I/O address accesses,
+ * and other situations. An SMI hits a special CPU pin, triggers a special
+ * SMI mode (complete with special memory map), and the OS is unaware.
+ *
+ * Although certain hardware-inducing latencies are necessary (for example,
+ * a modern system often requires an SMI handler for correct thermal control
+ * and remote management) they can wreak havoc upon any OS-level performance
+ * guarantees toward low-latency, especially when the OS is not even made
+ * aware of the presence of these interrupts. For this reason, we need a
+ * somewhat brute force mechanism to detect these interrupts. In this case,
+ * we do it by hogging all of the CPU(s) for configurable timer intervals,
+ * sampling the built-in CPU timer, looking for discontiguous readings.
+ *
+ * WARNING: This implementation necessarily introduces latencies. Therefore,
+ * you should NEVER use this tracer while running in a production
+ * environment requiring any kind of low-latency performance
+ * guarantee(s).
+ *
+ * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <jcm@redhat.com>
+ * Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. <srostedt@redhat.com>
+ *
+ * Includes useful feedback from Clark Williams <clark@redhat.com>
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+#include <linux/kthread.h>
+#include <linux/tracefs.h>
+#include <linux/uaccess.h>
+#include <linux/cpumask.h>
+#include <linux/delay.h>
+#include "trace.h"
+
+static struct trace_array *hwlat_trace;
+
+#define U64STR_SIZE 22 /* 20 digits max */
+
+#define BANNER "hwlat_detector: "
+#define DEFAULT_SAMPLE_WINDOW 1000000 /* 1s */
+#define DEFAULT_SAMPLE_WIDTH 500000 /* 0.5s */
+#define DEFAULT_LAT_THRESHOLD 10 /* 10us */
+
+/* sampling thread*/
+static struct task_struct *hwlat_kthread;
+
+static struct dentry *hwlat_sample_width; /* sample width us */
+static struct dentry *hwlat_sample_window; /* sample window us */
+
+/* Save the previous tracing_thresh value */
+static unsigned long save_tracing_thresh;
+
+/* NMI timestamp counters */
+static u64 nmi_ts_start;
+static u64 nmi_total_ts;
+static int nmi_count;
+static int nmi_cpu;
+
+/* Tells NMIs to call back to the hwlat tracer to record timestamps */
+bool trace_hwlat_callback_enabled;
+
+/* If the user changed threshold, remember it */
+static u64 last_tracing_thresh = DEFAULT_LAT_THRESHOLD * NSEC_PER_USEC;
+
+/* Individual latency samples are stored here when detected. */
+struct hwlat_sample {
+ u64 seqnum; /* unique sequence */
+ u64 duration; /* delta */
+ u64 outer_duration; /* delta (outer loop) */
+ u64 nmi_total_ts; /* Total time spent in NMIs */
+ struct timespec timestamp; /* wall time */
+ int nmi_count; /* # NMIs during this sample */
+};
+
+/* keep the global state somewhere. */
+static struct hwlat_data {
+
+ struct mutex lock; /* protect changes */
+
+ u64 count; /* total since reset */
+
+ u64 sample_window; /* total sampling window (on+off) */
+ u64 sample_width; /* active sampling portion of window */
+
+} hwlat_data = {
+ .sample_window = DEFAULT_SAMPLE_WINDOW,
+ .sample_width = DEFAULT_SAMPLE_WIDTH,
+};
+
+static void trace_hwlat_sample(struct hwlat_sample *sample)
+{
+ struct trace_array *tr = hwlat_trace;
+ struct trace_event_call *call = &event_hwlat;
+ struct ring_buffer *buffer = tr->trace_buffer.buffer;
+ struct ring_buffer_event *event;
+ struct hwlat_entry *entry;
+ unsigned long flags;
+ int pc;
+
+ pc = preempt_count();
+ local_save_flags(flags);
+
+ event = trace_buffer_lock_reserve(buffer, TRACE_HWLAT, sizeof(*entry),
+ flags, pc);
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ entry->seqnum = sample->seqnum;
+ entry->duration = sample->duration;
+ entry->outer_duration = sample->outer_duration;
+ entry->timestamp = sample->timestamp;
+ entry->nmi_total_ts = sample->nmi_total_ts;
+ entry->nmi_count = sample->nmi_count;
+
+ if (!call_filter_check_discard(call, entry, buffer, event))
+ __buffer_unlock_commit(buffer, event);
+}
+
+/* Macros to encapsulate the time capturing infrastructure */
+#define time_type u64
+#define time_get() trace_clock_local()
+#define time_to_us(x) div_u64(x, 1000)
+#define time_sub(a, b) ((a) - (b))
+#define init_time(a, b) (a = b)
+#define time_u64(a) a
+
+void trace_hwlat_callback(bool enter)
+{
+ if (smp_processor_id() != nmi_cpu)
+ return;
+
+ /*
+ * Currently trace_clock_local() calls sched_clock() and the
+ * generic version is not NMI safe.
+ */
+ if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
+ if (enter)
+ nmi_ts_start = time_get();
+ else
+ nmi_total_ts = time_get() - nmi_ts_start;
+ }
+
+ if (enter)
+ nmi_count++;
+}
+
+/**
+ * get_sample - sample the CPU TSC and look for likely hardware latencies
+ *
+ * Used to repeatedly capture the CPU TSC (or similar), looking for potential
+ * hardware-induced latency. Called with interrupts disabled and with
+ * hwlat_data.lock held.
+ */
+static int get_sample(void)
+{
+ struct trace_array *tr = hwlat_trace;
+ time_type start, t1, t2, last_t2;
+ s64 diff, total, last_total = 0;
+ u64 sample = 0;
+ u64 thresh = tracing_thresh;
+ u64 outer_sample = 0;
+ int ret = -1;
+
+ do_div(thresh, NSEC_PER_USEC); /* modifies interval value */
+
+ nmi_cpu = smp_processor_id();
+ nmi_total_ts = 0;
+ nmi_count = 0;
+ /* Make sure NMIs see this first */
+ barrier();
+
+ trace_hwlat_callback_enabled = true;
+
+ init_time(last_t2, 0);
+ start = time_get(); /* start timestamp */
+
+ do {
+
+ t1 = time_get(); /* we'll look for a discontinuity */
+ t2 = time_get();
+
+ if (time_u64(last_t2)) {
+ /* Check the delta from outer loop (t2 to next t1) */
+ diff = time_to_us(time_sub(t1, last_t2));
+ /* This shouldn't happen */
+ if (diff < 0) {
+ pr_err(BANNER "time running backwards\n");
+ goto out;
+ }
+ if (diff > outer_sample)
+ outer_sample = diff;
+ }
+ last_t2 = t2;
+
+ total = time_to_us(time_sub(t2, start)); /* sample width */
+
+ /* Check for possible overflows */
+ if (total < last_total) {
+ pr_err("Time total overflowed\n");
+ break;
+ }
+ last_total = total;
+
+ /* This checks the inner loop (t1 to t2) */
+ diff = time_to_us(time_sub(t2, t1)); /* current diff */
+
+ /* This shouldn't happen */
+ if (diff < 0) {
+ pr_err(BANNER "time running backwards\n");
+ goto out;
+ }
+
+ if (diff > sample)
+ sample = diff; /* only want highest value */
+
+ } while (total <= hwlat_data.sample_width);
+
+ barrier(); /* finish the above in the view for NMIs */
+ trace_hwlat_callback_enabled = false;
+ barrier(); /* Make sure nmi_total_ts is no longer updated */
+
+ ret = 0;
+
+ /* If we exceed the threshold value, we have found a hardware latency */
+ if (sample > thresh || outer_sample > thresh) {
+ struct hwlat_sample s;
+
+ ret = 1;
+
+ /* We read in microseconds */
+ if (nmi_total_ts)
+ do_div(nmi_total_ts, NSEC_PER_USEC);
+
+ hwlat_data.count++;
+ s.seqnum = hwlat_data.count;
+ s.duration = sample;
+ s.outer_duration = outer_sample;
+ s.timestamp = CURRENT_TIME;
+ s.nmi_total_ts = nmi_total_ts;
+ s.nmi_count = nmi_count;
+ trace_hwlat_sample(&s);
+
+ /* Keep a running maximum ever recorded hardware latency */
+ if (sample > tr->max_latency)
+ tr->max_latency = sample;
+ }
+
+out:
+ return ret;
+}
+
+static struct cpumask save_cpumask;
+static bool disable_migrate;
+
+static void move_to_next_cpu(void)
+{
+ static struct cpumask *current_mask;
+ int next_cpu;
+
+ if (disable_migrate)
+ return;
+
+ /* Just pick the first CPU on first iteration */
+ if (!current_mask) {
+ current_mask = &save_cpumask;
+ get_online_cpus();
+ cpumask_and(current_mask, cpu_online_mask, tracing_buffer_mask);
+ put_online_cpus();
+ next_cpu = cpumask_first(current_mask);
+ goto set_affinity;
+ }
+
+ /*
+ * If for some reason the user modifies the CPU affinity
+ * of this thread, than stop migrating for the duration
+ * of the current test.
+ */
+ if (!cpumask_equal(current_mask, &current->cpus_allowed))
+ goto disable;
+
+ get_online_cpus();
+ cpumask_and(current_mask, cpu_online_mask, tracing_buffer_mask);
+ next_cpu = cpumask_next(smp_processor_id(), current_mask);
+ put_online_cpus();
+
+ if (next_cpu >= nr_cpu_ids)
+ next_cpu = cpumask_first(current_mask);
+
+ set_affinity:
+ if (next_cpu >= nr_cpu_ids) /* Shouldn't happen! */
+ goto disable;
+
+ cpumask_clear(current_mask);
+ cpumask_set_cpu(next_cpu, current_mask);
+
+ sched_setaffinity(0, current_mask);
+ return;
+
+ disable:
+ disable_migrate = true;
+}
+
+/*
+ * kthread_fn - The CPU time sampling/hardware latency detection kernel thread
+ *
+ * Used to periodically sample the CPU TSC via a call to get_sample. We
+ * disable interrupts, which does (intentionally) introduce latency since we
+ * need to ensure nothing else might be running (and thus preempting).
+ * Obviously this should never be used in production environments.
+ *
+ * Currently this runs on which ever CPU it was scheduled on, but most
+ * real-world hardware latency situations occur across several CPUs,
+ * but we might later generalize this if we find there are any actualy
+ * systems with alternate SMI delivery or other hardware latencies.
+ */
+static int kthread_fn(void *data)
+{
+ u64 interval;
+
+ while (!kthread_should_stop()) {
+
+ move_to_next_cpu();
+
+ local_irq_disable();
+ get_sample();
+ local_irq_enable();
+
+ mutex_lock(&hwlat_data.lock);
+ interval = hwlat_data.sample_window - hwlat_data.sample_width;
+ mutex_unlock(&hwlat_data.lock);
+
+ do_div(interval, USEC_PER_MSEC); /* modifies interval value */
+
+ /* Always sleep for at least 1ms */
+ if (interval < 1)
+ interval = 1;
+
+ if (msleep_interruptible(interval))
+ break;
+ }
+
+ return 0;
+}
+
+/**
+ * start_kthread - Kick off the hardware latency sampling/detector kthread
+ *
+ * This starts the kernel thread that will sit and sample the CPU timestamp
+ * counter (TSC or similar) and look for potential hardware latencies.
+ */
+static int start_kthread(struct trace_array *tr)
+{
+ struct task_struct *kthread;
+
+ kthread = kthread_create(kthread_fn, NULL, "hwlatd");
+ if (IS_ERR(kthread)) {
+ pr_err(BANNER "could not start sampling thread\n");
+ return -ENOMEM;
+ }
+ hwlat_kthread = kthread;
+ wake_up_process(kthread);
+
+ return 0;
+}
+
+/**
+ * stop_kthread - Inform the hardware latency samping/detector kthread to stop
+ *
+ * This kicks the running hardware latency sampling/detector kernel thread and
+ * tells it to stop sampling now. Use this on unload and at system shutdown.
+ */
+static void stop_kthread(void)
+{
+ if (!hwlat_kthread)
+ return;
+ kthread_stop(hwlat_kthread);
+ hwlat_kthread = NULL;
+}
+
+/*
+ * hwlat_read - Wrapper read function for reading both window and width
+ * @filp: The active open file structure
+ * @ubuf: The userspace provided buffer to read value into
+ * @cnt: The maximum number of bytes to read
+ * @ppos: The current "file" position
+ *
+ * This function provides a generic read implementation for the global state
+ * "hwlat_data" structure filesystem entries.
+ */
+static ssize_t hwlat_read(struct file *filp, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ char buf[U64STR_SIZE];
+ u64 *entry = filp->private_data;
+ u64 val;
+ int len;
+
+ if (!entry)
+ return -EFAULT;
+
+ if (cnt > sizeof(buf))
+ cnt = sizeof(buf);
+
+ val = *entry;
+
+ len = snprintf(buf, sizeof(buf), "%llu\n", val);
+
+ return simple_read_from_buffer(ubuf, cnt, ppos, buf, len);
+}
+
+/**
+ * hwlat_width_write - Write function for "width" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @cnt: The maximum number of bytes to write to "file"
+ * @ppos: The current position in @file
+ *
+ * This function provides a write implementation for the "width" interface
+ * to the hardware latency detector. It can be used to configure
+ * for how many us of the total window us we will actively sample for any
+ * hardware-induced latency periods. Obviously, it is not possible to
+ * sample constantly and have the system respond to a sample reader, or,
+ * worse, without having the system appear to have gone out to lunch. It
+ * is enforced that width is less that the total window size.
+ */
+static ssize_t
+hwlat_width_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ u64 val;
+ int err;
+
+ err = kstrtoull_from_user(ubuf, cnt, 10, &val);
+ if (err)
+ return err;
+
+ mutex_lock(&hwlat_data.lock);
+ if (val < hwlat_data.sample_window)
+ hwlat_data.sample_width = val;
+ else
+ err = -EINVAL;
+ mutex_unlock(&hwlat_data.lock);
+
+ if (err)
+ return err;
+
+ return cnt;
+}
+
+/**
+ * hwlat_window_write - Write function for "window" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @cnt: The maximum number of bytes to write to "file"
+ * @ppos: The current position in @file
+ *
+ * This function provides a write implementation for the "window" interface
+ * to the hardware latency detetector. The window is the total time
+ * in us that will be considered one sample period. Conceptually, windows
+ * occur back-to-back and contain a sample width period during which
+ * actual sampling occurs. Can be used to write a new total window size. It
+ * is enfoced that any value written must be greater than the sample width
+ * size, or an error results.
+ */
+static ssize_t
+hwlat_window_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ u64 val;
+ int err;
+
+ err = kstrtoull_from_user(ubuf, cnt, 10, &val);
+ if (err)
+ return err;
+
+ mutex_lock(&hwlat_data.lock);
+ if (hwlat_data.sample_width < val)
+ hwlat_data.sample_window = val;
+ else
+ err = -EINVAL;
+ mutex_unlock(&hwlat_data.lock);
+
+ if (err)
+ return err;
+
+ return cnt;
+}
+
+static const struct file_operations width_fops = {
+ .open = tracing_open_generic,
+ .read = hwlat_read,
+ .write = hwlat_width_write,
+};
+
+static const struct file_operations window_fops = {
+ .open = tracing_open_generic,
+ .read = hwlat_read,
+ .write = hwlat_window_write,
+};
+
+/**
+ * init_tracefs - A function to initialize the tracefs interface files
+ *
+ * This function creates entries in tracefs for "hwlat_detector".
+ * It creates the hwlat_detector directory in the tracing directory,
+ * and within that directory is the count, width and window files to
+ * change and view those values.
+ */
+static int init_tracefs(void)
+{
+ struct dentry *d_tracer;
+ struct dentry *top_dir;
+
+ d_tracer = tracing_init_dentry();
+ if (IS_ERR(d_tracer))
+ return -ENOMEM;
+
+ top_dir = tracefs_create_dir("hwlat_detector", d_tracer);
+ if (!top_dir)
+ return -ENOMEM;
+
+ hwlat_sample_window = tracefs_create_file("window", 0640,
+ top_dir,
+ &hwlat_data.sample_window,
+ &window_fops);
+ if (!hwlat_sample_window)
+ goto err;
+
+ hwlat_sample_width = tracefs_create_file("width", 0644,
+ top_dir,
+ &hwlat_data.sample_width,
+ &width_fops);
+ if (!hwlat_sample_width)
+ goto err;
+
+ return 0;
+
+ err:
+ tracefs_remove_recursive(top_dir);
+ return -ENOMEM;
+}
+
+static void hwlat_tracer_start(struct trace_array *tr)
+{
+ int err;
+
+ err = start_kthread(tr);
+ if (err)
+ pr_err(BANNER "Cannot start hwlat kthread\n");
+}
+
+static void hwlat_tracer_stop(struct trace_array *tr)
+{
+ stop_kthread();
+}
+
+static bool hwlat_busy;
+
+static int hwlat_tracer_init(struct trace_array *tr)
+{
+ /* Only allow one instance to enable this */
+ if (hwlat_busy)
+ return -EBUSY;
+
+ hwlat_trace = tr;
+
+ disable_migrate = false;
+ hwlat_data.count = 0;
+ tr->max_latency = 0;
+ save_tracing_thresh = tracing_thresh;
+
+ /* tracing_thresh is in nsecs, we speak in usecs */
+ if (!tracing_thresh)
+ tracing_thresh = last_tracing_thresh;
+
+ if (tracer_tracing_is_on(tr))
+ hwlat_tracer_start(tr);
+
+ hwlat_busy = true;
+
+ return 0;
+}
+
+static void hwlat_tracer_reset(struct trace_array *tr)
+{
+ stop_kthread();
+
+ /* the tracing threshold is static between runs */
+ last_tracing_thresh = tracing_thresh;
+
+ tracing_thresh = save_tracing_thresh;
+ hwlat_busy = false;
+}
+
+static struct tracer hwlat_tracer __read_mostly =
+{
+ .name = "hwlat",
+ .init = hwlat_tracer_init,
+ .reset = hwlat_tracer_reset,
+ .start = hwlat_tracer_start,
+ .stop = hwlat_tracer_stop,
+ .allow_instances = true,
+};
+
+__init static int init_hwlat_tracer(void)
+{
+ int ret;
+
+ mutex_init(&hwlat_data.lock);
+
+ ret = register_tracer(&hwlat_tracer);
+ if (ret)
+ return ret;
+
+ init_tracefs();
+
+ return 0;
+}
+late_initcall(init_hwlat_tracer);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 0bb9cf2d53e6..3fc20422c166 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1098,6 +1098,71 @@ static struct trace_event trace_user_stack_event = {
.funcs = &trace_user_stack_funcs,
};
+/* TRACE_HWLAT */
+static enum print_line_t
+trace_hwlat_print(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct trace_entry *entry = iter->ent;
+ struct trace_seq *s = &iter->seq;
+ struct hwlat_entry *field;
+
+ trace_assign_type(field, entry);
+
+ trace_seq_printf(s, "#%-5u inner/outer(us): %4llu/%-5llu ts:%ld.%09ld",
+ field->seqnum,
+ field->duration,
+ field->outer_duration,
+ field->timestamp.tv_sec,
+ field->timestamp.tv_nsec);
+
+ if (field->nmi_count) {
+ /*
+ * The generic sched_clock() is not NMI safe, thus
+ * we only record the count and not the time.
+ */
+ if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK))
+ trace_seq_printf(s, " nmi-total:%llu",
+ field->nmi_total_ts);
+ trace_seq_printf(s, " nmi-count:%u",
+ field->nmi_count);
+ }
+
+ trace_seq_putc(s, '\n');
+
+ return trace_handle_return(s);
+}
+
+
+static enum print_line_t
+trace_hwlat_raw(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct hwlat_entry *field;
+ struct trace_seq *s = &iter->seq;
+
+ trace_assign_type(field, iter->ent);
+
+ trace_seq_printf(s, "%llu %lld %ld %09ld %u\n",
+ field->duration,
+ field->outer_duration,
+ field->timestamp.tv_sec,
+ field->timestamp.tv_nsec,
+ field->seqnum);
+
+ return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_hwlat_funcs = {
+ .trace = trace_hwlat_print,
+ .raw = trace_hwlat_raw,
+};
+
+static struct trace_event trace_hwlat_event = {
+ .type = TRACE_HWLAT,
+ .funcs = &trace_hwlat_funcs,
+};
+
/* TRACE_BPUTS */
static enum print_line_t
trace_bputs_print(struct trace_iterator *iter, int flags,
@@ -1233,6 +1298,7 @@ static struct trace_event *events[] __initdata = {
&trace_bputs_event,
&trace_bprint_event,
&trace_print_event,
+ &trace_hwlat_event,
NULL
};
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 7a687320f867..0913693caf6e 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -431,10 +431,6 @@ static int create_trace_uprobe(int argc, char **argv)
pr_info("Probe point is not specified.\n");
return -EINVAL;
}
- if (isdigit(argv[1][0])) {
- pr_info("probe point must be have a filename.\n");
- return -EINVAL;
- }
arg = strchr(argv[1], ':');
if (!arg) {
ret = -EINVAL;
diff --git a/scripts/tracing/ftrace-bisect.sh b/scripts/tracing/ftrace-bisect.sh
new file mode 100755
index 000000000000..9ff8ac5fc53c
--- /dev/null
+++ b/scripts/tracing/ftrace-bisect.sh
@@ -0,0 +1,115 @@
+#!/bin/bash
+#
+# Here's how to use this:
+#
+# This script is used to help find functions that are being traced by function
+# tracer or function graph tracing that causes the machine to reboot, hang, or
+# crash. Here's the steps to take.
+#
+# First, determine if function tracing is working with a single function:
+#
+# (note, if this is a problem with function_graph tracing, then simply
+# replace "function" with "function_graph" in the following steps).
+#
+# # cd /sys/kernel/debug/tracing
+# # echo schedule > set_ftrace_filter
+# # echo function > current_tracer
+#
+# If this works, then we know that something is being traced that shouldn't be.
+#
+# # echo nop > current_tracer
+#
+# # cat available_filter_functions > ~/full-file
+# # ftrace-bisect ~/full-file ~/test-file ~/non-test-file
+# # cat ~/test-file > set_ftrace_filter
+#
+# *** Note *** this will take several minutes. Setting multiple functions is
+# an O(n^2) operation, and we are dealing with thousands of functions. So go
+# have coffee, talk with your coworkers, read facebook. And eventually, this
+# operation will end.
+#
+# # echo function > current_tracer
+#
+# If it crashes, we know that ~/test-file has a bad function.
+#
+# Reboot back to test kernel.
+#
+# # cd /sys/kernel/debug/tracing
+# # mv ~/test-file ~/full-file
+#
+# If it didn't crash.
+#
+# # echo nop > current_tracer
+# # mv ~/non-test-file ~/full-file
+#
+# Get rid of the other test file from previous run (or save them off somewhere).
+# # rm -f ~/test-file ~/non-test-file
+#
+# And start again:
+#
+# # ftrace-bisect ~/full-file ~/test-file ~/non-test-file
+#
+# The good thing is, because this cuts the number of functions in ~/test-file
+# by half, the cat of it into set_ftrace_filter takes half as long each
+# iteration, so don't talk so much at the water cooler the second time.
+#
+# Eventually, if you did this correctly, you will get down to the problem
+# function, and all we need to do is to notrace it.
+#
+# The way to figure out if the problem function is bad, just do:
+#
+# # echo <problem-function> > set_ftrace_notrace
+# # echo > set_ftrace_filter
+# # echo function > current_tracer
+#
+# And if it doesn't crash, we are done.
+#
+# If it does crash, do this again (there's more than one problem function)
+# but you need to echo the problem function(s) into set_ftrace_notrace before
+# enabling function tracing in the above steps. Or if you can compile the
+# kernel, annotate the problem functions with "notrace" and start again.
+#
+
+
+if [ $# -ne 3 ]; then
+ echo 'usage: ftrace-bisect full-file test-file non-test-file'
+ exit
+fi
+
+full=$1
+test=$2
+nontest=$3
+
+x=`cat $full | wc -l`
+if [ $x -eq 1 ]; then
+ echo "There's only one function left, must be the bad one"
+ cat $full
+ exit 0
+fi
+
+let x=$x/2
+let y=$x+1
+
+if [ ! -f $full ]; then
+ echo "$full does not exist"
+ exit 1
+fi
+
+if [ -f $test ]; then
+ echo -n "$test exists, delete it? [y/N]"
+ read a
+ if [ "$a" != "y" -a "$a" != "Y" ]; then
+ exit 1
+ fi
+fi
+
+if [ -f $nontest ]; then
+ echo -n "$nontest exists, delete it? [y/N]"
+ read a
+ if [ "$a" != "y" -a "$a" != "Y" ]; then
+ exit 1
+ fi
+fi
+
+sed -ne "1,${x}p" $full > $test
+sed -ne "$y,\$p" $full > $nontest