diff options
Diffstat (limited to 'tools/perf/Documentation')
-rw-r--r-- | tools/perf/Documentation/perf-annotate.txt | 3 | ||||
-rw-r--r-- | tools/perf/Documentation/perf-config.txt | 8 | ||||
-rw-r--r-- | tools/perf/Documentation/perf-kvm.txt | 9 | ||||
-rw-r--r-- | tools/perf/Documentation/perf-lock.txt | 4 | ||||
-rw-r--r-- | tools/perf/Documentation/perf-record.txt | 60 | ||||
-rw-r--r-- | tools/perf/Documentation/perf-report.txt | 4 | ||||
-rw-r--r-- | tools/perf/Documentation/perf-stat.txt | 27 | ||||
-rw-r--r-- | tools/perf/Documentation/perf-top.txt | 10 | ||||
-rw-r--r-- | tools/perf/Documentation/topdown.txt | 70 |
9 files changed, 135 insertions, 60 deletions
diff --git a/tools/perf/Documentation/perf-annotate.txt b/tools/perf/Documentation/perf-annotate.txt index 980fe2c29275..fe168e8165c8 100644 --- a/tools/perf/Documentation/perf-annotate.txt +++ b/tools/perf/Documentation/perf-annotate.txt @@ -116,6 +116,9 @@ include::itrace.txt[] -M:: --disassembler-style=:: Set disassembler style for objdump. +--addr2line=<path>:: + Path to addr2line binary. + --objdump=<path>:: Path to objdump binary. diff --git a/tools/perf/Documentation/perf-config.txt b/tools/perf/Documentation/perf-config.txt index 39c890ead2dc..e56ae54805a8 100644 --- a/tools/perf/Documentation/perf-config.txt +++ b/tools/perf/Documentation/perf-config.txt @@ -250,7 +250,13 @@ annotate.*:: These are in control of addresses, jump function, source code in lines of assembly code from a specific program. - annotate.disassembler_style: + annotate.addr2line:: + addr2line binary to use for file names and line numbers. + + annotate.objdump:: + objdump binary to use for disassembly and annotations. + + annotate.disassembler_style:: Use this to change the default disassembler style to some other value supported by binutils, such as "intel", see the '-M' option help in the 'objdump' man page. diff --git a/tools/perf/Documentation/perf-kvm.txt b/tools/perf/Documentation/perf-kvm.txt index 2ad3f5d9f72b..b66be66fe836 100644 --- a/tools/perf/Documentation/perf-kvm.txt +++ b/tools/perf/Documentation/perf-kvm.txt @@ -58,7 +58,7 @@ There are a couple of variants of perf kvm: events. 'perf kvm stat report' reports statistical data which includes events - handled time, samples, and so on. + handled sample, percent_sample, time, percent_time, max_t, min_t, mean_t. 'perf kvm stat live' reports statistical data in a live mode (similar to record + report but with statistical data updated live at a given display @@ -82,6 +82,8 @@ OPTIONS :GMEXAMPLESUBCMD: top include::guest-files.txt[] +--stdio:: Use the stdio interface. + -v:: --verbose:: Be more verbose (show counter open errors, etc). @@ -97,7 +99,10 @@ STAT REPORT OPTIONS -k:: --key=<value>:: Sorting key. Possible values: sample (default, sort by samples - number), time (sort by average time). + number), percent_sample (sort by sample percentage), time + (sort by average time), precent_time (sort by time percentage), + max_t (sort by maximum time), min_t (sort by minimum time), mean_t + (sort by mean time). -p:: --pid=:: Analyze events only for given process ID(s) (comma separated list). diff --git a/tools/perf/Documentation/perf-lock.txt b/tools/perf/Documentation/perf-lock.txt index 37aae194a2a1..6e5ba3cd2b72 100644 --- a/tools/perf/Documentation/perf-lock.txt +++ b/tools/perf/Documentation/perf-lock.txt @@ -155,8 +155,10 @@ CONTENTION OPTIONS --tid=<value>:: Record events on existing thread ID (comma separated list). +-M:: --map-nr-entries=<value>:: - Maximum number of BPF map entries (default: 10240). + Maximum number of BPF map entries (default: 16384). + This will be aligned to a power of 2. --max-stack=<value>:: Maximum stack depth when collecting lock contention (default: 8). diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt index ff815c2f67e8..680396c56bd1 100644 --- a/tools/perf/Documentation/perf-record.txt +++ b/tools/perf/Documentation/perf-record.txt @@ -119,9 +119,12 @@ OPTIONS "perf report" to view group events together. --filter=<filter>:: - Event filter. This option should follow an event selector (-e) which - selects either tracepoint event(s) or a hardware trace PMU - (e.g. Intel PT or CoreSight). + Event filter. This option should follow an event selector (-e). + If the event is a tracepoint, the filter string will be parsed by + the kernel. If the event is a hardware trace PMU (e.g. Intel PT + or CoreSight), it'll be processed as an address filter. Otherwise + it means a general filter using BPF which can be applied for any + kind of event. - tracepoint filters @@ -176,6 +179,57 @@ OPTIONS Multiple filters can be separated with space or comma. + - bpf filters + + A BPF filter can access the sample data and make a decision based on the + data. Users need to set an appropriate sample type to use the BPF + filter. BPF filters need root privilege. + + The sample data field can be specified in lower case letter. Multiple + filters can be separated with comma. For example, + + --filter 'period > 1000, cpu == 1' + or + --filter 'mem_op == load || mem_op == store, mem_lvl > l1' + + The former filter only accept samples with period greater than 1000 AND + CPU number is 1. The latter one accepts either load and store memory + operations but it should have memory level above the L1. Since the + mem_op and mem_lvl fields come from the (memory) data_source, it'd only + work with some events which set the data_source field. + + Also user should request to collect that information (with -d option in + the above case). Otherwise, the following message will be shown. + + $ sudo perf record -e cycles --filter 'mem_op == load' + Error: cycles event does not have PERF_SAMPLE_DATA_SRC + Hint: please add -d option to perf record. + failed to set filter "BPF" on event cycles with 22 (Invalid argument) + + Essentially the BPF filter expression is: + + <term> <operator> <value> (("," | "||") <term> <operator> <value>)* + + The <term> can be one of: + ip, id, tid, pid, cpu, time, addr, period, txn, weight, phys_addr, + code_pgsz, data_pgsz, weight1, weight2, weight3, ins_lat, retire_lat, + p_stage_cyc, mem_op, mem_lvl, mem_snoop, mem_remote, mem_lock, + mem_dtlb, mem_blk, mem_hops + + The <operator> can be one of: + ==, !=, >, >=, <, <=, & + + The <value> can be one of: + <number> (for any term) + na, load, store, pfetch, exec (for mem_op) + l1, l2, l3, l4, cxl, io, any_cache, lfb, ram, pmem (for mem_lvl) + na, none, hit, miss, hitm, fwd, peer (for mem_snoop) + remote (for mem_remote) + na, locked (for mem_locked) + na, l1_hit, l1_miss, l2_hit, l2_miss, any_hit, any_miss, walk, fault (for mem_dtlb) + na, by_data, by_addr (for mem_blk) + hops0, hops1, hops2, hops3 (for mem_hops) + --exclude-perf:: Don't record events issued by perf itself. This option should follow an event selector (-e) which selects tracepoint event(s). It adds a diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt index c242e8da6b1a..af068b4f1e5a 100644 --- a/tools/perf/Documentation/perf-report.txt +++ b/tools/perf/Documentation/perf-report.txt @@ -117,6 +117,7 @@ OPTIONS - addr: (Full) virtual address of the sampled instruction - retire_lat: On X86, this reports pipeline stall of this instruction compared to the previous instruction in cycles. And currently supported only on X86 + - simd: Flags describing a SIMD operation. "e" for empty Arm SVE predicate. "p" for partial Arm SVE predicate By default, comm, dso and symbol keys are used. (i.e. --sort comm,dso,symbol) @@ -380,6 +381,9 @@ OPTIONS This allows to examine the path the program took to each sample. The data collection must have used -b (or -j) and -g. +--addr2line=<path>:: + Path to addr2line binary. + --objdump=<path>:: Path to objdump binary. diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt index 18abdc1dce05..29bdcfa93f04 100644 --- a/tools/perf/Documentation/perf-stat.txt +++ b/tools/perf/Documentation/perf-stat.txt @@ -394,10 +394,10 @@ See perf list output for the possible metrics and metricgroups. Do not aggregate counts across all monitored CPUs. --topdown:: -Print complete top-down metrics supported by the CPU. This allows to -determine bottle necks in the CPU pipeline for CPU bound workloads, -by breaking the cycles consumed down into frontend bound, backend bound, -bad speculation and retiring. +Print top-down metrics supported by the CPU. This allows to determine +bottle necks in the CPU pipeline for CPU bound workloads, by breaking +the cycles consumed down into frontend bound, backend bound, bad +speculation and retiring. Frontend bound means that the CPU cannot fetch and decode instructions fast enough. Backend bound means that computation or memory access is the bottle @@ -430,15 +430,18 @@ CPUs the workload runs on. If needed the CPUs can be forced using taskset. --td-level:: -Print the top-down statistics that equal to or lower than the input level. -It allows users to print the interested top-down metrics level instead of -the complete top-down metrics. +Print the top-down statistics that equal the input level. It allows +users to print the interested top-down metrics level instead of the +level 1 top-down metrics. + +As the higher levels gather more metrics and use more counters they +will be less accurate. By convention a metric can be examined by +appending '_group' to it and this will increase accuracy compared to +gathering all metrics for a level. For example, level 1 analysis may +highlight 'tma_frontend_bound'. This metric may be drilled into with +'tma_frontend_bound_group' with +'perf stat -M tma_frontend_bound_group...'. -The availability of the top-down metrics level depends on the hardware. For -example, Ice Lake only supports L1 top-down metrics. The Sapphire Rapids -supports both L1 and L2 top-down metrics. - -Default: 0 means the max level that the current hardware support. Error out if the input is higher than the supported max level. --no-merge:: diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt index c60e615b7183..3c202ec080ba 100644 --- a/tools/perf/Documentation/perf-top.txt +++ b/tools/perf/Documentation/perf-top.txt @@ -161,6 +161,12 @@ Default is to monitor all CPUS. -M:: --disassembler-style=:: Set disassembler style for objdump. +--addr2line=<path>:: + Path to addr2line binary. + +--objdump=<path>:: + Path to objdump binary. + --prefix=PREFIX:: --prefix-strip=N:: Remove first N entries from source file path names in executables @@ -248,6 +254,10 @@ Default is to monitor all CPUS. The various filters must be specified as a comma separated list: --branch-filter any_ret,u,k Note that this feature may not be available on all processors. +--branch-history:: + Add the addresses of sampled taken branches to the callstack. + This allows to examine the path the program took to each sample. + --raw-trace:: When displaying traceevent output, do not use print fmt or plugins. diff --git a/tools/perf/Documentation/topdown.txt b/tools/perf/Documentation/topdown.txt index a15b93fdcf50..ae0aee86844f 100644 --- a/tools/perf/Documentation/topdown.txt +++ b/tools/perf/Documentation/topdown.txt @@ -1,46 +1,35 @@ -Using TopDown metrics in user space ------------------------------------ +Using TopDown metrics +--------------------- -Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown -methodology to break down CPU pipeline execution into 4 bottlenecks: -frontend bound, backend bound, bad speculation, retiring. +TopDown metrics break apart performance bottlenecks. Starting at level +1 it is typical to get metrics on retiring, bad speculation, frontend +bound, and backend bound. Higher levels provide more detail in to the +level 1 bottlenecks, such as at level 2: core bound, memory bound, +heavy operations, light operations, branch mispredicts, machine +clears, fetch latency and fetch bandwidth. For more details see [1][2][3]. -For more details on Topdown see [1][5] +perf stat --topdown implements this using available metrics that vary +per architecture. -Traditionally this was implemented by events in generic counters -and specific formulas to compute the bottlenecks. - -perf stat --topdown implements this. - -Full Top Down includes more levels that can break down the -bottlenecks further. This is not directly implemented in perf, -but available in other tools that can run on top of perf, -such as toplev[2] or vtune[3] +% perf stat -a --topdown -I1000 +# time % tma_retiring % tma_backend_bound % tma_frontend_bound % tma_bad_speculation + 1.001141351 11.5 34.9 46.9 6.7 + 2.006141972 13.4 28.1 50.4 8.1 + 3.010162040 12.9 28.1 51.1 8.0 + 4.014009311 12.5 28.6 51.8 7.2 + 5.017838554 11.8 33.0 48.0 7.2 + 5.704818971 14.0 27.5 51.3 7.3 +... -New Topdown features in Ice Lake -=============================== +New Topdown features in Intel Ice Lake +====================================== With Ice Lake CPUs the TopDown metrics are directly available as fixed counters and do not require generic counters. This allows to collect TopDown always in addition to other events. -% perf stat -a --topdown -I1000 -# time retiring bad speculation frontend bound backend bound - 1.001281330 23.0% 15.3% 29.6% 32.1% - 2.003009005 5.0% 6.8% 46.6% 41.6% - 3.004646182 6.7% 6.7% 46.0% 40.6% - 4.006326375 5.0% 6.4% 47.6% 41.0% - 5.007991804 5.1% 6.3% 46.3% 42.3% - 6.009626773 6.2% 7.1% 47.3% 39.3% - 7.011296356 4.7% 6.7% 46.2% 42.4% - 8.012951831 4.7% 6.7% 47.5% 41.1% -... - -This also enables measuring TopDown per thread/process instead -of only per core. - -Using TopDown through RDPMC in applications on Ice Lake -====================================================== +Using TopDown through RDPMC in applications on Intel Ice Lake +============================================================= For more fine grained measurements it can be useful to access the new directly from user space. This is more complicated, @@ -301,8 +290,8 @@ This "opens" a new measurement period. A program using RDPMC for TopDown should schedule such a reset regularly, as in every few seconds. -Limits on Ice Lake -================== +Limits on Intel Ice Lake +======================== Four pseudo TopDown metric events are exposed for the end-users, topdown-retiring, topdown-bad-spec, topdown-fe-bound and topdown-be-bound. @@ -318,8 +307,8 @@ a sampling read group. Since the SLOTS event must be the leader of a TopDown group, the second event of the group is the sampling event. For example, perf record -e '{slots, $sampling_event, topdown-retiring}:S' -Extension on Sapphire Rapids Server -=================================== +Extension on Intel Sapphire Rapids Server +========================================= The metrics counter is extended to support TMA method level 2 metrics. The lower half of the register is the TMA level 1 metrics (legacy). The upper half is also divided into four 8-bit fields for the new level 2 @@ -338,7 +327,6 @@ other four level 2 metrics by subtracting corresponding metrics as below. [1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win -[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual -[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe +[2] https://sites.google.com/site/analysismethods/yasin-pubs +[3] https://perf.wiki.kernel.org/index.php/Top-Down_Analysis [4] https://github.com/andikleen/pmu-tools/tree/master/jevents -[5] https://sites.google.com/site/analysismethods/yasin-pubs |