diff options
Diffstat (limited to 'Documentation/staging')
-rw-r--r-- | Documentation/staging/crc32.rst | 189 | ||||
-rw-r--r-- | Documentation/staging/index.rst | 32 | ||||
-rw-r--r-- | Documentation/staging/kprobes.rst | 801 | ||||
-rw-r--r-- | Documentation/staging/lzo.rst | 202 | ||||
-rw-r--r-- | Documentation/staging/remoteproc.rst | 359 | ||||
-rw-r--r-- | Documentation/staging/rpmsg.rst | 341 | ||||
-rw-r--r-- | Documentation/staging/speculation.rst | 92 | ||||
-rw-r--r-- | Documentation/staging/static-keys.rst | 331 | ||||
-rw-r--r-- | Documentation/staging/tee.rst | 277 | ||||
-rw-r--r-- | Documentation/staging/xz.rst | 127 |
10 files changed, 2751 insertions, 0 deletions
diff --git a/Documentation/staging/crc32.rst b/Documentation/staging/crc32.rst new file mode 100644 index 000000000000..8a6860f33b4e --- /dev/null +++ b/Documentation/staging/crc32.rst @@ -0,0 +1,189 @@ +================================= +brief tutorial on CRC computation +================================= + +A CRC is a long-division remainder. You add the CRC to the message, +and the whole thing (message+CRC) is a multiple of the given +CRC polynomial. To check the CRC, you can either check that the +CRC matches the recomputed value, *or* you can check that the +remainder computed on the message+CRC is 0. This latter approach +is used by a lot of hardware implementations, and is why so many +protocols put the end-of-frame flag after the CRC. + +It's actually the same long division you learned in school, except that: + +- We're working in binary, so the digits are only 0 and 1, and +- When dividing polynomials, there are no carries. Rather than add and + subtract, we just xor. Thus, we tend to get a bit sloppy about + the difference between adding and subtracting. + +Like all division, the remainder is always smaller than the divisor. +To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial. +Since it's 33 bits long, bit 32 is always going to be set, so usually the +CRC is written in hex with the most significant bit omitted. (If you're +familiar with the IEEE 754 floating-point format, it's the same idea.) + +Note that a CRC is computed over a string of *bits*, so you have +to decide on the endianness of the bits within each byte. To get +the best error-detecting properties, this should correspond to the +order they're actually sent. For example, standard RS-232 serial is +little-endian; the most significant bit (sometimes used for parity) +is sent last. And when appending a CRC word to a message, you should +do it in the right order, matching the endianness. + +Just like with ordinary division, you proceed one digit (bit) at a time. +Each step of the division you take one more digit (bit) of the dividend +and append it to the current remainder. Then you figure out the +appropriate multiple of the divisor to subtract to being the remainder +back into range. In binary, this is easy - it has to be either 0 or 1, +and to make the XOR cancel, it's just a copy of bit 32 of the remainder. + +When computing a CRC, we don't care about the quotient, so we can +throw the quotient bit away, but subtract the appropriate multiple of +the polynomial from the remainder and we're back to where we started, +ready to process the next bit. + +A big-endian CRC written this way would be coded like:: + + for (i = 0; i < input_bits; i++) { + multiple = remainder & 0x80000000 ? CRCPOLY : 0; + remainder = (remainder << 1 | next_input_bit()) ^ multiple; + } + +Notice how, to get at bit 32 of the shifted remainder, we look +at bit 31 of the remainder *before* shifting it. + +But also notice how the next_input_bit() bits we're shifting into +the remainder don't actually affect any decision-making until +32 bits later. Thus, the first 32 cycles of this are pretty boring. +Also, to add the CRC to a message, we need a 32-bit-long hole for it at +the end, so we have to add 32 extra cycles shifting in zeros at the +end of every message. + +These details lead to a standard trick: rearrange merging in the +next_input_bit() until the moment it's needed. Then the first 32 cycles +can be precomputed, and merging in the final 32 zero bits to make room +for the CRC can be skipped entirely. This changes the code to:: + + for (i = 0; i < input_bits; i++) { + remainder ^= next_input_bit() << 31; + multiple = (remainder & 0x80000000) ? CRCPOLY : 0; + remainder = (remainder << 1) ^ multiple; + } + +With this optimization, the little-endian code is particularly simple:: + + for (i = 0; i < input_bits; i++) { + remainder ^= next_input_bit(); + multiple = (remainder & 1) ? CRCPOLY : 0; + remainder = (remainder >> 1) ^ multiple; + } + +The most significant coefficient of the remainder polynomial is stored +in the least significant bit of the binary "remainder" variable. +The other details of endianness have been hidden in CRCPOLY (which must +be bit-reversed) and next_input_bit(). + +As long as next_input_bit is returning the bits in a sensible order, we don't +*have* to wait until the last possible moment to merge in additional bits. +We can do it 8 bits at a time rather than 1 bit at a time:: + + for (i = 0; i < input_bytes; i++) { + remainder ^= next_input_byte() << 24; + for (j = 0; j < 8; j++) { + multiple = (remainder & 0x80000000) ? CRCPOLY : 0; + remainder = (remainder << 1) ^ multiple; + } + } + +Or in little-endian:: + + for (i = 0; i < input_bytes; i++) { + remainder ^= next_input_byte(); + for (j = 0; j < 8; j++) { + multiple = (remainder & 1) ? CRCPOLY : 0; + remainder = (remainder >> 1) ^ multiple; + } + } + +If the input is a multiple of 32 bits, you can even XOR in a 32-bit +word at a time and increase the inner loop count to 32. + +You can also mix and match the two loop styles, for example doing the +bulk of a message byte-at-a-time and adding bit-at-a-time processing +for any fractional bytes at the end. + +To reduce the number of conditional branches, software commonly uses +the byte-at-a-time table method, popularized by Dilip V. Sarwate, +"Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM +v.31 no.8 (August 1998) p. 1008-1013. + +Here, rather than just shifting one bit of the remainder to decide +in the correct multiple to subtract, we can shift a byte at a time. +This produces a 40-bit (rather than a 33-bit) intermediate remainder, +and the correct multiple of the polynomial to subtract is found using +a 256-entry lookup table indexed by the high 8 bits. + +(The table entries are simply the CRC-32 of the given one-byte messages.) + +When space is more constrained, smaller tables can be used, e.g. two +4-bit shifts followed by a lookup in a 16-entry table. + +It is not practical to process much more than 8 bits at a time using this +technique, because tables larger than 256 entries use too much memory and, +more importantly, too much of the L1 cache. + +To get higher software performance, a "slicing" technique can be used. +See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm", +ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf + +This does not change the number of table lookups, but does increase +the parallelism. With the classic Sarwate algorithm, each table lookup +must be completed before the index of the next can be computed. + +A "slicing by 2" technique would shift the remainder 16 bits at a time, +producing a 48-bit intermediate remainder. Rather than doing a single +lookup in a 65536-entry table, the two high bytes are looked up in +two different 256-entry tables. Each contains the remainder required +to cancel out the corresponding byte. The tables are different because the +polynomials to cancel are different. One has non-zero coefficients from +x^32 to x^39, while the other goes from x^40 to x^47. + +Since modern processors can handle many parallel memory operations, this +takes barely longer than a single table look-up and thus performs almost +twice as fast as the basic Sarwate algorithm. + +This can be extended to "slicing by 4" using 4 256-entry tables. +Each step, 32 bits of data is fetched, XORed with the CRC, and the result +broken into bytes and looked up in the tables. Because the 32-bit shift +leaves the low-order bits of the intermediate remainder zero, the +final CRC is simply the XOR of the 4 table look-ups. + +But this still enforces sequential execution: a second group of table +look-ups cannot begin until the previous groups 4 table look-ups have all +been completed. Thus, the processor's load/store unit is sometimes idle. + +To make maximum use of the processor, "slicing by 8" performs 8 look-ups +in parallel. Each step, the 32-bit CRC is shifted 64 bits and XORed +with 64 bits of input data. What is important to note is that 4 of +those 8 bytes are simply copies of the input data; they do not depend +on the previous CRC at all. Thus, those 4 table look-ups may commence +immediately, without waiting for the previous loop iteration. + +By always having 4 loads in flight, a modern superscalar processor can +be kept busy and make full use of its L1 cache. + +Two more details about CRC implementation in the real world: + +Normally, appending zero bits to a message which is already a multiple +of a polynomial produces a larger multiple of that polynomial. Thus, +a basic CRC will not detect appended zero bits (or bytes). To enable +a CRC to detect this condition, it's common to invert the CRC before +appending it. This makes the remainder of the message+crc come out not +as zero, but some fixed non-zero value. (The CRC of the inversion +pattern, 0xffffffff.) + +The same problem applies to zero bits prepended to the message, and a +similar solution is used. Instead of starting the CRC computation with +a remainder of 0, an initial remainder of all ones is used. As long as +you start the same way on decoding, it doesn't make a difference. diff --git a/Documentation/staging/index.rst b/Documentation/staging/index.rst new file mode 100644 index 000000000000..8e98517675ca --- /dev/null +++ b/Documentation/staging/index.rst @@ -0,0 +1,32 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Unsorted Documentation +====================== + +.. toctree:: + :maxdepth: 2 + + crc32 + kprobes + lzo + remoteproc + rpmsg + speculation + static-keys + tee + xz + +Atomic Types +============ + +.. literalinclude:: ../atomic_t.txt + +Atomic bitops +============= + +.. literalinclude:: ../atomic_bitops.txt + +Memory Barriers +=============== + +.. literalinclude:: ../memory-barriers.txt diff --git a/Documentation/staging/kprobes.rst b/Documentation/staging/kprobes.rst new file mode 100644 index 000000000000..8baab8832c5b --- /dev/null +++ b/Documentation/staging/kprobes.rst @@ -0,0 +1,801 @@ +======================= +Kernel Probes (Kprobes) +======================= + +:Author: Jim Keniston <jkenisto@us.ibm.com> +:Author: Prasanna S Panchamukhi <prasanna.panchamukhi@gmail.com> +:Author: Masami Hiramatsu <mhiramat@redhat.com> + +.. CONTENTS + + 1. Concepts: Kprobes, and Return Probes + 2. Architectures Supported + 3. Configuring Kprobes + 4. API Reference + 5. Kprobes Features and Limitations + 6. Probe Overhead + 7. TODO + 8. Kprobes Example + 9. Kretprobes Example + 10. Deprecated Features + Appendix A: The kprobes debugfs interface + Appendix B: The kprobes sysctl interface + +Concepts: Kprobes and Return Probes +========================================= + +Kprobes enables you to dynamically break into any kernel routine and +collect debugging and performance information non-disruptively. You +can trap at almost any kernel code address [1]_, specifying a handler +routine to be invoked when the breakpoint is hit. + +.. [1] some parts of the kernel code can not be trapped, see + :ref:`kprobes_blacklist`) + +There are currently two types of probes: kprobes, and kretprobes +(also called return probes). A kprobe can be inserted on virtually +any instruction in the kernel. A return probe fires when a specified +function returns. + +In the typical case, Kprobes-based instrumentation is packaged as +a kernel module. The module's init function installs ("registers") +one or more probes, and the exit function unregisters them. A +registration function such as register_kprobe() specifies where +the probe is to be inserted and what handler is to be called when +the probe is hit. + +There are also ``register_/unregister_*probes()`` functions for batch +registration/unregistration of a group of ``*probes``. These functions +can speed up unregistration process when you have to unregister +a lot of probes at once. + +The next four subsections explain how the different types of +probes work and how jump optimization works. They explain certain +things that you'll need to know in order to make the best use of +Kprobes -- e.g., the difference between a pre_handler and +a post_handler, and how to use the maxactive and nmissed fields of +a kretprobe. But if you're in a hurry to start using Kprobes, you +can skip ahead to :ref:`kprobes_archs_supported`. + +How Does a Kprobe Work? +----------------------- + +When a kprobe is registered, Kprobes makes a copy of the probed +instruction and replaces the first byte(s) of the probed instruction +with a breakpoint instruction (e.g., int3 on i386 and x86_64). + +When a CPU hits the breakpoint instruction, a trap occurs, the CPU's +registers are saved, and control passes to Kprobes via the +notifier_call_chain mechanism. Kprobes executes the "pre_handler" +associated with the kprobe, passing the handler the addresses of the +kprobe struct and the saved registers. + +Next, Kprobes single-steps its copy of the probed instruction. +(It would be simpler to single-step the actual instruction in place, +but then Kprobes would have to temporarily remove the breakpoint +instruction. This would open a small time window when another CPU +could sail right past the probepoint.) + +After the instruction is single-stepped, Kprobes executes the +"post_handler," if any, that is associated with the kprobe. +Execution then continues with the instruction following the probepoint. + +Changing Execution Path +----------------------- + +Since kprobes can probe into a running kernel code, it can change the +register set, including instruction pointer. This operation requires +maximum care, such as keeping the stack frame, recovering the execution +path etc. Since it operates on a running kernel and needs deep knowledge +of computer architecture and concurrent computing, you can easily shoot +your foot. + +If you change the instruction pointer (and set up other related +registers) in pre_handler, you must return !0 so that kprobes stops +single stepping and just returns to the given address. +This also means post_handler should not be called anymore. + +Note that this operation may be harder on some architectures which use +TOC (Table of Contents) for function call, since you have to setup a new +TOC for your function in your module, and recover the old one after +returning from it. + +Return Probes +------------- + +How Does a Return Probe Work? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +When you call register_kretprobe(), Kprobes establishes a kprobe at +the entry to the function. When the probed function is called and this +probe is hit, Kprobes saves a copy of the return address, and replaces +the return address with the address of a "trampoline." The trampoline +is an arbitrary piece of code -- typically just a nop instruction. +At boot time, Kprobes registers a kprobe at the trampoline. + +When the probed function executes its return instruction, control +passes to the trampoline and that probe is hit. Kprobes' trampoline +handler calls the user-specified return handler associated with the +kretprobe, then sets the saved instruction pointer to the saved return +address, and that's where execution resumes upon return from the trap. + +While the probed function is executing, its return address is +stored in an object of type kretprobe_instance. Before calling +register_kretprobe(), the user sets the maxactive field of the +kretprobe struct to specify how many instances of the specified +function can be probed simultaneously. register_kretprobe() +pre-allocates the indicated number of kretprobe_instance objects. + +For example, if the function is non-recursive and is called with a +spinlock held, maxactive = 1 should be enough. If the function is +non-recursive and can never relinquish the CPU (e.g., via a semaphore +or preemption), NR_CPUS should be enough. If maxactive <= 0, it is +set to a default value. If CONFIG_PREEMPT is enabled, the default +is max(10, 2*NR_CPUS). Otherwise, the default is NR_CPUS. + +It's not a disaster if you set maxactive too low; you'll just miss +some probes. In the kretprobe struct, the nmissed field is set to +zero when the return probe is registered, and is incremented every +time the probed function is entered but there is no kretprobe_instance +object available for establishing the return probe. + +Kretprobe entry-handler +^^^^^^^^^^^^^^^^^^^^^^^ + +Kretprobes also provides an optional user-specified handler which runs +on function entry. This handler is specified by setting the entry_handler +field of the kretprobe struct. Whenever the kprobe placed by kretprobe at the +function entry is hit, the user-defined entry_handler, if any, is invoked. +If the entry_handler returns 0 (success) then a corresponding return handler +is guaranteed to be called upon function return. If the entry_handler +returns a non-zero error then Kprobes leaves the return address as is, and +the kretprobe has no further effect for that particular function instance. + +Multiple entry and return handler invocations are matched using the unique +kretprobe_instance object associated with them. Additionally, a user +may also specify per return-instance private data to be part of each +kretprobe_instance object. This is especially useful when sharing private +data between corresponding user entry and return handlers. The size of each +private data object can be specified at kretprobe registration time by +setting the data_size field of the kretprobe struct. This data can be +accessed through the data field of each kretprobe_instance object. + +In case probed function is entered but there is no kretprobe_instance +object available, then in addition to incrementing the nmissed count, +the user entry_handler invocation is also skipped. + +.. _kprobes_jump_optimization: + +How Does Jump Optimization Work? +-------------------------------- + +If your kernel is built with CONFIG_OPTPROBES=y (currently this flag +is automatically set 'y' on x86/x86-64, non-preemptive kernel) and +the "debug.kprobes_optimization" kernel parameter is set to 1 (see +sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump +instruction instead of a breakpoint instruction at each probepoint. + +Init a Kprobe +^^^^^^^^^^^^^ + +When a probe is registered, before attempting this optimization, +Kprobes inserts an ordinary, breakpoint-based kprobe at the specified +address. So, even if it's not possible to optimize this particular +probepoint, there'll be a probe there. + +Safety Check +^^^^^^^^^^^^ + +Before optimizing a probe, Kprobes performs the following safety checks: + +- Kprobes verifies that the region that will be replaced by the jump + instruction (the "optimized region") lies entirely within one function. + (A jump instruction is multiple bytes, and so may overlay multiple + instructions.) + +- Kprobes analyzes the entire function and verifies that there is no + jump into the optimized region. Specifically: + + - the function contains no indirect jump; + - the function contains no instruction that causes an exception (since + the fixup code triggered by the exception could jump back into the + optimized region -- Kprobes checks the exception tables to verify this); + - there is no near jump to the optimized region (other than to the first + byte). + +- For each instruction in the optimized region, Kprobes verifies that + the instruction can be executed out of line. + +Preparing Detour Buffer +^^^^^^^^^^^^^^^^^^^^^^^ + +Next, Kprobes prepares a "detour" buffer, which contains the following +instruction sequence: + +- code to push the CPU's registers (emulating a breakpoint trap) +- a call to the trampoline code which calls user's probe handlers. +- code to restore registers +- the instructions from the optimized region +- a jump back to the original execution path. + +Pre-optimization +^^^^^^^^^^^^^^^^ + +After preparing the detour buffer, Kprobes verifies that none of the +following situations exist: + +- The probe has a post_handler. +- Other instructions in the optimized region are probed. +- The probe is disabled. + +In any of the above cases, Kprobes won't start optimizing the probe. +Since these are temporary situations, Kprobes tries to start +optimizing it again if the situation is changed. + +If the kprobe can be optimized, Kprobes enqueues the kprobe to an +optimizing list, and kicks the kprobe-optimizer workqueue to optimize +it. If the to-be-optimized probepoint is hit before being optimized, +Kprobes returns control to the original instruction path by setting +the CPU's instruction pointer to the copied code in the detour buffer +-- thus at least avoiding the single-step. + +Optimization +^^^^^^^^^^^^ + +The Kprobe-optimizer doesn't insert the jump instruction immediately; +rather, it calls synchronize_rcu() for safety first, because it's +possible for a CPU to be interrupted in the middle of executing the +optimized region [3]_. As you know, synchronize_rcu() can ensure +that all interruptions that were active when synchronize_rcu() +was called are done, but only if CONFIG_PREEMPT=n. So, this version +of kprobe optimization supports only kernels with CONFIG_PREEMPT=n [4]_. + +After that, the Kprobe-optimizer calls stop_machine() to replace +the optimized region with a jump instruction to the detour buffer, +using text_poke_smp(). + +Unoptimization +^^^^^^^^^^^^^^ + +When an optimized kprobe is unregistered, disabled, or blocked by +another kprobe, it will be unoptimized. If this happens before +the optimization is complete, the kprobe is just dequeued from the +optimized list. If the optimization has been done, the jump is +replaced with the original code (except for an int3 breakpoint in +the first byte) by using text_poke_smp(). + +.. [3] Please imagine that the 2nd instruction is interrupted and then + the optimizer replaces the 2nd instruction with the jump *address* + while the interrupt handler is running. When the interrupt + returns to original address, there is no valid instruction, + and it causes an unexpected result. + +.. [4] This optimization-safety checking may be replaced with the + stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y + kernel. + +NOTE for geeks: +The jump optimization changes the kprobe's pre_handler behavior. +Without optimization, the pre_handler can change the kernel's execution +path by changing regs->ip and returning 1. However, when the probe +is optimized, that modification is ignored. Thus, if you want to +tweak the kernel's execution path, you need to suppress optimization, +using one of the following techniques: + +- Specify an empty function for the kprobe's post_handler. + +or + +- Execute 'sysctl -w debug.kprobes_optimization=n' + +.. _kprobes_blacklist: + +Blacklist +--------- + +Kprobes can probe most of the kernel except itself. This means +that there are some functions where kprobes cannot probe. Probing +(trapping) such functions can cause a recursive trap (e.g. double +fault) or the nested probe handler may never be called. +Kprobes manages such functions as a blacklist. +If you want to add a function into the blacklist, you just need +to (1) include linux/kprobes.h and (2) use NOKPROBE_SYMBOL() macro +to specify a blacklisted function. +Kprobes checks the given probe address against the blacklist and +rejects registering it, if the given address is in the blacklist. + +.. _kprobes_archs_supported: + +Architectures Supported +======================= + +Kprobes and return probes are implemented on the following +architectures: + +- i386 (Supports jump optimization) +- x86_64 (AMD-64, EM64T) (Supports jump optimization) +- ppc64 +- ia64 (Does not support probes on instruction slot1.) +- sparc64 (Return probes not yet implemented.) +- arm +- ppc +- mips +- s390 +- parisc + +Configuring Kprobes +=================== + +When configuring the kernel using make menuconfig/xconfig/oldconfig, +ensure that CONFIG_KPROBES is set to "y". Under "General setup", look +for "Kprobes". + +So that you can load and unload Kprobes-based instrumentation modules, +make sure "Loadable module support" (CONFIG_MODULES) and "Module +unloading" (CONFIG_MODULE_UNLOAD) are set to "y". + +Also make sure that CONFIG_KALLSYMS and perhaps even CONFIG_KALLSYMS_ALL +are set to "y", since kallsyms_lookup_name() is used by the in-kernel +kprobe address resolution code. + +If you need to insert a probe in the middle of a function, you may find +it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO), +so you can use "objdump -d -l vmlinux" to see the source-to-object +code mapping. + +API Reference +============= + +The Kprobes API includes a "register" function and an "unregister" +function for each type of probe. The API also includes "register_*probes" +and "unregister_*probes" functions for (un)registering arrays of probes. +Here are terse, mini-man-page specifications for these functions and +the associated probe handlers that you'll write. See the files in the +samples/kprobes/ sub-directory for examples. + +register_kprobe +--------------- + +:: + + #include <linux/kprobes.h> + int register_kprobe(struct kprobe *kp); + +Sets a breakpoint at the address kp->addr. When the breakpoint is +hit, Kprobes calls kp->pre_handler. After the probed instruction +is single-stepped, Kprobe calls kp->post_handler. If a fault +occurs during execution of kp->pre_handler or kp->post_handler, +or during single-stepping of the probed instruction, Kprobes calls +kp->fault_handler. Any or all handlers can be NULL. If kp->flags +is set KPROBE_FLAG_DISABLED, that kp will be registered but disabled, +so, its handlers aren't hit until calling enable_kprobe(kp). + +.. note:: + + 1. With the introduction of the "symbol_name" field to struct kprobe, + the probepoint address resolution will now be taken care of by the kernel. + The following will now work:: + + kp.symbol_name = "symbol_name"; + + (64-bit powerpc intricacies such as function descriptors are handled + transparently) + + 2. Use the "offset" field of struct kprobe if the offset into the symbol + to install a probepoint is known. This field is used to calculate the + probepoint. + + 3. Specify either the kprobe "symbol_name" OR the "addr". If both are + specified, kprobe registration will fail with -EINVAL. + + 4. With CISC architectures (such as i386 and x86_64), the kprobes code + does not validate if the kprobe.addr is at an instruction boundary. + Use "offset" with caution. + +register_kprobe() returns 0 on success, or a negative errno otherwise. + +User's pre-handler (kp->pre_handler):: + + #include <linux/kprobes.h> + #include <linux/ptrace.h> + int pre_handler(struct kprobe *p, struct pt_regs *regs); + +Called with p pointing to the kprobe associated with the breakpoint, +and regs pointing to the struct containing the registers saved when +the breakpoint was hit. Return 0 here unless you're a Kprobes geek. + +User's post-handler (kp->post_handler):: + + #include <linux/kprobes.h> + #include <linux/ptrace.h> + void post_handler(struct kprobe *p, struct pt_regs *regs, + unsigned long flags); + +p and regs are as described for the pre_handler. flags always seems +to be zero. + +User's fault-handler (kp->fault_handler):: + + #include <linux/kprobes.h> + #include <linux/ptrace.h> + int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr); + +p and regs are as described for the pre_handler. trapnr is the +architecture-specific trap number associated with the fault (e.g., +on i386, 13 for a general protection fault or 14 for a page fault). +Returns 1 if it successfully handled the exception. + +register_kretprobe +------------------ + +:: + + #include <linux/kprobes.h> + int register_kretprobe(struct kretprobe *rp); + +Establishes a return probe for the function whose address is +rp->kp.addr. When that function returns, Kprobes calls rp->handler. +You must set rp->maxactive appropriately before you call +register_kretprobe(); see "How Does a Return Probe Work?" for details. + +register_kretprobe() returns 0 on success, or a negative errno +otherwise. + +User's return-probe handler (rp->handler):: + + #include <linux/kprobes.h> + #include <linux/ptrace.h> + int kretprobe_handler(struct kretprobe_instance *ri, + struct pt_regs *regs); + +regs is as described for kprobe.pre_handler. ri points to the +kretprobe_instance object, of which the following fields may be +of interest: + +- ret_addr: the return address +- rp: points to the corresponding kretprobe object +- task: points to the corresponding task struct +- data: points to per return-instance private data; see "Kretprobe + entry-handler" for details. + +The regs_return_value(regs) macro provides a simple abstraction to +extract the return value from the appropriate register as defined by +the architecture's ABI. + +The handler's return value is currently ignored. + +unregister_*probe +------------------ + +:: + + #include <linux/kprobes.h> + void unregister_kprobe(struct kprobe *kp); + void unregister_kretprobe(struct kretprobe *rp); + +Removes the specified probe. The unregister function can be called +at any time after the probe has been registered. + +.. note:: + + If the functions find an incorrect probe (ex. an unregistered probe), + they clear the addr field of the probe. + +register_*probes +---------------- + +:: + + #include <linux/kprobes.h> + int register_kprobes(struct kprobe **kps, int num); + int register_kretprobes(struct kretprobe **rps, int num); + +Registers each of the num probes in the specified array. If any +error occurs during registration, all probes in the array, up to +the bad probe, are safely unregistered before the register_*probes +function returns. + +- kps/rps: an array of pointers to ``*probe`` data structures +- num: the number of the array entries. + +.. note:: + + You have to allocate(or define) an array of pointers and set all + of the array entries before using these functions. + +unregister_*probes +------------------ + +:: + + #include <linux/kprobes.h> + void unregister_kprobes(struct kprobe **kps, int num); + void unregister_kretprobes(struct kretprobe **rps, int num); + +Removes each of the num probes in the specified array at once. + +.. note:: + + If the functions find some incorrect probes (ex. unregistered + probes) in the specified array, they clear the addr field of those + incorrect probes. However, other probes in the array are + unregistered correctly. + +disable_*probe +-------------- + +:: + + #include <linux/kprobes.h> + int disable_kprobe(struct kprobe *kp); + int disable_kretprobe(struct kretprobe *rp); + +Temporarily disables the specified ``*probe``. You can enable it again by using +enable_*probe(). You must specify the probe which has been registered. + +enable_*probe +------------- + +:: + + #include <linux/kprobes.h> + int enable_kprobe(struct kprobe *kp); + int enable_kretprobe(struct kretprobe *rp); + +Enables ``*probe`` which has been disabled by disable_*probe(). You must specify +the probe which has been registered. + +Kprobes Features and Limitations +================================ + +Kprobes allows multiple probes at the same address. Also, +a probepoint for which there is a post_handler cannot be optimized. +So if you install a kprobe with a post_handler, at an optimized +probepoint, the probepoint will be unoptimized automatically. + +In general, you can install a probe anywhere in the kernel. +In particular, you can probe interrupt handlers. Known exceptions +are discussed in this section. + +The register_*probe functions will return -EINVAL if you attempt +to install a probe in the code that implements Kprobes (mostly +kernel/kprobes.c and ``arch/*/kernel/kprobes.c``, but also functions such +as do_page_fault and notifier_call_chain). + +If you install a probe in an inline-able function, Kprobes makes +no attempt to chase down all inline instances of the function and +install probes there. gcc may inline a function without being asked, +so keep this in mind if you're not seeing the probe hits you expect. + +A probe handler can modify the environment of the probed function +-- e.g., by modifying kernel data structures, or by modifying the +contents of the pt_regs struct (which are restored to the registers +upon return from the breakpoint). So Kprobes can be used, for example, +to install a bug fix or to inject faults for testing. Kprobes, of +course, has no way to distinguish the deliberately injected faults +from the accidental ones. Don't drink and probe. + +Kprobes makes no attempt to prevent probe handlers from stepping on +each other -- e.g., probing printk() and then calling printk() from a +probe handler. If a probe handler hits a probe, that second probe's +handlers won't be run in that instance, and the kprobe.nmissed member +of the second probe will be incremented. + +As of Linux v2.6.15-rc1, multiple handlers (or multiple instances of +the same handler) may run concurrently on different CPUs. + +Kprobes does not use mutexes or allocate memory except during +registration and unregistration. + +Probe handlers are run with preemption disabled or interrupt disabled, +which depends on the architecture and optimization state. (e.g., +kretprobe handlers and optimized kprobe handlers run without interrupt +disabled on x86/x86-64). In any case, your handler should not yield +the CPU (e.g., by attempting to acquire a semaphore, or waiting I/O). + +Since a return probe is implemented by replacing the return +address with the trampoline's address, stack backtraces and calls +to __builtin_return_address() will typically yield the trampoline's +address instead of the real return address for kretprobed functions. +(As far as we can tell, __builtin_return_address() is used only +for instrumentation and error reporting.) + +If the number of times a function is called does not match the number +of times it returns, registering a return probe on that function may +produce undesirable results. In such a case, a line: +kretprobe BUG!: Processing kretprobe d000000000041aa8 @ c00000000004f48c +gets printed. With this information, one will be able to correlate the +exact instance of the kretprobe that caused the problem. We have the +do_exit() case covered. do_execve() and do_fork() are not an issue. +We're unaware of other specific cases where this could be a problem. + +If, upon entry to or exit from a function, the CPU is running on +a stack other than that of the current task, registering a return +probe on that function may produce undesirable results. For this +reason, Kprobes doesn't support return probes (or kprobes) +on the x86_64 version of __switch_to(); the registration functions +return -EINVAL. + +On x86/x86-64, since the Jump Optimization of Kprobes modifies +instructions widely, there are some limitations to optimization. To +explain it, we introduce some terminology. Imagine a 3-instruction +sequence consisting of a two 2-byte instructions and one 3-byte +instruction. + +:: + + IA + | + [-2][-1][0][1][2][3][4][5][6][7] + [ins1][ins2][ ins3 ] + [<- DCR ->] + [<- JTPR ->] + + ins1: 1st Instruction + ins2: 2nd Instruction + ins3: 3rd Instruction + IA: Insertion Address + JTPR: Jump Target Prohibition Region + DCR: Detoured Code Region + +The instructions in DCR are copied to the out-of-line buffer +of the kprobe, because the bytes in DCR are replaced by +a 5-byte jump instruction. So there are several limitations. + +a) The instructions in DCR must be relocatable. +b) The instructions in DCR must not include a call instruction. +c) JTPR must not be targeted by any jump or call instruction. +d) DCR must not straddle the border between functions. + +Anyway, these limitations are checked by the in-kernel instruction +decoder, so you don't need to worry about that. + +Probe Overhead +============== + +On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0 +microseconds to process. Specifically, a benchmark that hits the same +probepoint repeatedly, firing a simple handler each time, reports 1-2 +million hits per second, depending on the architecture. A return-probe +hit typically takes 50-75% longer than a kprobe hit. +When you have a return probe set on a function, adding a kprobe at +the entry to that function adds essentially no overhead. + +Here are sample overhead figures (in usec) for different architectures:: + + k = kprobe; r = return probe; kr = kprobe + return probe + on same function + + i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips + k = 0.57 usec; r = 0.92; kr = 0.99 + + x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips + k = 0.49 usec; r = 0.80; kr = 0.82 + + ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU) + k = 0.77 usec; r = 1.26; kr = 1.45 + +Optimized Probe Overhead +------------------------ + +Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to +process. Here are sample overhead figures (in usec) for x86 architectures:: + + k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe, + r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe. + + i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips + k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33 + + x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips + k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30 + +TODO +==== + +a. SystemTap (http://sourceware.org/systemtap): Provides a simplified + programming interface for probe-based instrumentation. Try it out. +b. Kernel return probes for sparc64. +c. Support for other architectures. +d. User-space probes. +e. Watchpoint probes (which fire on data references). + +Kprobes Example +=============== + +See samples/kprobes/kprobe_example.c + +Kretprobes Example +================== + +See samples/kprobes/kretprobe_example.c + +For additional information on Kprobes, refer to the following URLs: + +- http://www-106.ibm.com/developerworks/library/l-kprobes.html?ca=dgr-lnxw42Kprobe +- http://www.redhat.com/magazine/005mar05/features/kprobes/ +- http://www-users.cs.umn.edu/~boutcher/kprobes/ +- http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf (pages 101-115) + +Deprecated Features +=================== + +Jprobes is now a deprecated feature. People who are depending on it should +migrate to other tracing features or use older kernels. Please consider to +migrate your tool to one of the following options: + +- Use trace-event to trace target function with arguments. + + trace-event is a low-overhead (and almost no visible overhead if it + is off) statically defined event interface. You can define new events + and trace it via ftrace or any other tracing tools. + + See the following urls: + + - https://lwn.net/Articles/379903/ + - https://lwn.net/Articles/381064/ + - https://lwn.net/Articles/383362/ + +- Use ftrace dynamic events (kprobe event) with perf-probe. + + If you build your kernel with debug info (CONFIG_DEBUG_INFO=y), you can + find which register/stack is assigned to which local variable or arguments + by using perf-probe and set up new event to trace it. + + See following documents: + + - Documentation/trace/kprobetrace.rst + - Documentation/trace/events.rst + - tools/perf/Documentation/perf-probe.txt + + +The kprobes debugfs interface +============================= + + +With recent kernels (> 2.6.20) the list of registered kprobes is visible +under the /sys/kernel/debug/kprobes/ directory (assuming debugfs is mounted at //sys/kernel/debug). + +/sys/kernel/debug/kprobes/list: Lists all registered probes on the system:: + + c015d71a k vfs_read+0x0 + c03dedc5 r tcp_v4_rcv+0x0 + +The first column provides the kernel address where the probe is inserted. +The second column identifies the type of probe (k - kprobe and r - kretprobe) +while the third column specifies the symbol+offset of the probe. +If the probed function belongs to a module, the module name is also +specified. Following columns show probe status. If the probe is on +a virtual address that is no longer valid (module init sections, module +virtual addresses that correspond to modules that've been unloaded), +such probes are marked with [GONE]. If the probe is temporarily disabled, +such probes are marked with [DISABLED]. If the probe is optimized, it is +marked with [OPTIMIZED]. If the probe is ftrace-based, it is marked with +[FTRACE]. + +/sys/kernel/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly. + +Provides a knob to globally and forcibly turn registered kprobes ON or OFF. +By default, all kprobes are enabled. By echoing "0" to this file, all +registered probes will be disarmed, till such time a "1" is echoed to this +file. Note that this knob just disarms and arms all kprobes and doesn't +change each probe's disabling state. This means that disabled kprobes (marked +[DISABLED]) will be not enabled if you turn ON all kprobes by this knob. + + +The kprobes sysctl interface +============================ + +/proc/sys/debug/kprobes-optimization: Turn kprobes optimization ON/OFF. + +When CONFIG_OPTPROBES=y, this sysctl interface appears and it provides +a knob to globally and forcibly turn jump optimization (see section +:ref:`kprobes_jump_optimization`) ON or OFF. By default, jump optimization +is allowed (ON). If you echo "0" to this file or set +"debug.kprobes_optimization" to 0 via sysctl, all optimized probes will be +unoptimized, and any new probes registered after that will not be optimized. + +Note that this knob *changes* the optimized state. This means that optimized +probes (marked [OPTIMIZED]) will be unoptimized ([OPTIMIZED] tag will be +removed). If the knob is turned on, they will be optimized again. + diff --git a/Documentation/staging/lzo.rst b/Documentation/staging/lzo.rst new file mode 100644 index 000000000000..f65b51523014 --- /dev/null +++ b/Documentation/staging/lzo.rst @@ -0,0 +1,202 @@ +=========================================================== +LZO stream format as understood by Linux's LZO decompressor +=========================================================== + +Introduction +============ + + This is not a specification. No specification seems to be publicly available + for the LZO stream format. This document describes what input format the LZO + decompressor as implemented in the Linux kernel understands. The file subject + of this analysis is lib/lzo/lzo1x_decompress_safe.c. No analysis was made on + the compressor nor on any other implementations though it seems likely that + the format matches the standard one. The purpose of this document is to + better understand what the code does in order to propose more efficient fixes + for future bug reports. + +Description +=========== + + The stream is composed of a series of instructions, operands, and data. The + instructions consist in a few bits representing an opcode, and bits forming + the operands for the instruction, whose size and position depend on the + opcode and on the number of literals copied by previous instruction. The + operands are used to indicate: + + - a distance when copying data from the dictionary (past output buffer) + - a length (number of bytes to copy from dictionary) + - the number of literals to copy, which is retained in variable "state" + as a piece of information for next instructions. + + Optionally depending on the opcode and operands, extra data may follow. These + extra data can be a complement for the operand (eg: a length or a distance + encoded on larger values), or a literal to be copied to the output buffer. + + The first byte of the block follows a different encoding from other bytes, it + seems to be optimized for literal use only, since there is no dictionary yet + prior to that byte. + + Lengths are always encoded on a variable size starting with a small number + of bits in the operand. If the number of bits isn't enough to represent the + length, up to 255 may be added in increments by consuming more bytes with a + rate of at most 255 per extra byte (thus the compression ratio cannot exceed + around 255:1). The variable length encoding using #bits is always the same:: + + length = byte & ((1 << #bits) - 1) + if (!length) { + length = ((1 << #bits) - 1) + length += 255*(number of zero bytes) + length += first-non-zero-byte + } + length += constant (generally 2 or 3) + + For references to the dictionary, distances are relative to the output + pointer. Distances are encoded using very few bits belonging to certain + ranges, resulting in multiple copy instructions using different encodings. + Certain encodings involve one extra byte, others involve two extra bytes + forming a little-endian 16-bit quantity (marked LE16 below). + + After any instruction except the large literal copy, 0, 1, 2 or 3 literals + are copied before starting the next instruction. The number of literals that + were copied may change the meaning and behaviour of the next instruction. In + practice, only one instruction needs to know whether 0, less than 4, or more + literals were copied. This is the information stored in the <state> variable + in this implementation. This number of immediate literals to be copied is + generally encoded in the last two bits of the instruction but may also be + taken from the last two bits of an extra operand (eg: distance). + + End of stream is declared when a block copy of distance 0 is seen. Only one + instruction may encode this distance (0001HLLL), it takes one LE16 operand + for the distance, thus requiring 3 bytes. + + .. important:: + + In the code some length checks are missing because certain instructions + are called under the assumption that a certain number of bytes follow + because it has already been guaranteed before parsing the instructions. + They just have to "refill" this credit if they consume extra bytes. This + is an implementation design choice independent on the algorithm or + encoding. + +Versions + +0: Original version +1: LZO-RLE + +Version 1 of LZO implements an extension to encode runs of zeros using run +length encoding. This improves speed for data with many zeros, which is a +common case for zram. This modifies the bitstream in a backwards compatible way +(v1 can correctly decompress v0 compressed data, but v0 cannot read v1 data). + +For maximum compatibility, both versions are available under different names +(lzo and lzo-rle). Differences in the encoding are noted in this document with +e.g.: version 1 only. + +Byte sequences +============== + + First byte encoding:: + + 0..16 : follow regular instruction encoding, see below. It is worth + noting that code 16 will represent a block copy from the + dictionary which is empty, and that it will always be + invalid at this place. + + 17 : bitstream version. If the first byte is 17, and compressed + stream length is at least 5 bytes (length of shortest possible + versioned bitstream), the next byte gives the bitstream version + (version 1 only). + Otherwise, the bitstream version is 0. + + 18..21 : copy 0..3 literals + state = (byte - 17) = 0..3 [ copy <state> literals ] + skip byte + + 22..255 : copy literal string + length = (byte - 17) = 4..238 + state = 4 [ don't copy extra literals ] + skip byte + + Instruction encoding:: + + 0 0 0 0 X X X X (0..15) + Depends on the number of literals copied by the last instruction. + If last instruction did not copy any literal (state == 0), this + encoding will be a copy of 4 or more literal, and must be interpreted + like this : + + 0 0 0 0 L L L L (0..15) : copy long literal string + length = 3 + (L ?: 15 + (zero_bytes * 255) + non_zero_byte) + state = 4 (no extra literals are copied) + + If last instruction used to copy between 1 to 3 literals (encoded in + the instruction's opcode or distance), the instruction is a copy of a + 2-byte block from the dictionary within a 1kB distance. It is worth + noting that this instruction provides little savings since it uses 2 + bytes to encode a copy of 2 other bytes but it encodes the number of + following literals for free. It must be interpreted like this : + + 0 0 0 0 D D S S (0..15) : copy 2 bytes from <= 1kB distance + length = 2 + state = S (copy S literals after this block) + Always followed by exactly one byte : H H H H H H H H + distance = (H << 2) + D + 1 + + If last instruction used to copy 4 or more literals (as detected by + state == 4), the instruction becomes a copy of a 3-byte block from the + dictionary from a 2..3kB distance, and must be interpreted like this : + + 0 0 0 0 D D S S (0..15) : copy 3 bytes from 2..3 kB distance + length = 3 + state = S (copy S literals after this block) + Always followed by exactly one byte : H H H H H H H H + distance = (H << 2) + D + 2049 + + 0 0 0 1 H L L L (16..31) + Copy of a block within 16..48kB distance (preferably less than 10B) + length = 2 + (L ?: 7 + (zero_bytes * 255) + non_zero_byte) + Always followed by exactly one LE16 : D D D D D D D D : D D D D D D S S + distance = 16384 + (H << 14) + D + state = S (copy S literals after this block) + End of stream is reached if distance == 16384 + In version 1 only, to prevent ambiguity with the RLE case when + ((distance & 0x803f) == 0x803f) && (261 <= length <= 264), the + compressor must not emit block copies where distance and length + meet these conditions. + + In version 1 only, this instruction is also used to encode a run of + zeros if distance = 0xbfff, i.e. H = 1 and the D bits are all 1. + In this case, it is followed by a fourth byte, X. + run length = ((X << 3) | (0 0 0 0 0 L L L)) + 4 + + 0 0 1 L L L L L (32..63) + Copy of small block within 16kB distance (preferably less than 34B) + length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte) + Always followed by exactly one LE16 : D D D D D D D D : D D D D D D S S + distance = D + 1 + state = S (copy S literals after this block) + + 0 1 L D D D S S (64..127) + Copy 3-4 bytes from block within 2kB distance + state = S (copy S literals after this block) + length = 3 + L + Always followed by exactly one byte : H H H H H H H H + distance = (H << 3) + D + 1 + + 1 L L D D D S S (128..255) + Copy 5-8 bytes from block within 2kB distance + state = S (copy S literals after this block) + length = 5 + L + Always followed by exactly one byte : H H H H H H H H + distance = (H << 3) + D + 1 + +Authors +======= + + This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an + analysis of the decompression code available in Linux 3.16-rc5, and updated + by Dave Rodgman <dave.rodgman@arm.com> on 2018/10/30 to introduce run-length + encoding. The code is tricky, it is possible that this document contains + mistakes or that a few corner cases were overlooked. In any case, please + report any doubt, fix, or proposed updates to the author(s) so that the + document can be updated. diff --git a/Documentation/staging/remoteproc.rst b/Documentation/staging/remoteproc.rst new file mode 100644 index 000000000000..9cccd3dd6a4b --- /dev/null +++ b/Documentation/staging/remoteproc.rst @@ -0,0 +1,359 @@ +========================== +Remote Processor Framework +========================== + +Introduction +============ + +Modern SoCs typically have heterogeneous remote processor devices in asymmetric +multiprocessing (AMP) configurations, which may be running different instances +of operating system, whether it's Linux or any other flavor of real-time OS. + +OMAP4, for example, has dual Cortex-A9, dual Cortex-M3 and a C64x+ DSP. +In a typical configuration, the dual cortex-A9 is running Linux in a SMP +configuration, and each of the other three cores (two M3 cores and a DSP) +is running its own instance of RTOS in an AMP configuration. + +The remoteproc framework allows different platforms/architectures to +control (power on, load firmware, power off) those remote processors while +abstracting the hardware differences, so the entire driver doesn't need to be +duplicated. In addition, this framework also adds rpmsg virtio devices +for remote processors that supports this kind of communication. This way, +platform-specific remoteproc drivers only need to provide a few low-level +handlers, and then all rpmsg drivers will then just work +(for more information about the virtio-based rpmsg bus and its drivers, +please read Documentation/staging/rpmsg.rst). +Registration of other types of virtio devices is now also possible. Firmwares +just need to publish what kind of virtio devices do they support, and then +remoteproc will add those devices. This makes it possible to reuse the +existing virtio drivers with remote processor backends at a minimal development +cost. + +User API +======== + +:: + + int rproc_boot(struct rproc *rproc) + +Boot a remote processor (i.e. load its firmware, power it on, ...). + +If the remote processor is already powered on, this function immediately +returns (successfully). + +Returns 0 on success, and an appropriate error value otherwise. +Note: to use this function you should already have a valid rproc +handle. There are several ways to achieve that cleanly (devres, pdata, +the way remoteproc_rpmsg.c does this, or, if this becomes prevalent, we +might also consider using dev_archdata for this). + +:: + + void rproc_shutdown(struct rproc *rproc) + +Power off a remote processor (previously booted with rproc_boot()). +In case @rproc is still being used by an additional user(s), then +this function will just decrement the power refcount and exit, +without really powering off the device. + +Every call to rproc_boot() must (eventually) be accompanied by a call +to rproc_shutdown(). Calling rproc_shutdown() redundantly is a bug. + +.. note:: + + we're not decrementing the rproc's refcount, only the power refcount. + which means that the @rproc handle stays valid even after + rproc_shutdown() returns, and users can still use it with a subsequent + rproc_boot(), if needed. + +:: + + struct rproc *rproc_get_by_phandle(phandle phandle) + +Find an rproc handle using a device tree phandle. Returns the rproc +handle on success, and NULL on failure. This function increments +the remote processor's refcount, so always use rproc_put() to +decrement it back once rproc isn't needed anymore. + +Typical usage +============= + +:: + + #include <linux/remoteproc.h> + + /* in case we were given a valid 'rproc' handle */ + int dummy_rproc_example(struct rproc *my_rproc) + { + int ret; + + /* let's power on and boot our remote processor */ + ret = rproc_boot(my_rproc); + if (ret) { + /* + * something went wrong. handle it and leave. + */ + } + + /* + * our remote processor is now powered on... give it some work + */ + + /* let's shut it down now */ + rproc_shutdown(my_rproc); + } + +API for implementors +==================== + +:: + + struct rproc *rproc_alloc(struct device *dev, const char *name, + const struct rproc_ops *ops, + const char *firmware, int len) + +Allocate a new remote processor handle, but don't register +it yet. Required parameters are the underlying device, the +name of this remote processor, platform-specific ops handlers, +the name of the firmware to boot this rproc with, and the +length of private data needed by the allocating rproc driver (in bytes). + +This function should be used by rproc implementations during +initialization of the remote processor. + +After creating an rproc handle using this function, and when ready, +implementations should then call rproc_add() to complete +the registration of the remote processor. + +On success, the new rproc is returned, and on failure, NULL. + +.. note:: + + **never** directly deallocate @rproc, even if it was not registered + yet. Instead, when you need to unroll rproc_alloc(), use rproc_free(). + +:: + + void rproc_free(struct rproc *rproc) + +Free an rproc handle that was allocated by rproc_alloc. + +This function essentially unrolls rproc_alloc(), by decrementing the +rproc's refcount. It doesn't directly free rproc; that would happen +only if there are no other references to rproc and its refcount now +dropped to zero. + +:: + + int rproc_add(struct rproc *rproc) + +Register @rproc with the remoteproc framework, after it has been +allocated with rproc_alloc(). + +This is called by the platform-specific rproc implementation, whenever +a new remote processor device is probed. + +Returns 0 on success and an appropriate error code otherwise. +Note: this function initiates an asynchronous firmware loading +context, which will look for virtio devices supported by the rproc's +firmware. + +If found, those virtio devices will be created and added, so as a result +of registering this remote processor, additional virtio drivers might get +probed. + +:: + + int rproc_del(struct rproc *rproc) + +Unroll rproc_add(). + +This function should be called when the platform specific rproc +implementation decides to remove the rproc device. it should +_only_ be called if a previous invocation of rproc_add() +has completed successfully. + +After rproc_del() returns, @rproc is still valid, and its +last refcount should be decremented by calling rproc_free(). + +Returns 0 on success and -EINVAL if @rproc isn't valid. + +:: + + void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type) + +Report a crash in a remoteproc + +This function must be called every time a crash is detected by the +platform specific rproc implementation. This should not be called from a +non-remoteproc driver. This function can be called from atomic/interrupt +context. + +Implementation callbacks +======================== + +These callbacks should be provided by platform-specific remoteproc +drivers:: + + /** + * struct rproc_ops - platform-specific device handlers + * @start: power on the device and boot it + * @stop: power off the device + * @kick: kick a virtqueue (virtqueue id given as a parameter) + */ + struct rproc_ops { + int (*start)(struct rproc *rproc); + int (*stop)(struct rproc *rproc); + void (*kick)(struct rproc *rproc, int vqid); + }; + +Every remoteproc implementation should at least provide the ->start and ->stop +handlers. If rpmsg/virtio functionality is also desired, then the ->kick handler +should be provided as well. + +The ->start() handler takes an rproc handle and should then power on the +device and boot it (use rproc->priv to access platform-specific private data). +The boot address, in case needed, can be found in rproc->bootaddr (remoteproc +core puts there the ELF entry point). +On success, 0 should be returned, and on failure, an appropriate error code. + +The ->stop() handler takes an rproc handle and powers the device down. +On success, 0 is returned, and on failure, an appropriate error code. + +The ->kick() handler takes an rproc handle, and an index of a virtqueue +where new message was placed in. Implementations should interrupt the remote +processor and let it know it has pending messages. Notifying remote processors +the exact virtqueue index to look in is optional: it is easy (and not +too expensive) to go through the existing virtqueues and look for new buffers +in the used rings. + +Binary Firmware Structure +========================= + +At this point remoteproc supports ELF32 and ELF64 firmware binaries. However, +it is quite expected that other platforms/devices which we'd want to +support with this framework will be based on different binary formats. + +When those use cases show up, we will have to decouple the binary format +from the framework core, so we can support several binary formats without +duplicating common code. + +When the firmware is parsed, its various segments are loaded to memory +according to the specified device address (might be a physical address +if the remote processor is accessing memory directly). + +In addition to the standard ELF segments, most remote processors would +also include a special section which we call "the resource table". + +The resource table contains system resources that the remote processor +requires before it should be powered on, such as allocation of physically +contiguous memory, or iommu mapping of certain on-chip peripherals. +Remotecore will only power up the device after all the resource table's +requirement are met. + +In addition to system resources, the resource table may also contain +resource entries that publish the existence of supported features +or configurations by the remote processor, such as trace buffers and +supported virtio devices (and their configurations). + +The resource table begins with this header:: + + /** + * struct resource_table - firmware resource table header + * @ver: version number + * @num: number of resource entries + * @reserved: reserved (must be zero) + * @offset: array of offsets pointing at the various resource entries + * + * The header of the resource table, as expressed by this structure, + * contains a version number (should we need to change this format in the + * future), the number of available resource entries, and their offsets + * in the table. + */ + struct resource_table { + u32 ver; + u32 num; + u32 reserved[2]; + u32 offset[0]; + } __packed; + +Immediately following this header are the resource entries themselves, +each of which begins with the following resource entry header:: + + /** + * struct fw_rsc_hdr - firmware resource entry header + * @type: resource type + * @data: resource data + * + * Every resource entry begins with a 'struct fw_rsc_hdr' header providing + * its @type. The content of the entry itself will immediately follow + * this header, and it should be parsed according to the resource type. + */ + struct fw_rsc_hdr { + u32 type; + u8 data[0]; + } __packed; + +Some resources entries are mere announcements, where the host is informed +of specific remoteproc configuration. Other entries require the host to +do something (e.g. allocate a system resource). Sometimes a negotiation +is expected, where the firmware requests a resource, and once allocated, +the host should provide back its details (e.g. address of an allocated +memory region). + +Here are the various resource types that are currently supported:: + + /** + * enum fw_resource_type - types of resource entries + * + * @RSC_CARVEOUT: request for allocation of a physically contiguous + * memory region. + * @RSC_DEVMEM: request to iommu_map a memory-based peripheral. + * @RSC_TRACE: announces the availability of a trace buffer into which + * the remote processor will be writing logs. + * @RSC_VDEV: declare support for a virtio device, and serve as its + * virtio header. + * @RSC_LAST: just keep this one at the end + * @RSC_VENDOR_START: start of the vendor specific resource types range + * @RSC_VENDOR_END: end of the vendor specific resource types range + * + * Please note that these values are used as indices to the rproc_handle_rsc + * lookup table, so please keep them sane. Moreover, @RSC_LAST is used to + * check the validity of an index before the lookup table is accessed, so + * please update it as needed. + */ + enum fw_resource_type { + RSC_CARVEOUT = 0, + RSC_DEVMEM = 1, + RSC_TRACE = 2, + RSC_VDEV = 3, + RSC_LAST = 4, + RSC_VENDOR_START = 128, + RSC_VENDOR_END = 512, + }; + +For more details regarding a specific resource type, please see its +dedicated structure in include/linux/remoteproc.h. + +We also expect that platform-specific resource entries will show up +at some point. When that happens, we could easily add a new RSC_PLATFORM +type, and hand those resources to the platform-specific rproc driver to handle. + +Virtio and remoteproc +===================== + +The firmware should provide remoteproc information about virtio devices +that it supports, and their configurations: a RSC_VDEV resource entry +should specify the virtio device id (as in virtio_ids.h), virtio features, +virtio config space, vrings information, etc. + +When a new remote processor is registered, the remoteproc framework +will look for its resource table and will register the virtio devices +it supports. A firmware may support any number of virtio devices, and +of any type (a single remote processor can also easily support several +rpmsg virtio devices this way, if desired). + +Of course, RSC_VDEV resource entries are only good enough for static +allocation of virtio devices. Dynamic allocations will also be made possible +using the rpmsg bus (similar to how we already do dynamic allocations of +rpmsg channels; read more about it in rpmsg.txt). diff --git a/Documentation/staging/rpmsg.rst b/Documentation/staging/rpmsg.rst new file mode 100644 index 000000000000..24b7a9e1a5f9 --- /dev/null +++ b/Documentation/staging/rpmsg.rst @@ -0,0 +1,341 @@ +============================================ +Remote Processor Messaging (rpmsg) Framework +============================================ + +.. note:: + + This document describes the rpmsg bus and how to write rpmsg drivers. + To learn how to add rpmsg support for new platforms, check out remoteproc.txt + (also a resident of Documentation/). + +Introduction +============ + +Modern SoCs typically employ heterogeneous remote processor devices in +asymmetric multiprocessing (AMP) configurations, which may be running +different instances of operating system, whether it's Linux or any other +flavor of real-time OS. + +OMAP4, for example, has dual Cortex-A9, dual Cortex-M3 and a C64x+ DSP. +Typically, the dual cortex-A9 is running Linux in a SMP configuration, +and each of the other three cores (two M3 cores and a DSP) is running +its own instance of RTOS in an AMP configuration. + +Typically AMP remote processors employ dedicated DSP codecs and multimedia +hardware accelerators, and therefore are often used to offload CPU-intensive +multimedia tasks from the main application processor. + +These remote processors could also be used to control latency-sensitive +sensors, drive random hardware blocks, or just perform background tasks +while the main CPU is idling. + +Users of those remote processors can either be userland apps (e.g. multimedia +frameworks talking with remote OMX components) or kernel drivers (controlling +hardware accessible only by the remote processor, reserving kernel-controlled +resources on behalf of the remote processor, etc..). + +Rpmsg is a virtio-based messaging bus that allows kernel drivers to communicate +with remote processors available on the system. In turn, drivers could then +expose appropriate user space interfaces, if needed. + +When writing a driver that exposes rpmsg communication to userland, please +keep in mind that remote processors might have direct access to the +system's physical memory and other sensitive hardware resources (e.g. on +OMAP4, remote cores and hardware accelerators may have direct access to the +physical memory, gpio banks, dma controllers, i2c bus, gptimers, mailbox +devices, hwspinlocks, etc..). Moreover, those remote processors might be +running RTOS where every task can access the entire memory/devices exposed +to the processor. To minimize the risks of rogue (or buggy) userland code +exploiting remote bugs, and by that taking over the system, it is often +desired to limit userland to specific rpmsg channels (see definition below) +it can send messages on, and if possible, minimize how much control +it has over the content of the messages. + +Every rpmsg device is a communication channel with a remote processor (thus +rpmsg devices are called channels). Channels are identified by a textual name +and have a local ("source") rpmsg address, and remote ("destination") rpmsg +address. + +When a driver starts listening on a channel, its rx callback is bound with +a unique rpmsg local address (a 32-bit integer). This way when inbound messages +arrive, the rpmsg core dispatches them to the appropriate driver according +to their destination address (this is done by invoking the driver's rx handler +with the payload of the inbound message). + + +User API +======== + +:: + + int rpmsg_send(struct rpmsg_channel *rpdev, void *data, int len); + +sends a message across to the remote processor on a given channel. +The caller should specify the channel, the data it wants to send, +and its length (in bytes). The message will be sent on the specified +channel, i.e. its source and destination address fields will be +set to the channel's src and dst addresses. + +In case there are no TX buffers available, the function will block until +one becomes available (i.e. until the remote processor consumes +a tx buffer and puts it back on virtio's used descriptor ring), +or a timeout of 15 seconds elapses. When the latter happens, +-ERESTARTSYS is returned. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: + + int rpmsg_sendto(struct rpmsg_channel *rpdev, void *data, int len, u32 dst); + +sends a message across to the remote processor on a given channel, +to a destination address provided by the caller. + +The caller should specify the channel, the data it wants to send, +its length (in bytes), and an explicit destination address. + +The message will then be sent to the remote processor to which the +channel belongs, using the channel's src address, and the user-provided +dst address (thus the channel's dst address will be ignored). + +In case there are no TX buffers available, the function will block until +one becomes available (i.e. until the remote processor consumes +a tx buffer and puts it back on virtio's used descriptor ring), +or a timeout of 15 seconds elapses. When the latter happens, +-ERESTARTSYS is returned. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: + + int rpmsg_send_offchannel(struct rpmsg_channel *rpdev, u32 src, u32 dst, + void *data, int len); + + +sends a message across to the remote processor, using the src and dst +addresses provided by the user. + +The caller should specify the channel, the data it wants to send, +its length (in bytes), and explicit source and destination addresses. +The message will then be sent to the remote processor to which the +channel belongs, but the channel's src and dst addresses will be +ignored (and the user-provided addresses will be used instead). + +In case there are no TX buffers available, the function will block until +one becomes available (i.e. until the remote processor consumes +a tx buffer and puts it back on virtio's used descriptor ring), +or a timeout of 15 seconds elapses. When the latter happens, +-ERESTARTSYS is returned. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: + + int rpmsg_trysend(struct rpmsg_channel *rpdev, void *data, int len); + +sends a message across to the remote processor on a given channel. +The caller should specify the channel, the data it wants to send, +and its length (in bytes). The message will be sent on the specified +channel, i.e. its source and destination address fields will be +set to the channel's src and dst addresses. + +In case there are no TX buffers available, the function will immediately +return -ENOMEM without waiting until one becomes available. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: + + int rpmsg_trysendto(struct rpmsg_channel *rpdev, void *data, int len, u32 dst) + + +sends a message across to the remote processor on a given channel, +to a destination address provided by the user. + +The user should specify the channel, the data it wants to send, +its length (in bytes), and an explicit destination address. + +The message will then be sent to the remote processor to which the +channel belongs, using the channel's src address, and the user-provided +dst address (thus the channel's dst address will be ignored). + +In case there are no TX buffers available, the function will immediately +return -ENOMEM without waiting until one becomes available. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: + + int rpmsg_trysend_offchannel(struct rpmsg_channel *rpdev, u32 src, u32 dst, + void *data, int len); + + +sends a message across to the remote processor, using source and +destination addresses provided by the user. + +The user should specify the channel, the data it wants to send, +its length (in bytes), and explicit source and destination addresses. +The message will then be sent to the remote processor to which the +channel belongs, but the channel's src and dst addresses will be +ignored (and the user-provided addresses will be used instead). + +In case there are no TX buffers available, the function will immediately +return -ENOMEM without waiting until one becomes available. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: + + struct rpmsg_endpoint *rpmsg_create_ept(struct rpmsg_channel *rpdev, + void (*cb)(struct rpmsg_channel *, void *, int, void *, u32), + void *priv, u32 addr); + +every rpmsg address in the system is bound to an rx callback (so when +inbound messages arrive, they are dispatched by the rpmsg bus using the +appropriate callback handler) by means of an rpmsg_endpoint struct. + +This function allows drivers to create such an endpoint, and by that, +bind a callback, and possibly some private data too, to an rpmsg address +(either one that is known in advance, or one that will be dynamically +assigned for them). + +Simple rpmsg drivers need not call rpmsg_create_ept, because an endpoint +is already created for them when they are probed by the rpmsg bus +(using the rx callback they provide when they registered to the rpmsg bus). + +So things should just work for simple drivers: they already have an +endpoint, their rx callback is bound to their rpmsg address, and when +relevant inbound messages arrive (i.e. messages which their dst address +equals to the src address of their rpmsg channel), the driver's handler +is invoked to process it. + +That said, more complicated drivers might do need to allocate +additional rpmsg addresses, and bind them to different rx callbacks. +To accomplish that, those drivers need to call this function. +Drivers should provide their channel (so the new endpoint would bind +to the same remote processor their channel belongs to), an rx callback +function, an optional private data (which is provided back when the +rx callback is invoked), and an address they want to bind with the +callback. If addr is RPMSG_ADDR_ANY, then rpmsg_create_ept will +dynamically assign them an available rpmsg address (drivers should have +a very good reason why not to always use RPMSG_ADDR_ANY here). + +Returns a pointer to the endpoint on success, or NULL on error. + +:: + + void rpmsg_destroy_ept(struct rpmsg_endpoint *ept); + + +destroys an existing rpmsg endpoint. user should provide a pointer +to an rpmsg endpoint that was previously created with rpmsg_create_ept(). + +:: + + int register_rpmsg_driver(struct rpmsg_driver *rpdrv); + + +registers an rpmsg driver with the rpmsg bus. user should provide +a pointer to an rpmsg_driver struct, which contains the driver's +->probe() and ->remove() functions, an rx callback, and an id_table +specifying the names of the channels this driver is interested to +be probed with. + +:: + + void unregister_rpmsg_driver(struct rpmsg_driver *rpdrv); + + +unregisters an rpmsg driver from the rpmsg bus. user should provide +a pointer to a previously-registered rpmsg_driver struct. +Returns 0 on success, and an appropriate error value on failure. + + +Typical usage +============= + +The following is a simple rpmsg driver, that sends an "hello!" message +on probe(), and whenever it receives an incoming message, it dumps its +content to the console. + +:: + + #include <linux/kernel.h> + #include <linux/module.h> + #include <linux/rpmsg.h> + + static void rpmsg_sample_cb(struct rpmsg_channel *rpdev, void *data, int len, + void *priv, u32 src) + { + print_hex_dump(KERN_INFO, "incoming message:", DUMP_PREFIX_NONE, + 16, 1, data, len, true); + } + + static int rpmsg_sample_probe(struct rpmsg_channel *rpdev) + { + int err; + + dev_info(&rpdev->dev, "chnl: 0x%x -> 0x%x\n", rpdev->src, rpdev->dst); + + /* send a message on our channel */ + err = rpmsg_send(rpdev, "hello!", 6); + if (err) { + pr_err("rpmsg_send failed: %d\n", err); + return err; + } + + return 0; + } + + static void rpmsg_sample_remove(struct rpmsg_channel *rpdev) + { + dev_info(&rpdev->dev, "rpmsg sample client driver is removed\n"); + } + + static struct rpmsg_device_id rpmsg_driver_sample_id_table[] = { + { .name = "rpmsg-client-sample" }, + { }, + }; + MODULE_DEVICE_TABLE(rpmsg, rpmsg_driver_sample_id_table); + + static struct rpmsg_driver rpmsg_sample_client = { + .drv.name = KBUILD_MODNAME, + .id_table = rpmsg_driver_sample_id_table, + .probe = rpmsg_sample_probe, + .callback = rpmsg_sample_cb, + .remove = rpmsg_sample_remove, + }; + module_rpmsg_driver(rpmsg_sample_client); + +.. note:: + + a similar sample which can be built and loaded can be found + in samples/rpmsg/. + +Allocations of rpmsg channels +============================= + +At this point we only support dynamic allocations of rpmsg channels. + +This is possible only with remote processors that have the VIRTIO_RPMSG_F_NS +virtio device feature set. This feature bit means that the remote +processor supports dynamic name service announcement messages. + +When this feature is enabled, creation of rpmsg devices (i.e. channels) +is completely dynamic: the remote processor announces the existence of a +remote rpmsg service by sending a name service message (which contains +the name and rpmsg addr of the remote service, see struct rpmsg_ns_msg). + +This message is then handled by the rpmsg bus, which in turn dynamically +creates and registers an rpmsg channel (which represents the remote service). +If/when a relevant rpmsg driver is registered, it will be immediately probed +by the bus, and can then start sending messages to the remote service. + +The plan is also to add static creation of rpmsg channels via the virtio +config space, but it's not implemented yet. diff --git a/Documentation/staging/speculation.rst b/Documentation/staging/speculation.rst new file mode 100644 index 000000000000..8045d99bcf12 --- /dev/null +++ b/Documentation/staging/speculation.rst @@ -0,0 +1,92 @@ +=========== +Speculation +=========== + +This document explains potential effects of speculation, and how undesirable +effects can be mitigated portably using common APIs. + +------------------------------------------------------------------------------ + +To improve performance and minimize average latencies, many contemporary CPUs +employ speculative execution techniques such as branch prediction, performing +work which may be discarded at a later stage. + +Typically speculative execution cannot be observed from architectural state, +such as the contents of registers. However, in some cases it is possible to +observe its impact on microarchitectural state, such as the presence or +absence of data in caches. Such state may form side-channels which can be +observed to extract secret information. + +For example, in the presence of branch prediction, it is possible for bounds +checks to be ignored by code which is speculatively executed. Consider the +following code:: + + int load_array(int *array, unsigned int index) + { + if (index >= MAX_ARRAY_ELEMS) + return 0; + else + return array[index]; + } + +Which, on arm64, may be compiled to an assembly sequence such as:: + + CMP <index>, #MAX_ARRAY_ELEMS + B.LT less + MOV <returnval>, #0 + RET + less: + LDR <returnval>, [<array>, <index>] + RET + +It is possible that a CPU mis-predicts the conditional branch, and +speculatively loads array[index], even if index >= MAX_ARRAY_ELEMS. This +value will subsequently be discarded, but the speculated load may affect +microarchitectural state which can be subsequently measured. + +More complex sequences involving multiple dependent memory accesses may +result in sensitive information being leaked. Consider the following +code, building on the prior example:: + + int load_dependent_arrays(int *arr1, int *arr2, int index) + { + int val1, val2, + + val1 = load_array(arr1, index); + val2 = load_array(arr2, val1); + + return val2; + } + +Under speculation, the first call to load_array() may return the value +of an out-of-bounds address, while the second call will influence +microarchitectural state dependent on this value. This may provide an +arbitrary read primitive. + +==================================== +Mitigating speculation side-channels +==================================== + +The kernel provides a generic API to ensure that bounds checks are +respected even under speculation. Architectures which are affected by +speculation-based side-channels are expected to implement these +primitives. + +The array_index_nospec() helper in <linux/nospec.h> can be used to +prevent information from being leaked via side-channels. + +A call to array_index_nospec(index, size) returns a sanitized index +value that is bounded to [0, size) even under cpu speculation +conditions. + +This can be used to protect the earlier load_array() example:: + + int load_array(int *array, unsigned int index) + { + if (index >= MAX_ARRAY_ELEMS) + return 0; + else { + index = array_index_nospec(index, MAX_ARRAY_ELEMS); + return array[index]; + } + } diff --git a/Documentation/staging/static-keys.rst b/Documentation/staging/static-keys.rst new file mode 100644 index 000000000000..38290b9f25eb --- /dev/null +++ b/Documentation/staging/static-keys.rst @@ -0,0 +1,331 @@ +=========== +Static Keys +=========== + +.. warning:: + + DEPRECATED API: + + The use of 'struct static_key' directly, is now DEPRECATED. In addition + static_key_{true,false}() is also DEPRECATED. IE DO NOT use the following:: + + struct static_key false = STATIC_KEY_INIT_FALSE; + struct static_key true = STATIC_KEY_INIT_TRUE; + static_key_true() + static_key_false() + + The updated API replacements are:: + + DEFINE_STATIC_KEY_TRUE(key); + DEFINE_STATIC_KEY_FALSE(key); + DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count); + DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count); + static_branch_likely() + static_branch_unlikely() + +Abstract +======== + +Static keys allows the inclusion of seldom used features in +performance-sensitive fast-path kernel code, via a GCC feature and a code +patching technique. A quick example:: + + DEFINE_STATIC_KEY_FALSE(key); + + ... + + if (static_branch_unlikely(&key)) + do unlikely code + else + do likely code + + ... + static_branch_enable(&key); + ... + static_branch_disable(&key); + ... + +The static_branch_unlikely() branch will be generated into the code with as little +impact to the likely code path as possible. + + +Motivation +========== + + +Currently, tracepoints are implemented using a conditional branch. The +conditional check requires checking a global variable for each tracepoint. +Although the overhead of this check is small, it increases when the memory +cache comes under pressure (memory cache lines for these global variables may +be shared with other memory accesses). As we increase the number of tracepoints +in the kernel this overhead may become more of an issue. In addition, +tracepoints are often dormant (disabled) and provide no direct kernel +functionality. Thus, it is highly desirable to reduce their impact as much as +possible. Although tracepoints are the original motivation for this work, other +kernel code paths should be able to make use of the static keys facility. + + +Solution +======== + + +gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label: + +https://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html + +Using the 'asm goto', we can create branches that are either taken or not taken +by default, without the need to check memory. Then, at run-time, we can patch +the branch site to change the branch direction. + +For example, if we have a simple branch that is disabled by default:: + + if (static_branch_unlikely(&key)) + printk("I am the true branch\n"); + +Thus, by default the 'printk' will not be emitted. And the code generated will +consist of a single atomic 'no-op' instruction (5 bytes on x86), in the +straight-line code path. When the branch is 'flipped', we will patch the +'no-op' in the straight-line codepath with a 'jump' instruction to the +out-of-line true branch. Thus, changing branch direction is expensive but +branch selection is basically 'free'. That is the basic tradeoff of this +optimization. + +This lowlevel patching mechanism is called 'jump label patching', and it gives +the basis for the static keys facility. + +Static key label API, usage and examples +======================================== + + +In order to make use of this optimization you must first define a key:: + + DEFINE_STATIC_KEY_TRUE(key); + +or:: + + DEFINE_STATIC_KEY_FALSE(key); + + +The key must be global, that is, it can't be allocated on the stack or dynamically +allocated at run-time. + +The key is then used in code as:: + + if (static_branch_unlikely(&key)) + do unlikely code + else + do likely code + +Or:: + + if (static_branch_likely(&key)) + do likely code + else + do unlikely code + +Keys defined via DEFINE_STATIC_KEY_TRUE(), or DEFINE_STATIC_KEY_FALSE, may +be used in either static_branch_likely() or static_branch_unlikely() +statements. + +Branch(es) can be set true via:: + + static_branch_enable(&key); + +or false via:: + + static_branch_disable(&key); + +The branch(es) can then be switched via reference counts:: + + static_branch_inc(&key); + ... + static_branch_dec(&key); + +Thus, 'static_branch_inc()' means 'make the branch true', and +'static_branch_dec()' means 'make the branch false' with appropriate +reference counting. For example, if the key is initialized true, a +static_branch_dec(), will switch the branch to false. And a subsequent +static_branch_inc(), will change the branch back to true. Likewise, if the +key is initialized false, a 'static_branch_inc()', will change the branch to +true. And then a 'static_branch_dec()', will again make the branch false. + +The state and the reference count can be retrieved with 'static_key_enabled()' +and 'static_key_count()'. In general, if you use these functions, they +should be protected with the same mutex used around the enable/disable +or increment/decrement function. + +Note that switching branches results in some locks being taken, +particularly the CPU hotplug lock (in order to avoid races against +CPUs being brought in the kernel while the kernel is getting +patched). Calling the static key API from within a hotplug notifier is +thus a sure deadlock recipe. In order to still allow use of the +functionality, the following functions are provided: + + static_key_enable_cpuslocked() + static_key_disable_cpuslocked() + static_branch_enable_cpuslocked() + static_branch_disable_cpuslocked() + +These functions are *not* general purpose, and must only be used when +you really know that you're in the above context, and no other. + +Where an array of keys is required, it can be defined as:: + + DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count); + +or:: + + DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count); + +4) Architecture level code patching interface, 'jump labels' + + +There are a few functions and macros that architectures must implement in order +to take advantage of this optimization. If there is no architecture support, we +simply fall back to a traditional, load, test, and jump sequence. Also, the +struct jump_entry table must be at least 4-byte aligned because the +static_key->entry field makes use of the two least significant bits. + +* ``select HAVE_ARCH_JUMP_LABEL``, + see: arch/x86/Kconfig + +* ``#define JUMP_LABEL_NOP_SIZE``, + see: arch/x86/include/asm/jump_label.h + +* ``__always_inline bool arch_static_branch(struct static_key *key, bool branch)``, + see: arch/x86/include/asm/jump_label.h + +* ``__always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)``, + see: arch/x86/include/asm/jump_label.h + +* ``void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type)``, + see: arch/x86/kernel/jump_label.c + +* ``__init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type)``, + see: arch/x86/kernel/jump_label.c + +* ``struct jump_entry``, + see: arch/x86/include/asm/jump_label.h + + +5) Static keys / jump label analysis, results (x86_64): + + +As an example, let's add the following branch to 'getppid()', such that the +system call now looks like:: + + SYSCALL_DEFINE0(getppid) + { + int pid; + + + if (static_branch_unlikely(&key)) + + printk("I am the true branch\n"); + + rcu_read_lock(); + pid = task_tgid_vnr(rcu_dereference(current->real_parent)); + rcu_read_unlock(); + + return pid; + } + +The resulting instructions with jump labels generated by GCC is:: + + ffffffff81044290 <sys_getppid>: + ffffffff81044290: 55 push %rbp + ffffffff81044291: 48 89 e5 mov %rsp,%rbp + ffffffff81044294: e9 00 00 00 00 jmpq ffffffff81044299 <sys_getppid+0x9> + ffffffff81044299: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax + ffffffff810442a0: 00 00 + ffffffff810442a2: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax + ffffffff810442a9: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax + ffffffff810442b0: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi + ffffffff810442b7: e8 f4 d9 00 00 callq ffffffff81051cb0 <pid_vnr> + ffffffff810442bc: 5d pop %rbp + ffffffff810442bd: 48 98 cltq + ffffffff810442bf: c3 retq + ffffffff810442c0: 48 c7 c7 e3 54 98 81 mov $0xffffffff819854e3,%rdi + ffffffff810442c7: 31 c0 xor %eax,%eax + ffffffff810442c9: e8 71 13 6d 00 callq ffffffff8171563f <printk> + ffffffff810442ce: eb c9 jmp ffffffff81044299 <sys_getppid+0x9> + +Without the jump label optimization it looks like:: + + ffffffff810441f0 <sys_getppid>: + ffffffff810441f0: 8b 05 8a 52 d8 00 mov 0xd8528a(%rip),%eax # ffffffff81dc9480 <key> + ffffffff810441f6: 55 push %rbp + ffffffff810441f7: 48 89 e5 mov %rsp,%rbp + ffffffff810441fa: 85 c0 test %eax,%eax + ffffffff810441fc: 75 27 jne ffffffff81044225 <sys_getppid+0x35> + ffffffff810441fe: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax + ffffffff81044205: 00 00 + ffffffff81044207: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax + ffffffff8104420e: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax + ffffffff81044215: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi + ffffffff8104421c: e8 2f da 00 00 callq ffffffff81051c50 <pid_vnr> + ffffffff81044221: 5d pop %rbp + ffffffff81044222: 48 98 cltq + ffffffff81044224: c3 retq + ffffffff81044225: 48 c7 c7 13 53 98 81 mov $0xffffffff81985313,%rdi + ffffffff8104422c: 31 c0 xor %eax,%eax + ffffffff8104422e: e8 60 0f 6d 00 callq ffffffff81715193 <printk> + ffffffff81044233: eb c9 jmp ffffffff810441fe <sys_getppid+0xe> + ffffffff81044235: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1) + ffffffff8104423c: 00 00 00 00 + +Thus, the disable jump label case adds a 'mov', 'test' and 'jne' instruction +vs. the jump label case just has a 'no-op' or 'jmp 0'. (The jmp 0, is patched +to a 5 byte atomic no-op instruction at boot-time.) Thus, the disabled jump +label case adds:: + + 6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 addition bytes. + +If we then include the padding bytes, the jump label code saves, 16 total bytes +of instruction memory for this small function. In this case the non-jump label +function is 80 bytes long. Thus, we have saved 20% of the instruction +footprint. We can in fact improve this even further, since the 5-byte no-op +really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp. +However, we have not yet implemented optimal no-op sizes (they are currently +hard-coded). + +Since there are a number of static key API uses in the scheduler paths, +'pipe-test' (also known as 'perf bench sched pipe') can be used to show the +performance improvement. Testing done on 3.3.0-rc2: + +jump label disabled:: + + Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs): + + 855.700314 task-clock # 0.534 CPUs utilized ( +- 0.11% ) + 200,003 context-switches # 0.234 M/sec ( +- 0.00% ) + 0 CPU-migrations # 0.000 M/sec ( +- 39.58% ) + 487 page-faults # 0.001 M/sec ( +- 0.02% ) + 1,474,374,262 cycles # 1.723 GHz ( +- 0.17% ) + <not supported> stalled-cycles-frontend + <not supported> stalled-cycles-backend + 1,178,049,567 instructions # 0.80 insns per cycle ( +- 0.06% ) + 208,368,926 branches # 243.507 M/sec ( +- 0.06% ) + 5,569,188 branch-misses # 2.67% of all branches ( +- 0.54% ) + + 1.601607384 seconds time elapsed ( +- 0.07% ) + +jump label enabled:: + + Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs): + + 841.043185 task-clock # 0.533 CPUs utilized ( +- 0.12% ) + 200,004 context-switches # 0.238 M/sec ( +- 0.00% ) + 0 CPU-migrations # 0.000 M/sec ( +- 40.87% ) + 487 page-faults # 0.001 M/sec ( +- 0.05% ) + 1,432,559,428 cycles # 1.703 GHz ( +- 0.18% ) + <not supported> stalled-cycles-frontend + <not supported> stalled-cycles-backend + 1,175,363,994 instructions # 0.82 insns per cycle ( +- 0.04% ) + 206,859,359 branches # 245.956 M/sec ( +- 0.04% ) + 4,884,119 branch-misses # 2.36% of all branches ( +- 0.85% ) + + 1.579384366 seconds time elapsed + +The percentage of saved branches is .7%, and we've saved 12% on +'branch-misses'. This is where we would expect to get the most savings, since +this optimization is about reducing the number of branches. In addition, we've +saved .2% on instructions, and 2.8% on cycles and 1.4% on elapsed time. diff --git a/Documentation/staging/tee.rst b/Documentation/staging/tee.rst new file mode 100644 index 000000000000..62e8ba64d04f --- /dev/null +++ b/Documentation/staging/tee.rst @@ -0,0 +1,277 @@ +============= +TEE subsystem +============= + +This document describes the TEE subsystem in Linux. + +A TEE (Trusted Execution Environment) is a trusted OS running in some +secure environment, for example, TrustZone on ARM CPUs, or a separate +secure co-processor etc. A TEE driver handles the details needed to +communicate with the TEE. + +This subsystem deals with: + +- Registration of TEE drivers + +- Managing shared memory between Linux and the TEE + +- Providing a generic API to the TEE + +The TEE interface +================= + +include/uapi/linux/tee.h defines the generic interface to a TEE. + +User space (the client) connects to the driver by opening /dev/tee[0-9]* or +/dev/teepriv[0-9]*. + +- TEE_IOC_SHM_ALLOC allocates shared memory and returns a file descriptor + which user space can mmap. When user space doesn't need the file + descriptor any more, it should be closed. When shared memory isn't needed + any longer it should be unmapped with munmap() to allow the reuse of + memory. + +- TEE_IOC_VERSION lets user space know which TEE this driver handles and + its capabilities. + +- TEE_IOC_OPEN_SESSION opens a new session to a Trusted Application. + +- TEE_IOC_INVOKE invokes a function in a Trusted Application. + +- TEE_IOC_CANCEL may cancel an ongoing TEE_IOC_OPEN_SESSION or TEE_IOC_INVOKE. + +- TEE_IOC_CLOSE_SESSION closes a session to a Trusted Application. + +There are two classes of clients, normal clients and supplicants. The latter is +a helper process for the TEE to access resources in Linux, for example file +system access. A normal client opens /dev/tee[0-9]* and a supplicant opens +/dev/teepriv[0-9]. + +Much of the communication between clients and the TEE is opaque to the +driver. The main job for the driver is to receive requests from the +clients, forward them to the TEE and send back the results. In the case of +supplicants the communication goes in the other direction, the TEE sends +requests to the supplicant which then sends back the result. + +The TEE kernel interface +======================== + +Kernel provides a TEE bus infrastructure where a Trusted Application is +represented as a device identified via Universally Unique Identifier (UUID) and +client drivers register a table of supported device UUIDs. + +TEE bus infrastructure registers following APIs: +- match(): iterates over the client driver UUID table to find a corresponding + match for device UUID. If a match is found, then this particular device is + probed via corresponding probe API registered by the client driver. This + process happens whenever a device or a client driver is registered with TEE + bus. +- uevent(): notifies user-space (udev) whenever a new device is registered on + TEE bus for auto-loading of modularized client drivers. + +TEE bus device enumeration is specific to underlying TEE implementation, so it +is left open for TEE drivers to provide corresponding implementation. + +Then TEE client driver can talk to a matched Trusted Application using APIs +listed in include/linux/tee_drv.h. + +TEE client driver example +------------------------- + +Suppose a TEE client driver needs to communicate with a Trusted Application +having UUID: ``ac6a4085-0e82-4c33-bf98-8eb8e118b6c2``, so driver registration +snippet would look like:: + + static const struct tee_client_device_id client_id_table[] = { + {UUID_INIT(0xac6a4085, 0x0e82, 0x4c33, + 0xbf, 0x98, 0x8e, 0xb8, 0xe1, 0x18, 0xb6, 0xc2)}, + {} + }; + + MODULE_DEVICE_TABLE(tee, client_id_table); + + static struct tee_client_driver client_driver = { + .id_table = client_id_table, + .driver = { + .name = DRIVER_NAME, + .bus = &tee_bus_type, + .probe = client_probe, + .remove = client_remove, + }, + }; + + static int __init client_init(void) + { + return driver_register(&client_driver.driver); + } + + static void __exit client_exit(void) + { + driver_unregister(&client_driver.driver); + } + + module_init(client_init); + module_exit(client_exit); + +OP-TEE driver +============= + +The OP-TEE driver handles OP-TEE [1] based TEEs. Currently it is only the ARM +TrustZone based OP-TEE solution that is supported. + +Lowest level of communication with OP-TEE builds on ARM SMC Calling +Convention (SMCCC) [2], which is the foundation for OP-TEE's SMC interface +[3] used internally by the driver. Stacked on top of that is OP-TEE Message +Protocol [4]. + +OP-TEE SMC interface provides the basic functions required by SMCCC and some +additional functions specific for OP-TEE. The most interesting functions are: + +- OPTEE_SMC_FUNCID_CALLS_UID (part of SMCCC) returns the version information + which is then returned by TEE_IOC_VERSION + +- OPTEE_SMC_CALL_GET_OS_UUID returns the particular OP-TEE implementation, used + to tell, for instance, a TrustZone OP-TEE apart from an OP-TEE running on a + separate secure co-processor. + +- OPTEE_SMC_CALL_WITH_ARG drives the OP-TEE message protocol + +- OPTEE_SMC_GET_SHM_CONFIG lets the driver and OP-TEE agree on which memory + range to used for shared memory between Linux and OP-TEE. + +The GlobalPlatform TEE Client API [5] is implemented on top of the generic +TEE API. + +Picture of the relationship between the different components in the +OP-TEE architecture:: + + User space Kernel Secure world + ~~~~~~~~~~ ~~~~~~ ~~~~~~~~~~~~ + +--------+ +-------------+ + | Client | | Trusted | + +--------+ | Application | + /\ +-------------+ + || +----------+ /\ + || |tee- | || + || |supplicant| \/ + || +----------+ +-------------+ + \/ /\ | TEE Internal| + +-------+ || | API | + + TEE | || +--------+--------+ +-------------+ + | Client| || | TEE | OP-TEE | | OP-TEE | + | API | \/ | subsys | driver | | Trusted OS | + +-------+----------------+----+-------+----+-----------+-------------+ + | Generic TEE API | | OP-TEE MSG | + | IOCTL (TEE_IOC_*) | | SMCCC (OPTEE_SMC_CALL_*) | + +-----------------------------+ +------------------------------+ + +RPC (Remote Procedure Call) are requests from secure world to kernel driver +or tee-supplicant. An RPC is identified by a special range of SMCCC return +values from OPTEE_SMC_CALL_WITH_ARG. RPC messages which are intended for the +kernel are handled by the kernel driver. Other RPC messages will be forwarded to +tee-supplicant without further involvement of the driver, except switching +shared memory buffer representation. + +OP-TEE device enumeration +------------------------- + +OP-TEE provides a pseudo Trusted Application: drivers/tee/optee/device.c in +order to support device enumeration. In other words, OP-TEE driver invokes this +application to retrieve a list of Trusted Applications which can be registered +as devices on the TEE bus. + +AMD-TEE driver +============== + +The AMD-TEE driver handles the communication with AMD's TEE environment. The +TEE environment is provided by AMD Secure Processor. + +The AMD Secure Processor (formerly called Platform Security Processor or PSP) +is a dedicated processor that features ARM TrustZone technology, along with a +software-based Trusted Execution Environment (TEE) designed to enable +third-party Trusted Applications. This feature is currently enabled only for +APUs. + +The following picture shows a high level overview of AMD-TEE:: + + | + x86 | + | + User space (Kernel space) | AMD Secure Processor (PSP) + ~~~~~~~~~~ ~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~~~~~~~~~~ + | + +--------+ | +-------------+ + | Client | | | Trusted | + +--------+ | | Application | + /\ | +-------------+ + || | /\ + || | || + || | \/ + || | +----------+ + || | | TEE | + || | | Internal | + \/ | | API | + +---------+ +-----------+---------+ +----------+ + | TEE | | TEE | AMD-TEE | | AMD-TEE | + | Client | | subsystem | driver | | Trusted | + | API | | | | | OS | + +---------+-----------+----+------+---------+---------+----------+ + | Generic TEE API | | ASP | Mailbox | + | IOCTL (TEE_IOC_*) | | driver | Register Protocol | + +--------------------------+ +---------+--------------------+ + +At the lowest level (in x86), the AMD Secure Processor (ASP) driver uses the +CPU to PSP mailbox regsister to submit commands to the PSP. The format of the +command buffer is opaque to the ASP driver. It's role is to submit commands to +the secure processor and return results to AMD-TEE driver. The interface +between AMD-TEE driver and AMD Secure Processor driver can be found in [6]. + +The AMD-TEE driver packages the command buffer payload for processing in TEE. +The command buffer format for the different TEE commands can be found in [7]. + +The TEE commands supported by AMD-TEE Trusted OS are: + +* TEE_CMD_ID_LOAD_TA - loads a Trusted Application (TA) binary into + TEE environment. +* TEE_CMD_ID_UNLOAD_TA - unloads TA binary from TEE environment. +* TEE_CMD_ID_OPEN_SESSION - opens a session with a loaded TA. +* TEE_CMD_ID_CLOSE_SESSION - closes session with loaded TA +* TEE_CMD_ID_INVOKE_CMD - invokes a command with loaded TA +* TEE_CMD_ID_MAP_SHARED_MEM - maps shared memory +* TEE_CMD_ID_UNMAP_SHARED_MEM - unmaps shared memory + +AMD-TEE Trusted OS is the firmware running on AMD Secure Processor. + +The AMD-TEE driver registers itself with TEE subsystem and implements the +following driver function callbacks: + +* get_version - returns the driver implementation id and capability. +* open - sets up the driver context data structure. +* release - frees up driver resources. +* open_session - loads the TA binary and opens session with loaded TA. +* close_session - closes session with loaded TA and unloads it. +* invoke_func - invokes a command with loaded TA. + +cancel_req driver callback is not supported by AMD-TEE. + +The GlobalPlatform TEE Client API [5] can be used by the user space (client) to +talk to AMD's TEE. AMD's TEE provides a secure environment for loading, opening +a session, invoking commands and clossing session with TA. + +References +========== + +[1] https://github.com/OP-TEE/optee_os + +[2] http://infocenter.arm.com/help/topic/com.arm.doc.den0028a/index.html + +[3] drivers/tee/optee/optee_smc.h + +[4] drivers/tee/optee/optee_msg.h + +[5] http://www.globalplatform.org/specificationsdevice.asp look for + "TEE Client API Specification v1.0" and click download. + +[6] include/linux/psp-tee.h + +[7] drivers/tee/amdtee/amdtee_if.h diff --git a/Documentation/staging/xz.rst b/Documentation/staging/xz.rst new file mode 100644 index 000000000000..b2f5ff12a161 --- /dev/null +++ b/Documentation/staging/xz.rst @@ -0,0 +1,127 @@ +============================ +XZ data compression in Linux +============================ + +Introduction +============ + +XZ is a general purpose data compression format with high compression +ratio and relatively fast decompression. The primary compression +algorithm (filter) is LZMA2. Additional filters can be used to improve +compression ratio even further. E.g. Branch/Call/Jump (BCJ) filters +improve compression ratio of executable data. + +The XZ decompressor in Linux is called XZ Embedded. It supports +the LZMA2 filter and optionally also BCJ filters. CRC32 is supported +for integrity checking. The home page of XZ Embedded is at +<https://tukaani.org/xz/embedded.html>, where you can find the +latest version and also information about using the code outside +the Linux kernel. + +For userspace, XZ Utils provide a zlib-like compression library +and a gzip-like command line tool. XZ Utils can be downloaded from +<https://tukaani.org/xz/>. + +XZ related components in the kernel +=================================== + +The xz_dec module provides XZ decompressor with single-call (buffer +to buffer) and multi-call (stateful) APIs. The usage of the xz_dec +module is documented in include/linux/xz.h. + +The xz_dec_test module is for testing xz_dec. xz_dec_test is not +useful unless you are hacking the XZ decompressor. xz_dec_test +allocates a char device major dynamically to which one can write +.xz files from userspace. The decompressed output is thrown away. +Keep an eye on dmesg to see diagnostics printed by xz_dec_test. +See the xz_dec_test source code for the details. + +For decompressing the kernel image, initramfs, and initrd, there +is a wrapper function in lib/decompress_unxz.c. Its API is the +same as in other decompress_*.c files, which is defined in +include/linux/decompress/generic.h. + +scripts/xz_wrap.sh is a wrapper for the xz command line tool found +from XZ Utils. The wrapper sets compression options to values suitable +for compressing the kernel image. + +For kernel makefiles, two commands are provided for use with +$(call if_needed). The kernel image should be compressed with +$(call if_needed,xzkern) which will use a BCJ filter and a big LZMA2 +dictionary. It will also append a four-byte trailer containing the +uncompressed size of the file, which is needed by the boot code. +Other things should be compressed with $(call if_needed,xzmisc) +which will use no BCJ filter and 1 MiB LZMA2 dictionary. + +Notes on compression options +============================ + +Since the XZ Embedded supports only streams with no integrity check or +CRC32, make sure that you don't use some other integrity check type +when encoding files that are supposed to be decoded by the kernel. With +liblzma, you need to use either LZMA_CHECK_NONE or LZMA_CHECK_CRC32 +when encoding. With the xz command line tool, use --check=none or +--check=crc32. + +Using CRC32 is strongly recommended unless there is some other layer +which will verify the integrity of the uncompressed data anyway. +Double checking the integrity would probably be waste of CPU cycles. +Note that the headers will always have a CRC32 which will be validated +by the decoder; you can only change the integrity check type (or +disable it) for the actual uncompressed data. + +In userspace, LZMA2 is typically used with dictionary sizes of several +megabytes. The decoder needs to have the dictionary in RAM, thus big +dictionaries cannot be used for files that are intended to be decoded +by the kernel. 1 MiB is probably the maximum reasonable dictionary +size for in-kernel use (maybe more is OK for initramfs). The presets +in XZ Utils may not be optimal when creating files for the kernel, +so don't hesitate to use custom settings. Example:: + + xz --check=crc32 --lzma2=dict=512KiB inputfile + +An exception to above dictionary size limitation is when the decoder +is used in single-call mode. Decompressing the kernel itself is an +example of this situation. In single-call mode, the memory usage +doesn't depend on the dictionary size, and it is perfectly fine to +use a big dictionary: for maximum compression, the dictionary should +be at least as big as the uncompressed data itself. + +Future plans +============ + +Creating a limited XZ encoder may be considered if people think it is +useful. LZMA2 is slower to compress than e.g. Deflate or LZO even at +the fastest settings, so it isn't clear if LZMA2 encoder is wanted +into the kernel. + +Support for limited random-access reading is planned for the +decompression code. I don't know if it could have any use in the +kernel, but I know that it would be useful in some embedded projects +outside the Linux kernel. + +Conformance to the .xz file format specification +================================================ + +There are a couple of corner cases where things have been simplified +at expense of detecting errors as early as possible. These should not +matter in practice all, since they don't cause security issues. But +it is good to know this if testing the code e.g. with the test files +from XZ Utils. + +Reporting bugs +============== + +Before reporting a bug, please check that it's not fixed already +at upstream. See <https://tukaani.org/xz/embedded.html> to get the +latest code. + +Report bugs to <lasse.collin@tukaani.org> or visit #tukaani on +Freenode and talk to Larhzu. I don't actively read LKML or other +kernel-related mailing lists, so if there's something I should know, +you should email to me personally or use IRC. + +Don't bother Igor Pavlov with questions about the XZ implementation +in the kernel or about XZ Utils. While these two implementations +include essential code that is directly based on Igor Pavlov's code, +these implementations aren't maintained nor supported by him. |