summaryrefslogtreecommitdiff
path: root/tools/include
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2022-10-04 23:38:03 +0300
committerLinus Torvalds <torvalds@linux-foundation.org>2022-10-04 23:38:03 +0300
commit0326074ff4652329f2a1a9c8685104576bd8d131 (patch)
tree9a7574c7ccb05bf4c7cb34fc5a65457bb8f495cb /tools/include
parent522667b24f08009591c90e75bfe2ffb67f555498 (diff)
parent681bf011b9b5989c6e9db6beb64494918aab9a43 (diff)
downloadlinux-0326074ff4652329f2a1a9c8685104576bd8d131.tar.xz
Merge tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski: "Core: - Introduce and use a single page frag cache for allocating small skb heads, clawing back the 10-20% performance regression in UDP flood test from previous fixes. - Run packets which already went thru HW coalescing thru SW GRO. This significantly improves TCP segment coalescing and simplifies deployments as different workloads benefit from HW or SW GRO. - Shrink the size of the base zero-copy send structure. - Move TCP init under a new slow / sleepable version of DO_ONCE(). BPF: - Add BPF-specific, any-context-safe memory allocator. - Add helpers/kfuncs for PKCS#7 signature verification from BPF programs. - Define a new map type and related helpers for user space -> kernel communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF). - Allow targeting BPF iterators to loop through resources of one task/thread. - Add ability to call selected destructive functions. Expose crash_kexec() to allow BPF to trigger a kernel dump. Use CAP_SYS_BOOT check on the loading process to judge permissions. - Enable BPF to collect custom hierarchical cgroup stats efficiently by integrating with the rstat framework. - Support struct arguments for trampoline based programs. Only structs with size <= 16B and x86 are supported. - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping sockets (instead of just TCP and UDP sockets). - Add a helper for accessing CLOCK_TAI for time sensitive network related programs. - Support accessing network tunnel metadata's flags. - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open. - Add support for writing to Netfilter's nf_conn:mark. Protocols: - WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation (MLO) work (802.11be, WiFi 7). - vsock: improve support for SO_RCVLOWAT. - SMC: support SO_REUSEPORT. - Netlink: define and document how to use netlink in a "modern" way. Support reporting missing attributes via extended ACK. - IPSec: support collect metadata mode for xfrm interfaces. - TCPv6: send consistent autoflowlabel in SYN_RECV state and RST packets. - TCP: introduce optional per-netns connection hash table to allow better isolation between namespaces (opt-in, at the cost of memory and cache pressure). - MPTCP: support TCP_FASTOPEN_CONNECT. - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior. - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets. - Open vSwitch: - Allow specifying ifindex of new interfaces. - Allow conntrack and metering in non-initial user namespace. - TLS: support the Korean ARIA-GCM crypto algorithm. - Remove DECnet support. Driver API: - Allow selecting the conduit interface used by each port in DSA switches, at runtime. - Ethernet Power Sourcing Equipment and Power Device support. - Add tc-taprio support for queueMaxSDU parameter, i.e. setting per traffic class max frame size for time-based packet schedules. - Support PHY rate matching - adapting between differing host-side and link-side speeds. - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode. - Validate OF (device tree) nodes for DSA shared ports; make phylink-related properties mandatory on DSA and CPU ports. Enforcing more uniformity should allow transitioning to phylink. - Require that flash component name used during update matches one of the components for which version is reported by info_get(). - Remove "weight" argument from driver-facing NAPI API as much as possible. It's one of those magic knobs which seemed like a good idea at the time but is too indirect to use in practice. - Support offload of TLS connections with 256 bit keys. New hardware / drivers: - Ethernet: - Microchip KSZ9896 6-port Gigabit Ethernet Switch - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs - Analog Devices ADIN1110 and ADIN2111 industrial single pair Ethernet (10BASE-T1L) MAC+PHY. - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP). - Ethernet SFPs / modules: - RollBall / Hilink / Turris 10G copper SFPs - HALNy GPON module - WiFi: - CYW43439 SDIO chipset (brcmfmac) - CYW89459 PCIe chipset (brcmfmac) - BCM4378 on Apple platforms (brcmfmac) Drivers: - CAN: - gs_usb: HW timestamp support - Ethernet PHYs: - lan8814: cable diagnostics - Ethernet NICs: - Intel (100G): - implement control of FCS/CRC stripping - port splitting via devlink - L2TPv3 filtering offload - nVidia/Mellanox: - tunnel offload for sub-functions - MACSec offload, w/ Extended packet number and replay window offload - significantly restructure, and optimize the AF_XDP support, align the behavior with other vendors - Huawei: - configuring DSCP map for traffic class selection - querying standard FEC statistics - querying SerDes lane number via ethtool - Marvell/Cavium: - egress priority flow control - MACSec offload - AMD/SolarFlare: - PTP over IPv6 and raw Ethernet - small / embedded: - ax88772: convert to phylink (to support SFP cages) - altera: tse: convert to phylink - ftgmac100: support fixed link - enetc: standard Ethtool counters - macb: ZynqMP SGMII dynamic configuration support - tsnep: support multi-queue and use page pool - lan743x: Rx IP & TCP checksum offload - igc: add xdp frags support to ndo_xdp_xmit - Ethernet high-speed switches: - Marvell (prestera): - support SPAN port features (traffic mirroring) - nexthop object offloading - Microchip (sparx5): - multicast forwarding offload - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets) - Ethernet embedded switches: - Marvell (mv88e6xxx): - support RGMII cmode - NXP (felix): - standardized ethtool counters - Microchip (lan966x): - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets) - traffic policing and mirroring - link aggregation / bonding offload - QUSGMII PHY mode support - Qualcomm 802.11ax WiFi (ath11k): - cold boot calibration support on WCN6750 - support to connect to a non-transmit MBSSID AP profile - enable remain-on-channel support on WCN6750 - Wake-on-WLAN support for WCN6750 - support to provide transmit power from firmware via nl80211 - support to get power save duration for each client - spectral scan support for 160 MHz - MediaTek WiFi (mt76): - WiFi-to-Ethernet bridging offload for MT7986 chips - RealTek WiFi (rtw89): - P2P support" * tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits) eth: pse: add missing static inlines once: rename _SLOW to _SLEEPABLE net: pse-pd: add regulator based PSE driver dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller ethtool: add interface to interact with Ethernet Power Equipment net: mdiobus: search for PSE nodes by parsing PHY nodes. net: mdiobus: fwnode_mdiobus_register_phy() rework error handling net: add framework to support Ethernet PSE and PDs devices dt-bindings: net: phy: add PoDL PSE property net: marvell: prestera: Propagate nh state from hw to kernel net: marvell: prestera: Add neighbour cache accounting net: marvell: prestera: add stub handler neighbour events net: marvell: prestera: Add heplers to interact with fib_notifier_info net: marvell: prestera: Add length macros for prestera_ip_addr net: marvell: prestera: add delayed wq and flush wq on deinit net: marvell: prestera: Add strict cleanup of fib arbiter net: marvell: prestera: Add cleanup of allocated fib_nodes net: marvell: prestera: Add router nexthops ABI eth: octeon: fix build after netif_napi_add() changes net/mlx5: E-Switch, Return EBUSY if can't get mode lock ...
Diffstat (limited to 'tools/include')
-rw-r--r--tools/include/uapi/linux/bpf.h182
-rw-r--r--tools/include/uapi/linux/tc_act/tc_bpf.h5
2 files changed, 158 insertions, 29 deletions
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 59a217ca2dfd..51b9aa640ad2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -87,10 +87,35 @@ struct bpf_cgroup_storage_key {
__u32 attach_type; /* program attach type (enum bpf_attach_type) */
};
+enum bpf_cgroup_iter_order {
+ BPF_CGROUP_ITER_ORDER_UNSPEC = 0,
+ BPF_CGROUP_ITER_SELF_ONLY, /* process only a single object. */
+ BPF_CGROUP_ITER_DESCENDANTS_PRE, /* walk descendants in pre-order. */
+ BPF_CGROUP_ITER_DESCENDANTS_POST, /* walk descendants in post-order. */
+ BPF_CGROUP_ITER_ANCESTORS_UP, /* walk ancestors upward. */
+};
+
union bpf_iter_link_info {
struct {
__u32 map_fd;
} map;
+ struct {
+ enum bpf_cgroup_iter_order order;
+
+ /* At most one of cgroup_fd and cgroup_id can be non-zero. If
+ * both are zero, the walk starts from the default cgroup v2
+ * root. For walking v1 hierarchy, one should always explicitly
+ * specify cgroup_fd.
+ */
+ __u32 cgroup_fd;
+ __u64 cgroup_id;
+ } cgroup;
+ /* Parameters of task iterators. */
+ struct {
+ __u32 tid;
+ __u32 pid;
+ __u32 pid_fd;
+ } task;
};
/* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -909,6 +934,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_INODE_STORAGE,
BPF_MAP_TYPE_TASK_STORAGE,
BPF_MAP_TYPE_BLOOM_FILTER,
+ BPF_MAP_TYPE_USER_RINGBUF,
};
/* Note that tracing related programs such as
@@ -1233,7 +1259,7 @@ enum {
/* Query effective (directly attached + inherited from ancestor cgroups)
* programs that will be executed for events within a cgroup.
- * attach_flags with this flag are returned only for directly attached programs.
+ * attach_flags with this flag are always returned 0.
*/
#define BPF_F_QUERY_EFFECTIVE (1U << 0)
@@ -1432,7 +1458,10 @@ union bpf_attr {
__u32 attach_flags;
__aligned_u64 prog_ids;
__u32 prog_cnt;
- __aligned_u64 prog_attach_flags; /* output: per-program attach_flags */
+ /* output: per-program attach_flags.
+ * not allowed to be set during effective query.
+ */
+ __aligned_u64 prog_attach_flags;
} query;
struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
@@ -2573,10 +2602,12 @@ union bpf_attr {
* There are two supported modes at this time:
*
* * **BPF_ADJ_ROOM_MAC**: Adjust room at the mac layer
- * (room space is added or removed below the layer 2 header).
+ * (room space is added or removed between the layer 2 and
+ * layer 3 headers).
*
* * **BPF_ADJ_ROOM_NET**: Adjust room at the network layer
- * (room space is added or removed below the layer 3 header).
+ * (room space is added or removed between the layer 3 and
+ * layer 4 headers).
*
* The following flags are supported at this time:
*
@@ -3008,8 +3039,18 @@ union bpf_attr {
* **BPF_F_USER_STACK**
* Collect a user space stack instead of a kernel stack.
* **BPF_F_USER_BUILD_ID**
- * Collect buildid+offset instead of ips for user stack,
- * only valid if **BPF_F_USER_STACK** is also specified.
+ * Collect (build_id, file_offset) instead of ips for user
+ * stack, only valid if **BPF_F_USER_STACK** is also
+ * specified.
+ *
+ * *file_offset* is an offset relative to the beginning
+ * of the executable or shared object file backing the vma
+ * which the *ip* falls in. It is *not* an offset relative
+ * to that object's base address. Accordingly, it must be
+ * adjusted by adding (sh_addr - sh_offset), where
+ * sh_{addr,offset} correspond to the executable section
+ * containing *file_offset* in the object, for comparisons
+ * to symbols' st_value to be valid.
*
* **bpf_get_stack**\ () can collect up to
* **PERF_MAX_STACK_DEPTH** both kernel and user frames, subject
@@ -4425,7 +4466,7 @@ union bpf_attr {
*
* **-EEXIST** if the option already exists.
*
- * **-EFAULT** on failrue to parse the existing header options.
+ * **-EFAULT** on failure to parse the existing header options.
*
* **-EPERM** if the helper cannot be used under the current
* *skops*\ **->op**.
@@ -4634,7 +4675,7 @@ union bpf_attr {
* a *map* with *task* as the **key**. From this
* perspective, the usage is not much different from
* **bpf_map_lookup_elem**\ (*map*, **&**\ *task*) except this
- * helper enforces the key must be an task_struct and the map must also
+ * helper enforces the key must be a task_struct and the map must also
* be a **BPF_MAP_TYPE_TASK_STORAGE**.
*
* Underneath, the value is stored locally at *task* instead of
@@ -4692,7 +4733,7 @@ union bpf_attr {
*
* long bpf_ima_inode_hash(struct inode *inode, void *dst, u32 size)
* Description
- * Returns the stored IMA hash of the *inode* (if it's avaialable).
+ * Returns the stored IMA hash of the *inode* (if it's available).
* If the hash is larger than *size*, then only *size*
* bytes will be copied to *dst*
* Return
@@ -4716,12 +4757,12 @@ union bpf_attr {
*
* The argument *len_diff* can be used for querying with a planned
* size change. This allows to check MTU prior to changing packet
- * ctx. Providing an *len_diff* adjustment that is larger than the
+ * ctx. Providing a *len_diff* adjustment that is larger than the
* actual packet size (resulting in negative packet size) will in
- * principle not exceed the MTU, why it is not considered a
- * failure. Other BPF-helpers are needed for performing the
- * planned size change, why the responsability for catch a negative
- * packet size belong in those helpers.
+ * principle not exceed the MTU, which is why it is not considered
+ * a failure. Other BPF helpers are needed for performing the
+ * planned size change; therefore the responsibility for catching
+ * a negative packet size belongs in those helpers.
*
* Specifying *ifindex* zero means the MTU check is performed
* against the current net device. This is practical if this isn't
@@ -4919,6 +4960,7 @@ union bpf_attr {
* Get address of the traced function (for tracing and kprobe programs).
* Return
* Address of the traced function.
+ * 0 for kprobes placed within the function (not at the entry).
*
* u64 bpf_get_attach_cookie(void *ctx)
* Description
@@ -5048,12 +5090,12 @@ union bpf_attr {
*
* long bpf_get_func_arg(void *ctx, u32 n, u64 *value)
* Description
- * Get **n**-th argument (zero based) of the traced function (for tracing programs)
+ * Get **n**-th argument register (zero based) of the traced function (for tracing programs)
* returned in **value**.
*
* Return
* 0 on success.
- * **-EINVAL** if n >= arguments count of traced function.
+ * **-EINVAL** if n >= argument register count of traced function.
*
* long bpf_get_func_ret(void *ctx, u64 *value)
* Description
@@ -5066,24 +5108,37 @@ union bpf_attr {
*
* long bpf_get_func_arg_cnt(void *ctx)
* Description
- * Get number of arguments of the traced function (for tracing programs).
+ * Get number of registers of the traced function (for tracing programs) where
+ * function arguments are stored in these registers.
*
* Return
- * The number of arguments of the traced function.
+ * The number of argument registers of the traced function.
*
* int bpf_get_retval(void)
* Description
- * Get the syscall's return value that will be returned to userspace.
+ * Get the BPF program's return value that will be returned to the upper layers.
*
- * This helper is currently supported by cgroup programs only.
+ * This helper is currently supported by cgroup programs and only by the hooks
+ * where BPF program's return value is returned to the userspace via errno.
* Return
- * The syscall's return value.
+ * The BPF program's return value.
*
* int bpf_set_retval(int retval)
* Description
- * Set the syscall's return value that will be returned to userspace.
+ * Set the BPF program's return value that will be returned to the upper layers.
+ *
+ * This helper is currently supported by cgroup programs and only by the hooks
+ * where BPF program's return value is returned to the userspace via errno.
+ *
+ * Note that there is the following corner case where the program exports an error
+ * via bpf_set_retval but signals success via 'return 1':
+ *
+ * bpf_set_retval(-EPERM);
+ * return 1;
+ *
+ * In this case, the BPF program's return value will use helper's -EPERM. This
+ * still holds true for cgroup/bind{4,6} which supports extra 'return 3' success case.
*
- * This helper is currently supported by cgroup programs only.
* Return
* 0 on success, or a negative error in case of failure.
*
@@ -5331,6 +5386,55 @@ union bpf_attr {
* **-EACCES** if the SYN cookie is not valid.
*
* **-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
+ *
+ * u64 bpf_ktime_get_tai_ns(void)
+ * Description
+ * A nonsettable system-wide clock derived from wall-clock time but
+ * ignoring leap seconds. This clock does not experience
+ * discontinuities and backwards jumps caused by NTP inserting leap
+ * seconds as CLOCK_REALTIME does.
+ *
+ * See: **clock_gettime**\ (**CLOCK_TAI**)
+ * Return
+ * Current *ktime*.
+ *
+ * long bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx, u64 flags)
+ * Description
+ * Drain samples from the specified user ring buffer, and invoke
+ * the provided callback for each such sample:
+ *
+ * long (\*callback_fn)(struct bpf_dynptr \*dynptr, void \*ctx);
+ *
+ * If **callback_fn** returns 0, the helper will continue to try
+ * and drain the next sample, up to a maximum of
+ * BPF_MAX_USER_RINGBUF_SAMPLES samples. If the return value is 1,
+ * the helper will skip the rest of the samples and return. Other
+ * return values are not used now, and will be rejected by the
+ * verifier.
+ * Return
+ * The number of drained samples if no error was encountered while
+ * draining samples, or 0 if no samples were present in the ring
+ * buffer. If a user-space producer was epoll-waiting on this map,
+ * and at least one sample was drained, they will receive an event
+ * notification notifying them of available space in the ring
+ * buffer. If the BPF_RB_NO_WAKEUP flag is passed to this
+ * function, no wakeup notification will be sent. If the
+ * BPF_RB_FORCE_WAKEUP flag is passed, a wakeup notification will
+ * be sent even if no sample was drained.
+ *
+ * On failure, the returned value is one of the following:
+ *
+ * **-EBUSY** if the ring buffer is contended, and another calling
+ * context was concurrently draining the ring buffer.
+ *
+ * **-EINVAL** if user-space is not properly tracking the ring
+ * buffer due to the producer position not being aligned to 8
+ * bytes, a sample not being aligned to 8 bytes, or the producer
+ * position not matching the advertised length of a sample.
+ *
+ * **-E2BIG** if user-space has tried to publish a sample which is
+ * larger than the size of the ring buffer, or which cannot fit
+ * within a struct bpf_dynptr.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -5541,6 +5645,8 @@ union bpf_attr {
FN(tcp_raw_gen_syncookie_ipv6), \
FN(tcp_raw_check_syncookie_ipv4), \
FN(tcp_raw_check_syncookie_ipv6), \
+ FN(ktime_get_tai_ns), \
+ FN(user_ringbuf_drain), \
/* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5603,6 +5709,11 @@ enum {
BPF_F_SEQ_NUMBER = (1ULL << 3),
};
+/* BPF_FUNC_skb_get_tunnel_key flags. */
+enum {
+ BPF_F_TUNINFO_FLAGS = (1ULL << 4),
+};
+
/* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
* BPF_FUNC_perf_event_read_value flags.
*/
@@ -5792,7 +5903,10 @@ struct bpf_tunnel_key {
};
__u8 tunnel_tos;
__u8 tunnel_ttl;
- __u16 tunnel_ext; /* Padding, future use. */
+ union {
+ __u16 tunnel_ext; /* compat */
+ __be16 tunnel_flags;
+ };
__u32 tunnel_label;
union {
__u32 local_ipv4;
@@ -5836,6 +5950,11 @@ enum bpf_ret_code {
* represented by BPF_REDIRECT above).
*/
BPF_LWT_REROUTE = 128,
+ /* BPF_FLOW_DISSECTOR_CONTINUE: used by BPF_PROG_TYPE_FLOW_DISSECTOR
+ * to indicate that no custom dissection was performed, and
+ * fallback to standard dissector is requested.
+ */
+ BPF_FLOW_DISSECTOR_CONTINUE = 129,
};
struct bpf_sock {
@@ -6134,11 +6253,26 @@ struct bpf_link_info {
struct {
__aligned_u64 target_name; /* in/out: target_name buffer ptr */
__u32 target_name_len; /* in/out: target_name buffer len */
+
+ /* If the iter specific field is 32 bits, it can be put
+ * in the first or second union. Otherwise it should be
+ * put in the second union.
+ */
union {
struct {
__u32 map_id;
} map;
};
+ union {
+ struct {
+ __u64 cgroup_id;
+ __u32 order;
+ } cgroup;
+ struct {
+ __u32 tid;
+ __u32 pid;
+ } task;
+ };
} iter;
struct {
__u32 netns_ino;
diff --git a/tools/include/uapi/linux/tc_act/tc_bpf.h b/tools/include/uapi/linux/tc_act/tc_bpf.h
index 653c4f94f76e..fe6c8f8f3e8c 100644
--- a/tools/include/uapi/linux/tc_act/tc_bpf.h
+++ b/tools/include/uapi/linux/tc_act/tc_bpf.h
@@ -1,11 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
/*
* Copyright (c) 2015 Jiri Pirko <jiri@resnulli.us>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
*/
#ifndef __LINUX_TC_BPF_H