summaryrefslogtreecommitdiff
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/device_drivers/ethernet/amazon/ena.rst5
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst2
-rw-r--r--Documentation/networking/device_drivers/ethernet/index.rst1
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst40
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst3
-rw-r--r--Documentation/networking/device_drivers/ethernet/meta/fbnic.rst29
-rw-r--r--Documentation/networking/devlink/devlink-region.rst2
-rw-r--r--Documentation/networking/devlink/ice.rst25
-rw-r--r--Documentation/networking/devlink/octeontx2.rst16
-rw-r--r--Documentation/networking/devmem.rst269
-rw-r--r--Documentation/networking/ethtool-netlink.rst263
-rw-r--r--Documentation/networking/index.rst6
-rw-r--r--Documentation/networking/ip-sysctl.rst41
-rw-r--r--Documentation/networking/iso15765-2.rst386
-rw-r--r--Documentation/networking/l2tp.rst54
-rw-r--r--Documentation/networking/mptcp-sysctl.rst81
-rw-r--r--Documentation/networking/mptcp.rst156
-rw-r--r--Documentation/networking/multi-pf-netdev.rst10
-rw-r--r--Documentation/networking/napi.rst5
-rw-r--r--Documentation/networking/net_cachelines/net_device.rst11
-rw-r--r--Documentation/networking/net_dim.rst42
-rw-r--r--Documentation/networking/netdev-features.rst15
-rw-r--r--Documentation/networking/netdevices.rst4
-rw-r--r--Documentation/networking/oa-tc6-framework.rst497
-rw-r--r--Documentation/networking/phy-link-topology.rst121
-rw-r--r--Documentation/networking/phy.rst6
-rw-r--r--Documentation/networking/sriov.rst25
-rw-r--r--Documentation/networking/switchdev.rst4
-rw-r--r--Documentation/networking/tcp_ao.rst9
-rw-r--r--Documentation/networking/timestamping.rst20
-rw-r--r--Documentation/networking/tproxy.rst2
-rw-r--r--Documentation/networking/xsk-tx-metadata.rst16
32 files changed, 2024 insertions, 142 deletions
diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index a4c7d0c65fd7..4561e8ab9e08 100644
--- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -230,6 +230,11 @@ per-queue stats) from the device.
In addition the driver logs the stats to syslog upon device reset.
+On supported instance types, the statistics will also include the
+ENA Express data (fields prefixed with `ena_srd`). For a complete
+documentation of ENA Express data refer to
+https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-express.html#ena-express-monitor
+
MTU
===
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
index 199647729251..32ee827a3a2c 100644
--- a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
@@ -339,7 +339,7 @@ Key functions include:
a bind of the root DPRC to the DPRC driver
The binding for the MC-bus device-tree node can be consulted at
-*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt*.
+*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml*.
The sysfs bind/unbind interfaces for the MC-bus can be consulted at
*Documentation/ABI/testing/sysfs-bus-fsl-mc*.
diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
index 6932d8c043c2..6fc1961492b7 100644
--- a/Documentation/networking/device_drivers/ethernet/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -44,6 +44,7 @@ Contents:
marvell/octeon_ep
marvell/octeon_ep_vf
mellanox/mlx5/index
+ meta/fbnic
microsoft/netvsc
neterion/s2io
netronome/nfp
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
index fed821ef9b09..99d95be4d159 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -189,22 +189,19 @@ the software port.
* - `rx[i]_gro_packets`
- Number of received packets processed using hardware-accelerated GRO. The
- number of hardware GRO offloaded packets received on ring i.
+ number of hardware GRO offloaded packets received on ring i. Only true GRO
+ packets are counted: only packets that are in an SKB with a GRO count > 1.
- Acceleration
* - `rx[i]_gro_bytes`
- Number of received bytes processed using hardware-accelerated GRO. The
- number of hardware GRO offloaded bytes received on ring i.
+ number of hardware GRO offloaded bytes received on ring i. Only true GRO
+ packets are counted: only packets that are in an SKB with a GRO count > 1.
- Acceleration
* - `rx[i]_gro_skbs`
- - The number of receive SKBs constructed while performing
- hardware-accelerated GRO.
- - Informative
-
- * - `rx[i]_gro_match_packets`
- - Number of received packets processed using hardware-accelerated GRO that
- met the flow table match criteria.
+ - The number of GRO SKBs constructed from hardware-accelerated GRO. Only SKBs
+ with a GRO count > 1 are counted.
- Informative
* - `rx[i]_gro_large_hds`
@@ -212,6 +209,31 @@ the software port.
headers that require additional memory to be allocated.
- Informative
+ * - `rx[i]_hds_nodata_packets`
+ - Number of header only packets in header/data split mode [#accel]_.
+ - Informative
+
+ * - `rx[i]_hds_nodata_bytes`
+ - Number of bytes for header only packets in header/data split mode
+ [#accel]_.
+ - Informative
+
+ * - `rx[i]_hds_nosplit_packets`
+ - Number of packets that were not split in header/data split mode. A
+ packet will not get split when the hardware does not support its
+ protocol splitting. An example such a protocol is ICMPv4/v6. Currently
+ TCP and UDP with IPv4/IPv6 are supported for header/data split
+ [#accel]_.
+ - Informative
+
+ * - `rx[i]_hds_nosplit_bytes`
+ - Number of bytes for packets that were not split in header/data split
+ mode. A packet will not get split when the hardware does not support its
+ protocol splitting. An example such a protocol is ICMPv4/v6. Currently
+ TCP and UDP with IPv4/IPv6 are supported for header/data split
+ [#accel]_.
+ - Informative
+
* - `rx[i]_lro_packets`
- The number of LRO packets received on ring i [#accel]_.
- Acceleration
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
index 20d3b7e87049..34e911480108 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
@@ -130,6 +130,9 @@ Enabling the driver and kconfig options
| Build support for software-managed steering in the NIC.
+**CONFIG_MLX5_HW_STEERING=(y/n)**
+
+| Build support for hardware-managed steering in the NIC.
**CONFIG_MLX5_TC_CT=(y/n)**
diff --git a/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst
new file mode 100644
index 000000000000..32ff114f5c26
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=====================================
+Meta Platforms Host Network Interface
+=====================================
+
+Firmware Versions
+-----------------
+
+fbnic has three components stored on the flash which are provided in one PLDM
+image:
+
+1. fw - The control firmware used to view and modify firmware settings, request
+ firmware actions, and retrieve firmware counters outside of the data path.
+ This is the firmware which fbnic_fw.c interacts with.
+2. bootloader - The firmware which validate firmware security and control basic
+ operations including loading and updating the firmware. This is also known
+ as the cmrt firmware.
+3. undi - This is the UEFI driver which is based on the Linux driver.
+
+fbnic stores two copies of these three components on flash. This allows fbnic
+to fall back to an older version of firmware automatically in case firmware
+fails to boot. Version information for both is provided as running and stored.
+The undi is only provided in stored as it is not actively running once the Linux
+driver takes over.
+
+devlink dev info provides version information for all three components. In
+addition to the version the hg commit hash of the build is included as a
+separate entry.
diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst
index 9232cd7da301..5d0b68f752c0 100644
--- a/Documentation/networking/devlink/devlink-region.rst
+++ b/Documentation/networking/devlink/devlink-region.rst
@@ -49,7 +49,7 @@ example usage
$ devlink region show [ DEV/REGION ]
$ devlink region del DEV/REGION snapshot SNAPSHOT_ID
$ devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ]
- $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ] address ADDRESS length length
+ $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ] address ADDRESS length LENGTH
# Show all of the exposed regions with region sizes:
$ devlink region show
diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst
index 830c04354222..e3972d03cea0 100644
--- a/Documentation/networking/devlink/ice.rst
+++ b/Documentation/networking/devlink/ice.rst
@@ -11,6 +11,7 @@ Parameters
==========
.. list-table:: Generic parameters implemented
+ :widths: 5 5 90
* - Name
- Mode
@@ -68,6 +69,30 @@ Parameters
To verify that value has been set:
$ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers
+.. list-table:: Driver specific parameters implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Mode
+ - Description
+ * - ``local_forwarding``
+ - runtime
+ - Controls loopback behavior by tuning scheduler bandwidth.
+ It impacts all kinds of functions: physical, virtual and
+ subfunctions.
+ Supported values are:
+
+ ``enabled`` - loopback traffic is allowed on port
+
+ ``disabled`` - loopback traffic is not allowed on this port
+
+ ``prioritized`` - loopback traffic is prioritized on this port
+
+ Default value of ``local_forwarding`` parameter is ``enabled``.
+ ``prioritized`` provides ability to adjust loopback traffic rate to increase
+ one port capacity at cost of the another. User needs to disable
+ local forwarding on one of the ports in order have increased capacity
+ on the ``prioritized`` port.
Info versions
=============
diff --git a/Documentation/networking/devlink/octeontx2.rst b/Documentation/networking/devlink/octeontx2.rst
index 610de99b728a..d33a90dd44bf 100644
--- a/Documentation/networking/devlink/octeontx2.rst
+++ b/Documentation/networking/devlink/octeontx2.rst
@@ -40,3 +40,19 @@ The ``octeontx2 AF`` driver implements the following driver-specific parameters.
- runtime
- Use to set the quantum which hardware uses for scheduling among transmit queues.
Hardware uses weighted DWRR algorithm to schedule among all transmit queues.
+
+The ``octeontx2 PF`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``unicast_filter_count``
+ - u8
+ - runtime
+ - Set the maximum number of unicast filters that can be programmed for
+ the device. This can be used to achieve better device resource
+ utilization, avoiding over consumption of unused MCAM table entries.
diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
new file mode 100644
index 000000000000..a55bf21f671c
--- /dev/null
+++ b/Documentation/networking/devmem.rst
@@ -0,0 +1,269 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Device Memory TCP
+=================
+
+
+Intro
+=====
+
+Device memory TCP (devmem TCP) enables receiving data directly into device
+memory (dmabuf). The feature is currently implemented for TCP sockets.
+
+
+Opportunity
+-----------
+
+A large number of data transfers have device memory as the source and/or
+destination. Accelerators drastically increased the prevalence of such
+transfers. Some examples include:
+
+- Distributed training, where ML accelerators, such as GPUs on different hosts,
+ exchange data.
+
+- Distributed raw block storage applications transfer large amounts of data with
+ remote SSDs. Much of this data does not require host processing.
+
+Typically the Device-to-Device data transfers in the network are implemented as
+the following low-level operations: Device-to-Host copy, Host-to-Host network
+transfer, and Host-to-Device copy.
+
+The flow involving host copies is suboptimal, especially for bulk data transfers,
+and can put significant strains on system resources such as host memory
+bandwidth and PCIe bandwidth.
+
+Devmem TCP optimizes this use case by implementing socket APIs that enable
+the user to receive incoming network packets directly into device memory.
+
+Packet payloads go directly from the NIC to device memory.
+
+Packet headers go to host memory and are processed by the TCP/IP stack
+normally. The NIC must support header split to achieve this.
+
+Advantages:
+
+- Alleviate host memory bandwidth pressure, compared to existing
+ network-transfer + device-copy semantics.
+
+- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
+ level of the PCIe tree, compared to the traditional path which sends data
+ through the root complex.
+
+
+More Info
+---------
+
+ slides, video
+ https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html
+
+ patchset
+ [PATCH net-next v24 00/13] Device Memory TCP
+ https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
+
+
+Interface
+=========
+
+
+Example
+-------
+
+tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
+the RX path of this API.
+
+
+NIC Setup
+---------
+
+Header split, flow steering, & RSS are required features for devmem TCP.
+
+Header split is used to split incoming packets into a header buffer in host
+memory, and a payload buffer in device memory.
+
+Flow steering & RSS are used to ensure that only flows targeting devmem land on
+an RX queue bound to devmem.
+
+Enable header split & flow steering::
+
+ # enable header split
+ ethtool -G eth1 tcp-data-split on
+
+
+ # enable flow steering
+ ethtool -K eth1 ntuple on
+
+Configure RSS to steer all traffic away from the target RX queue (queue 15 in
+this example)::
+
+ ethtool --set-rxfh-indir eth1 equal 15
+
+
+The user must bind a dmabuf to any number of RX queues on a given NIC using
+the netlink API::
+
+ /* Bind dmabuf to NIC RX queue 15 */
+ struct netdev_queue *queues;
+ queues = malloc(sizeof(*queues) * 1);
+
+ queues[0]._present.type = 1;
+ queues[0]._present.idx = 1;
+ queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
+ queues[0].idx = 15;
+
+ *ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+
+ req = netdev_bind_rx_req_alloc();
+ netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
+ netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
+ __netdev_bind_rx_req_set_queues(req, queues, n_queue_index);
+
+ rsp = netdev_bind_rx(*ys, req);
+
+ dmabuf_id = rsp->dmabuf_id;
+
+
+The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
+that has been bound.
+
+The user can unbind the dmabuf from the netdevice by closing the netlink socket
+that established the binding. We do this so that the binding is automatically
+unbound even if the userspace process crashes.
+
+Note that any reasonably well-behaved dmabuf from any exporter should work with
+devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
+this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.
+
+
+Socket Setup
+------------
+
+The socket must be flow steered to the dmabuf bound RX queue::
+
+ ethtool -N eth1 flow-type tcp4 ... queue 15
+
+
+Receiving data
+--------------
+
+The user application must signal to the kernel that it is capable of receiving
+devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::
+
+ ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);
+
+Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
+on devmem data.
+
+Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
+Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs::
+
+ for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+ if (cm->cmsg_level != SOL_SOCKET ||
+ (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
+ cm->cmsg_type != SCM_DEVMEM_LINEAR))
+ continue;
+
+ dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);
+
+ if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
+ /* Frag landed in dmabuf.
+ *
+ * dmabuf_cmsg->dmabuf_id is the dmabuf the
+ * frag landed on.
+ *
+ * dmabuf_cmsg->frag_offset is the offset into
+ * the dmabuf where the frag starts.
+ *
+ * dmabuf_cmsg->frag_size is the size of the
+ * frag.
+ *
+ * dmabuf_cmsg->frag_token is a token used to
+ * refer to this frag for later freeing.
+ */
+
+ struct dmabuf_token token;
+ token.token_start = dmabuf_cmsg->frag_token;
+ token.token_count = 1;
+ continue;
+ }
+
+ if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
+ /* Frag landed in linear buffer.
+ *
+ * dmabuf_cmsg->frag_size is the size of the
+ * frag.
+ */
+ continue;
+
+ }
+
+Applications may receive 2 cmsgs:
+
+- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
+ by dmabuf_id.
+
+- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
+ This typically happens when the NIC is unable to split the packet at the
+ header boundary, such that part (or all) of the payload landed in host
+ memory.
+
+Applications may receive no SO_DEVMEM_* cmsgs. That indicates non-devmem,
+regular TCP data that landed on an RX queue not bound to a dmabuf.
+
+
+Freeing frags
+-------------
+
+Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
+processes the frag. The user must return the frag to the kernel via
+SO_DEVMEM_DONTNEED::
+
+ ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
+ sizeof(token));
+
+The user must ensure the tokens are returned to the kernel in a timely manner.
+Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
+and will lead to packet drops.
+
+
+Implementation & Caveats
+========================
+
+Unreadable skbs
+---------------
+
+Devmem payloads are inaccessible to the kernel processing the packets. This
+results in a few quirks for payloads of devmem skbs:
+
+- Loopback is not functional. Loopback relies on copying the payload, which is
+ not possible with devmem skbs.
+
+- Software checksum calculation fails.
+
+- TCP Dump and bpf can't access devmem packet payloads.
+
+
+Testing
+=======
+
+More realistic example code can be found in the kernel source under
+``tools/testing/selftests/net/ncdevmem.c``
+
+ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
+receives data directly into a udmabuf.
+
+To run ncdevmem, you need to run it on a server on the machine under test, and
+you need to run netcat on a peer to provide the TX data.
+
+ncdevmem has a validation mode as well that expects a repeating pattern of
+incoming data and validates it as such. For example, you can launch
+ncdevmem on the server by::
+
+ ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
+ -p 5201 -v 7
+
+On client side, use regular netcat to send TX data to ncdevmem process
+on the server::
+
+ yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
+ tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 160bfb0ae8ba..295563e91082 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -57,6 +57,7 @@ Structure of this header is
``ETHTOOL_A_HEADER_DEV_INDEX`` u32 device ifindex
``ETHTOOL_A_HEADER_DEV_NAME`` string device name
``ETHTOOL_A_HEADER_FLAGS`` u32 flags common for all requests
+ ``ETHTOOL_A_HEADER_PHY_INDEX`` u32 phy device index
============================== ====== =============================
``ETHTOOL_A_HEADER_DEV_INDEX`` and ``ETHTOOL_A_HEADER_DEV_NAME`` identify the
@@ -81,6 +82,12 @@ the behaviour is backward compatible, i.e. requests from old clients not aware
of the flag should be interpreted the way the client expects. A client must
not set flags it does not understand.
+``ETHTOOL_A_HEADER_PHY_INDEX`` identifies the Ethernet PHY the message relates to.
+As there are numerous commands that are related to PHY configuration, and because
+there may be more than one PHY on the link, the PHY index can be passed in the
+request for the commands that needs it. It is, however, not mandatory, and if it
+is not passed for commands that target a PHY, the net_device.phydev pointer
+is used.
Bit sets
========
@@ -228,6 +235,7 @@ Userspace to kernel:
``ETHTOOL_MSG_PLCA_GET_STATUS`` get PLCA RS status
``ETHTOOL_MSG_MM_GET`` get MAC merge layer state
``ETHTOOL_MSG_MM_SET`` set MAC merge layer parameters
+ ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT`` flash transceiver module firmware
===================================== =================================
Kernel to userspace:
@@ -274,6 +282,7 @@ Kernel to userspace:
``ETHTOOL_MSG_PLCA_GET_STATUS_REPLY`` PLCA RS status
``ETHTOOL_MSG_PLCA_NTF`` PLCA RS parameters
``ETHTOOL_MSG_MM_GET_REPLY`` MAC merge layer status
+ ``ETHTOOL_MSG_MODULE_FW_FLASH_NTF`` transceiver module flash updates
======================================== =================================
``GET`` requests are sent by userspace applications to retrieve device
@@ -932,18 +941,18 @@ Request contents:
==================================== ====== ===========================
Kernel checks that requested ring sizes do not exceed limits reported by
-driver. Driver may impose additional constraints and may not suspport all
+driver. Driver may impose additional constraints and may not support all
attributes.
``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
-Completion queue events(CQE) are the events posted by NIC to indicate the
-completion status of a packet when the packet is sent(like send success or
-error) or received(like pointers to packet fragments). The CQE size parameter
+Completion queue events (CQE) are the events posted by NIC to indicate the
+completion status of a packet when the packet is sent (like send success or
+error) or received (like pointers to packet fragments). The CQE size parameter
enables to modify the CQE size other than default size if NIC supports it.
-A bigger CQE can have more receive buffer pointers inturn NIC can transfer
-a bigger frame from wire. Based on the NIC hardware, the overall completion
-queue size can be adjusted in the driver if CQE size is modified.
+A bigger CQE can have more receive buffer pointers, and in turn the NIC can
+transfer a bigger frame from wire. Based on the NIC hardware, the overall
+completion queue size can be adjusted in the driver if CQE size is modified.
CHANNELS_GET
============
@@ -987,7 +996,7 @@ Request contents:
===================================== ====== ==========================
Kernel checks that requested channel counts do not exceed limits reported by
-driver. Driver may impose additional constraints and may not suspport all
+driver. Driver may impose additional constraints and may not support all
attributes.
@@ -1033,6 +1042,8 @@ Kernel response contents:
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
+ ``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx
+ ``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx
=========================================== ====== =======================
Attributes are only included in reply if their value is not zero or the
@@ -1062,6 +1073,10 @@ block should be sent.
This feature is mainly of interest for specific USB devices which does not cope
well with frequent small-sized URBs transmissions.
+``ETHTOOL_A_COALESCE_RX_PROFILE`` and ``ETHTOOL_A_COALESCE_TX_PROFILE`` refer
+to DIM parameters, see `Generic Network Dynamic Interrupt Moderation (Net DIM)
+<https://www.kernel.org/doc/Documentation/networking/net_dim.rst>`_.
+
COALESCE_SET
============
@@ -1098,6 +1113,8 @@ Request contents:
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
+ ``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx
+ ``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx
=========================================== ====== =======================
Request is rejected if it attributes declared as unsupported by driver (i.e.
@@ -1297,12 +1314,17 @@ information.
+-+-+-----------------------------------------+--------+---------------------+
| | | ``ETHTOOL_A_CABLE_RESULTS_CODE`` | u8 | result code |
+-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULT_SRC`` | u32 | information source |
+ +-+-+-----------------------------------------+--------+---------------------+
| | ``ETHTOOL_A_CABLE_NEST_FAULT_LENGTH`` | nested | cable length |
+-+-+-----------------------------------------+--------+---------------------+
| | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_PAIR`` | u8 | pair number |
+-+-+-----------------------------------------+--------+---------------------+
| | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_CM`` | u32 | length in cm |
+-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_SRC`` | u32 | information source |
+ +-+-+-----------------------------------------+--------+---------------------+
+
CABLE_TEST TDR
==============
@@ -1720,22 +1742,33 @@ Request contents:
Kernel response contents:
- ====================================== ====== =============================
- ``ETHTOOL_A_PSE_HEADER`` nested reply header
- ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL
- PSE functions
- ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the
- PoDL PSE.
- ``ETHTOOL_A_C33_PSE_ADMIN_STATE`` u32 Operational state of the PoE
- PSE functions.
- ``ETHTOOL_A_C33_PSE_PW_D_STATUS`` u32 power detection status of the
- PoE PSE.
- ====================================== ====== =============================
+ ========================================== ====== =============================
+ ``ETHTOOL_A_PSE_HEADER`` nested reply header
+ ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL
+ PSE functions
+ ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the
+ PoDL PSE.
+ ``ETHTOOL_A_C33_PSE_ADMIN_STATE`` u32 Operational state of the PoE
+ PSE functions.
+ ``ETHTOOL_A_C33_PSE_PW_D_STATUS`` u32 power detection status of the
+ PoE PSE.
+ ``ETHTOOL_A_C33_PSE_PW_CLASS`` u32 power class of the PoE PSE.
+ ``ETHTOOL_A_C33_PSE_ACTUAL_PW`` u32 actual power drawn on the
+ PoE PSE.
+ ``ETHTOOL_A_C33_PSE_EXT_STATE`` u32 power extended state of the
+ PoE PSE.
+ ``ETHTOOL_A_C33_PSE_EXT_SUBSTATE`` u32 power extended substatus of
+ the PoE PSE.
+ ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT`` u32 currently configured power
+ limit of the PoE PSE.
+ ``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES`` nested Supported power limit
+ configuration ranges.
+ ========================================== ====== =============================
When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` attribute identifies
the operational state of the PoDL PSE functions. The operational state of the
PSE function can be changed using the ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL``
-action. This option is corresponding to ``IEEE 802.3-2018`` 30.15.1.1.2
+action. This attribute corresponds to ``IEEE 802.3-2018`` 30.15.1.1.2
aPoDLPSEAdminState. Possible values are:
.. kernel-doc:: include/uapi/linux/ethtool.h
@@ -1749,8 +1782,8 @@ The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_STATE`` implementing
When set, the optional ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` attribute identifies
the power detection status of the PoDL PSE. The status depend on internal PSE
-state machine and automatic PD classification support. This option is
-corresponding to ``IEEE 802.3-2018`` 30.15.1.1.3 aPoDLPSEPowerDetectionStatus.
+state machine and automatic PD classification support. This attribute
+corresponds to ``IEEE 802.3-2018`` 30.15.1.1.3 aPoDLPSEPowerDetectionStatus.
Possible values are:
.. kernel-doc:: include/uapi/linux/ethtool.h
@@ -1762,6 +1795,47 @@ The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_PW_D_STATUS`` implementing
.. kernel-doc:: include/uapi/linux/ethtool.h
:identifiers: ethtool_c33_pse_pw_d_status
+When set, the optional ``ETHTOOL_A_C33_PSE_PW_CLASS`` attribute identifies
+the power class of the C33 PSE. It depends on the class negotiated between
+the PSE and the PD. This attribute corresponds to ``IEEE 802.3-2022``
+30.9.1.1.8 aPSEPowerClassification.
+
+When set, the optional ``ETHTOOL_A_C33_PSE_ACTUAL_PW`` attribute identifies
+the actual power drawn by the C33 PSE. This attribute corresponds to
+``IEEE 802.3-2022`` 30.9.1.1.23 aPSEActualPower. Actual power is reported
+in mW.
+
+When set, the optional ``ETHTOOL_A_C33_PSE_EXT_STATE`` attribute identifies
+the extended error state of the C33 PSE. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_c33_pse_ext_state
+
+When set, the optional ``ETHTOOL_A_C33_PSE_EXT_SUBSTATE`` attribute identifies
+the extended error state of the C33 PSE. Possible values are:
+Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_c33_pse_ext_substate_class_num_events
+ ethtool_c33_pse_ext_substate_error_condition
+ ethtool_c33_pse_ext_substate_mr_pse_enable
+ ethtool_c33_pse_ext_substate_option_detect_ted
+ ethtool_c33_pse_ext_substate_option_vport_lim
+ ethtool_c33_pse_ext_substate_ovld_detected
+ ethtool_c33_pse_ext_substate_pd_dll_power_type
+ ethtool_c33_pse_ext_substate_power_not_available
+ ethtool_c33_pse_ext_substate_short_detected
+
+When set, the optional ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT`` attribute
+identifies the C33 PSE power limit in mW.
+
+When set the optional ``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES`` nested attribute
+identifies the C33 PSE power limit ranges through
+``ETHTOOL_A_C33_PSE_PWR_VAL_LIMIT_RANGE_MIN`` and
+``ETHTOOL_A_C33_PSE_PWR_VAL_LIMIT_RANGE_MAX``.
+If the controller works with fixed classes, the min and max values will be
+equal.
+
PSE_SET
=======
@@ -1773,16 +1847,30 @@ Request contents:
``ETHTOOL_A_PSE_HEADER`` nested request header
``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` u32 Control PoDL PSE Admin state
``ETHTOOL_A_C33_PSE_ADMIN_CONTROL`` u32 Control PSE Admin state
+ ``ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`` u32 Control PoE PSE available
+ power limit
====================================== ====== =============================
When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` attribute is used
-to control PoDL PSE Admin functions. This option is implementing
+to control PoDL PSE Admin functions. This option implements
``IEEE 802.3-2018`` 30.15.1.2.1 acPoDLPSEAdminControl. See
``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` for supported values.
The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_CONTROL`` implementing
``IEEE 802.3-2022`` 30.9.1.2.1 acPSEAdminControl.
+When set, the optional ``ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`` attribute is
+used to control the available power value limit for C33 PSE in milliwatts.
+This attribute corresponds to the `pse_available_power` variable described in
+``IEEE 802.3-2022`` 33.2.4.4 Variables and `pse_avail_pwr` in 145.2.5.4
+Variables, which are described in power classes.
+
+It was decided to use milliwatts for this interface to unify it with other
+power monitoring interfaces, which also use milliwatts, and to align with
+various existing products that document power consumption in watts rather than
+classes. If power limit configuration based on classes is needed, the
+conversion can be done in user space, for example by ethtool.
+
RSS_GET
=======
@@ -1791,15 +1879,24 @@ RSS context of an interface similar to ``ETHTOOL_GRSSH`` ioctl request.
Request contents:
-===================================== ====== ==========================
+===================================== ====== ============================
``ETHTOOL_A_RSS_HEADER`` nested request header
``ETHTOOL_A_RSS_CONTEXT`` u32 context number
-===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_START_CONTEXT`` u32 start context number (dumps)
+===================================== ====== ============================
+
+``ETHTOOL_A_RSS_CONTEXT`` specifies which RSS context number to query,
+if not set context 0 (the main context) is queried. Dumps can be filtered
+by device (only listing contexts of a given netdev). Filtering single
+context number is not supported but ``ETHTOOL_A_RSS_START_CONTEXT``
+can be used to start dumping context from the given number (primarily
+used to ignore context 0s and only dump additional contexts).
Kernel response contents:
===================================== ====== ==========================
``ETHTOOL_A_RSS_HEADER`` nested reply header
+ ``ETHTOOL_A_RSS_CONTEXT`` u32 context number
``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func
``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes
``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes
@@ -1851,7 +1948,7 @@ When set, the optional ``ETHTOOL_A_PLCA_VERSION`` attribute indicates which
standard and version the PLCA management interface complies to. When not set,
the interface is vendor-specific and (possibly) supplied by the driver.
The OPEN Alliance SIG specifies a standard register map for 10BASE-T1S PHYs
-embedding the PLCA Reconcialiation Sublayer. See "10BASE-T1S PLCA Management
+embedding the PLCA Reconciliation Sublayer. See "10BASE-T1S PLCA Management
Registers" at https://www.opensig.org/about/specifications/.
When set, the optional ``ETHTOOL_A_PLCA_ENABLED`` attribute indicates the
@@ -1913,7 +2010,7 @@ Request contents:
``ETHTOOL_A_PLCA_ENABLED`` u8 PLCA Admin State
``ETHTOOL_A_PLCA_NODE_ID`` u8 PLCA unique local node ID
``ETHTOOL_A_PLCA_NODE_CNT`` u8 Number of PLCA nodes on the
- netkork, including the
+ network, including the
coordinator
``ETHTOOL_A_PLCA_TO_TMR`` u8 Transmit Opportunity Timer
value in bit-times (BT)
@@ -2033,6 +2130,116 @@ The attributes are propagated to the driver through the following structure:
.. kernel-doc:: include/linux/ethtool.h
:identifiers: ethtool_mm_cfg
+MODULE_FW_FLASH_ACT
+===================
+
+Flashes transceiver module firmware.
+
+Request contents:
+
+ ======================================= ====== ===========================
+ ``ETHTOOL_A_MODULE_FW_FLASH_HEADER`` nested request header
+ ``ETHTOOL_A_MODULE_FW_FLASH_FILE_NAME`` string firmware image file name
+ ``ETHTOOL_A_MODULE_FW_FLASH_PASSWORD`` u32 transceiver module password
+ ======================================= ====== ===========================
+
+The firmware update process consists of three logical steps:
+
+1. Downloading a firmware image to the transceiver module and validating it.
+2. Running the firmware image.
+3. Committing the firmware image so that it is run upon reset.
+
+When flash command is given, those three steps are taken in that order.
+
+This message merely schedules the update process and returns immediately
+without blocking. The process then runs asynchronously.
+Since it can take several minutes to complete, during the update process
+notifications are emitted from the kernel to user space updating it about
+the status and progress.
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_FILE_NAME`` attribute encodes the firmware
+image file name. The firmware image is downloaded to the transceiver module,
+validated, run and committed.
+
+The optional ``ETHTOOL_A_MODULE_FW_FLASH_PASSWORD`` attribute encodes a password
+that might be required as part of the transceiver module firmware update
+process.
+
+The firmware update process can take several minutes to complete. Therefore,
+during the update process notifications are emitted from the kernel to user
+space updating it about the status and progress.
+
+
+
+Notification contents:
+
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_HEADER`` | nested | reply header |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_STATUS`` | u32 | status |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_STATUS_MSG`` | string | status message |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_DONE`` | uint | progress |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_TOTAL`` | uint | total |
+ +---------------------------------------------------+--------+----------------+
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_STATUS`` attribute encodes the current status
+of the firmware update process. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_module_fw_flash_status
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_STATUS_MSG`` attribute encodes a status message
+string.
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_DONE`` and ``ETHTOOL_A_MODULE_FW_FLASH_TOTAL``
+attributes encode the completed and total amount of work, respectively.
+
+PHY_GET
+=======
+
+Retrieve information about a given Ethernet PHY sitting on the link. The DO
+operation returns all available information about dev->phydev. User can also
+specify a PHY_INDEX, in which case the DO request returns information about that
+specific PHY.
+
+As there can be more than one PHY, the DUMP operation can be used to list the PHYs
+present on a given interface, by passing an interface index or name in
+the dump request.
+
+For more information, refer to :ref:`phy_link_topology`
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PHY_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ===============================
+ ``ETHTOOL_A_PHY_HEADER`` nested request header
+ ``ETHTOOL_A_PHY_INDEX`` u32 the phy's unique index, that can
+ be used for phy-specific
+ requests
+ ``ETHTOOL_A_PHY_DRVNAME`` string the phy driver name
+ ``ETHTOOL_A_PHY_NAME`` string the phy device name
+ ``ETHTOOL_A_PHY_UPSTREAM_TYPE`` u32 the type of device this phy is
+ connected to
+ ``ETHTOOL_A_PHY_UPSTREAM_INDEX`` u32 the PHY index of the upstream
+ PHY
+ ``ETHTOOL_A_PHY_UPSTREAM_SFP_NAME`` string if this PHY is connected to
+ its parent PHY through an SFP
+ bus, the name of this sfp bus
+ ``ETHTOOL_A_PHY_DOWNSTREAM_SFP_NAME`` string if the phy controls an sfp bus,
+ the name of the sfp bus
+ ===================================== ====== ===============================
+
+When ``ETHTOOL_A_PHY_UPSTREAM_TYPE`` is PHY_UPSTREAM_PHY, the PHY's parent is
+another PHY.
+
Request translation
===================
@@ -2139,4 +2346,6 @@ are netlink only.
n/a ``ETHTOOL_MSG_PLCA_GET_STATUS``
n/a ``ETHTOOL_MSG_MM_GET``
n/a ``ETHTOOL_MSG_MM_SET``
+ n/a ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT``
+ n/a ``ETHTOOL_MSG_PHY_GET``
=================================== =====================================
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 7664c0bfe461..803dfc1efb75 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -19,6 +19,7 @@ Contents:
caif/index
ethtool-netlink
ieee802154
+ iso15765-2
j1939
kapi
msg_zerocopy
@@ -48,6 +49,7 @@ Contents:
cdc_mbim
dccp
dctcp
+ devmem
dns_resolver
driver
eql
@@ -72,6 +74,7 @@ Contents:
mac80211-injection
mctp
mpls-sysctl
+ mptcp
mptcp-sysctl
multiqueue
multi-pf-netdev
@@ -85,10 +88,12 @@ Contents:
nexthop-group-resilient
nf_conntrack-sysctl
nf_flowtable
+ oa-tc6-framework
openvswitch
operstates
packet_mmap
phonet
+ phy-link-topology
pktgen
plip
ppp_generic
@@ -104,6 +109,7 @@ Contents:
seg6-sysctl
skbuff
smc-sysctl
+ sriov
statistics
strparser
switchdev
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index bd50df6a5a42..eacf8983e230 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -131,6 +131,20 @@ fib_multipath_hash_fields - UNSIGNED INTEGER
Default: 0x0007 (source IP, destination IP and IP protocol)
+fib_multipath_hash_seed - UNSIGNED INTEGER
+ The seed value used when calculating hash for multipath routes. Applies
+ to both IPv4 and IPv6 datapath. Only present for kernels built with
+ CONFIG_IP_ROUTE_MULTIPATH enabled.
+
+ When set to 0, the seed value used for multipath routing defaults to an
+ internal random-generated one.
+
+ The actual hashing algorithm is not specified -- there is no guarantee
+ that a next hop distribution effected by a given seed will keep stable
+ across kernel versions.
+
+ Default: 0 (random)
+
fib_sync_mem - UNSIGNED INTEGER
Amount of dirty memory from fib entries that can be backlogged before
synchronize_rcu is forced.
@@ -1196,6 +1210,19 @@ tcp_pingpong_thresh - INTEGER
Default: 1
+tcp_rto_min_us - INTEGER
+ Minimal TCP retransmission timeout (in microseconds). Note that the
+ rto_min route option has the highest precedence for configuring this
+ setting, followed by the TCP_BPF_RTO_MIN socket option, followed by
+ this tcp_rto_min_us sysctl.
+
+ The recommended practice is to use a value less or equal to 200000
+ microseconds.
+
+ Possible Values: 1 - INT_MAX
+
+ Default: 200000
+
UDP variables
=============
@@ -2335,6 +2362,20 @@ ra_honor_pio_life - BOOLEAN
Default: 0 (disabled)
+ra_honor_pio_pflag - BOOLEAN
+ The Prefix Information Option P-flag indicates the network can
+ allocate a unique IPv6 prefix per client using DHCPv6-PD.
+ This sysctl can be enabled when a userspace DHCPv6-PD client
+ is running to cause the P-flag to take effect: i.e. the
+ P-flag suppresses any effects of the A-flag within the same
+ PIO. For a given PIO, P=1 and A=1 is treated as A=0.
+
+ - If disabled, the P-flag is ignored.
+ - If enabled, the P-flag will disable SLAAC autoconfiguration
+ for the given Prefix Information Option.
+
+ Default: 0 (disabled)
+
accept_ra_rt_info_min_plen - INTEGER
Minimum prefix length of Route Information in RA.
diff --git a/Documentation/networking/iso15765-2.rst b/Documentation/networking/iso15765-2.rst
new file mode 100644
index 000000000000..0e9d96074178
--- /dev/null
+++ b/Documentation/networking/iso15765-2.rst
@@ -0,0 +1,386 @@
+.. SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+====================
+ISO 15765-2 (ISO-TP)
+====================
+
+Overview
+========
+
+ISO 15765-2, also known as ISO-TP, is a transport protocol specifically defined
+for diagnostic communication on CAN. It is widely used in the automotive
+industry, for example as the transport protocol for UDSonCAN (ISO 14229-3) or
+emission-related diagnostic services (ISO 15031-5).
+
+ISO-TP can be used both on CAN CC (aka Classical CAN) and CAN FD (CAN with
+Flexible Datarate) based networks. It is also designed to be compatible with a
+CAN network using SAE J1939 as data link layer (however, this is not a
+requirement).
+
+Specifications used
+-------------------
+
+* ISO 15765-2:2024 : Road vehicles - Diagnostic communication over Controller
+ Area Network (DoCAN). Part 2: Transport protocol and network layer services.
+
+Addressing
+----------
+
+In its simplest form, ISO-TP is based on two kinds of addressing modes for the
+nodes connected to the same network:
+
+* physical addressing is implemented by two node-specific addresses and is used
+ in 1-to-1 communication.
+
+* functional addressing is implemented by one node-specific address and is used
+ in 1-to-N communication.
+
+Three different addressing formats can be employed:
+
+* "normal" : each address is represented simply by a CAN ID.
+
+* "extended": each address is represented by a CAN ID plus the first byte of
+ the CAN payload; both the CAN ID and the byte inside the payload shall be
+ different between two addresses.
+
+* "mixed": each address is represented by a CAN ID plus the first byte of
+ the CAN payload; the CAN ID is different between two addresses, but the
+ additional byte is the same.
+
+Transport protocol and associated frame types
+---------------------------------------------
+
+When transmitting data using the ISO-TP protocol, the payload can either fit
+inside one single CAN message or not, also considering the overhead the protocol
+is generating and the optional extended addressing. In the first case, the data
+is transmitted at once using a so-called Single Frame (SF). In the second case,
+ISO-TP defines a multi-frame protocol, in which the sender provides (through a
+First Frame - FF) the PDU length which is to be transmitted and also asks for a
+Flow Control (FC) frame, which provides the maximum supported size of a macro
+data block (``blocksize``) and the minimum time between the single CAN messages
+composing such block (``stmin``). Once this information has been received, the
+sender starts to send frames containing fragments of the data payload (called
+Consecutive Frames - CF), stopping after every ``blocksize``-sized block to wait
+confirmation from the receiver which should then send another Flow Control
+frame to inform the sender about its availability to receive more data.
+
+How to Use ISO-TP
+=================
+
+As with others CAN protocols, the ISO-TP stack support is built into the
+Linux network subsystem for the CAN bus, aka. Linux-CAN or SocketCAN, and
+thus follows the same socket API.
+
+Creation and basic usage of an ISO-TP socket
+--------------------------------------------
+
+To use the ISO-TP stack, ``#include <linux/can/isotp.h>`` shall be used. A
+socket can then be created using the ``PF_CAN`` protocol family, the
+``SOCK_DGRAM`` type (as the underlying protocol is datagram-based by design)
+and the ``CAN_ISOTP`` protocol:
+
+.. code-block:: C
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_ISOTP);
+
+After the socket has been successfully created, ``bind(2)`` shall be called to
+bind the socket to the desired CAN interface; to do so:
+
+* a TX CAN ID shall be specified as part of the sockaddr supplied to the call
+ itself.
+
+* a RX CAN ID shall also be specified, unless broadcast flags have been set
+ through socket option (explained below).
+
+Once bound to an interface, the socket can be read from and written to using
+the usual ``read(2)`` and ``write(2)`` system calls, as well as ``send(2)``,
+``sendmsg(2)``, ``recv(2)`` and ``recvmsg(2)``.
+Unlike the CAN_RAW socket API, only the ISO-TP data field (the actual payload)
+is sent and received by the userspace application using these calls. The address
+information and the protocol information are automatically filled by the ISO-TP
+stack using the configuration supplied during socket creation. In the same way,
+the stack will use the transport mechanism when required (i.e., when the size
+of the data payload exceeds the MTU of the underlying CAN bus).
+
+The sockaddr structure used for SocketCAN has extensions for use with ISO-TP,
+as specified below:
+
+.. code-block:: C
+
+ struct sockaddr_can {
+ sa_family_t can_family;
+ int can_ifindex;
+ union {
+ struct { canid_t rx_id, tx_id; } tp;
+ ...
+ } can_addr;
+ }
+
+* ``can_family`` and ``can_ifindex`` serve the same purpose as for other
+ SocketCAN sockets.
+
+* ``can_addr.tp.rx_id`` specifies the receive (RX) CAN ID and will be used as
+ a RX filter.
+
+* ``can_addr.tp.tx_id`` specifies the transmit (TX) CAN ID
+
+ISO-TP socket options
+---------------------
+
+When creating an ISO-TP socket, reasonable defaults are set. Some options can
+be modified with ``setsockopt(2)`` and/or read back with ``getsockopt(2)``.
+
+General options
+~~~~~~~~~~~~~~~
+
+General socket options can be passed using the ``CAN_ISOTP_OPTS`` optname:
+
+.. code-block:: C
+
+ struct can_isotp_options opts;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_OPTS, &opts, sizeof(opts))
+
+where the ``can_isotp_options`` structure has the following contents:
+
+.. code-block:: C
+
+ struct can_isotp_options {
+ u32 flags;
+ u32 frame_txtime;
+ u8 ext_address;
+ u8 txpad_content;
+ u8 rxpad_content;
+ u8 rx_ext_address;
+ };
+
+* ``flags``: modifiers to be applied to the default behaviour of the ISO-TP
+ stack. Following flags are available:
+
+ * ``CAN_ISOTP_LISTEN_MODE``: listen only (do not send FC frames); normally
+ used as a testing feature.
+
+ * ``CAN_ISOTP_EXTEND_ADDR``: use the byte specified in ``ext_address`` as an
+ additional address component. This enables the "mixed" addressing format if
+ used alone, or the "extended" addressing format if used in conjunction with
+ ``CAN_ISOTP_RX_EXT_ADDR``.
+
+ * ``CAN_ISOTP_TX_PADDING``: enable padding for transmitted frames, using
+ ``txpad_content`` as value for the padding bytes.
+
+ * ``CAN_ISOTP_RX_PADDING``: enable padding for the received frames, using
+ ``rxpad_content`` as value for the padding bytes.
+
+ * ``CAN_ISOTP_CHK_PAD_LEN``: check for correct padding length on the received
+ frames.
+
+ * ``CAN_ISOTP_CHK_PAD_DATA``: check padding bytes on the received frames
+ against ``rxpad_content``; if ``CAN_ISOTP_RX_PADDING`` is not specified,
+ this flag is ignored.
+
+ * ``CAN_ISOTP_HALF_DUPLEX``: force ISO-TP socket in half duplex mode
+ (that is, transport mechanism can only be incoming or outgoing at the same
+ time, not both).
+
+ * ``CAN_ISOTP_FORCE_TXSTMIN``: ignore stmin from received FC; normally
+ used as a testing feature.
+
+ * ``CAN_ISOTP_FORCE_RXSTMIN``: ignore CFs depending on rx stmin; normally
+ used as a testing feature.
+
+ * ``CAN_ISOTP_RX_EXT_ADDR``: use ``rx_ext_address`` instead of ``ext_address``
+ as extended addressing byte on the reception path. If used in conjunction
+ with ``CAN_ISOTP_EXTEND_ADDR``, this flag effectively enables the "extended"
+ addressing format.
+
+ * ``CAN_ISOTP_WAIT_TX_DONE``: wait until the frame is sent before returning
+ from ``write(2)`` and ``send(2)`` calls (i.e., blocking write operations).
+
+ * ``CAN_ISOTP_SF_BROADCAST``: use 1-to-N functional addressing (cannot be
+ specified alongside ``CAN_ISOTP_CF_BROADCAST``).
+
+ * ``CAN_ISOTP_CF_BROADCAST``: use 1-to-N transmission without flow control
+ (cannot be specified alongside ``CAN_ISOTP_SF_BROADCAST``).
+ NOTE: this is not covered by the ISO 15765-2 standard.
+
+ * ``CAN_ISOTP_DYN_FC_PARMS``: enable dynamic update of flow control
+ parameters.
+
+* ``frame_txtime``: frame transmission time (defined as N_As/N_Ar inside the
+ ISO standard); if ``0``, the default (or the last set value) is used.
+ To set the transmission time to ``0``, the ``CAN_ISOTP_FRAME_TXTIME_ZERO``
+ macro (equal to 0xFFFFFFFF) shall be used.
+
+* ``ext_address``: extended addressing byte, used if the
+ ``CAN_ISOTP_EXTEND_ADDR`` flag is specified.
+
+* ``txpad_content``: byte used as padding value for transmitted frames.
+
+* ``rxpad_content``: byte used as padding value for received frames.
+
+* ``rx_ext_address``: extended addressing byte for the reception path, used if
+ the ``CAN_ISOTP_RX_EXT_ADDR`` flag is specified.
+
+Flow Control options
+~~~~~~~~~~~~~~~~~~~~
+
+Flow Control (FC) options can be passed using the ``CAN_ISOTP_RECV_FC`` optname
+to provide the communication parameters for receiving ISO-TP PDUs.
+
+.. code-block:: C
+
+ struct can_isotp_fc_options fc_opts;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_RECV_FC, &fc_opts, sizeof(fc_opts));
+
+where the ``can_isotp_fc_options`` structure has the following contents:
+
+.. code-block:: C
+
+ struct can_isotp_options {
+ u8 bs;
+ u8 stmin;
+ u8 wftmax;
+ };
+
+* ``bs``: blocksize provided in flow control frames.
+
+* ``stmin``: minimum separation time provided in flow control frames; can
+ have the following values (others are reserved):
+
+ * 0x00 - 0x7F : 0 - 127 ms
+
+ * 0xF1 - 0xF9 : 100 us - 900 us
+
+* ``wftmax``: maximum number of wait frames provided in flow control frames.
+
+Link Layer options
+~~~~~~~~~~~~~~~~~~
+
+Link Layer (LL) options can be passed using the ``CAN_ISOTP_LL_OPTS`` optname:
+
+.. code-block:: C
+
+ struct can_isotp_ll_options ll_opts;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_LL_OPTS, &ll_opts, sizeof(ll_opts));
+
+where the ``can_isotp_ll_options`` structure has the following contents:
+
+.. code-block:: C
+
+ struct can_isotp_ll_options {
+ u8 mtu;
+ u8 tx_dl;
+ u8 tx_flags;
+ };
+
+* ``mtu``: generated and accepted CAN frame type, can be equal to ``CAN_MTU``
+ for classical CAN frames or ``CANFD_MTU`` for CAN FD frames.
+
+* ``tx_dl``: maximum payload length for transmitted frames, can have one value
+ among: 8, 12, 16, 20, 24, 32, 48, 64. Values above 8 only apply to CAN FD
+ traffic (i.e.: ``mtu = CANFD_MTU``).
+
+* ``tx_flags``: flags set into ``struct canfd_frame.flags`` at frame creation.
+ Only applies to CAN FD traffic (i.e.: ``mtu = CANFD_MTU``).
+
+Transmission stmin
+~~~~~~~~~~~~~~~~~~
+
+The transmission minimum separation time (stmin) can be forced using the
+``CAN_ISOTP_TX_STMIN`` optname and providing an stmin value in microseconds as
+a 32bit unsigned integer; this will overwrite the value sent by the receiver in
+flow control frames:
+
+.. code-block:: C
+
+ uint32_t stmin;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_TX_STMIN, &stmin, sizeof(stmin));
+
+Reception stmin
+~~~~~~~~~~~~~~~
+
+The reception minimum separation time (stmin) can be forced using the
+``CAN_ISOTP_RX_STMIN`` optname and providing an stmin value in microseconds as
+a 32bit unsigned integer; received Consecutive Frames (CF) which timestamps
+differ less than this value will be ignored:
+
+.. code-block:: C
+
+ uint32_t stmin;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_RX_STMIN, &stmin, sizeof(stmin));
+
+Multi-frame transport support
+-----------------------------
+
+The ISO-TP stack contained inside the Linux kernel supports the multi-frame
+transport mechanism defined by the standard, with the following constraints:
+
+* the maximum size of a PDU is defined by a module parameter, with an hard
+ limit imposed at build time.
+
+* when a transmission is in progress, subsequent calls to ``write(2)`` will
+ block, while calls to ``send(2)`` will either block or fail depending on the
+ presence of the ``MSG_DONTWAIT`` flag.
+
+* no support is present for sending "wait frames": whether a PDU can be fully
+ received or not is decided when the First Frame is received.
+
+Errors
+------
+
+Following errors are reported to userspace:
+
+RX path errors
+~~~~~~~~~~~~~~
+
+============ ===============================================================
+-ETIMEDOUT timeout of data reception
+-EILSEQ sequence number mismatch during a multi-frame reception
+-EBADMSG data reception with wrong padding
+============ ===============================================================
+
+TX path errors
+~~~~~~~~~~~~~~
+
+========== =================================================================
+-ECOMM flow control reception timeout
+-EMSGSIZE flow control reception overflow
+-EBADMSG flow control reception with wrong layout/padding
+========== =================================================================
+
+Examples
+========
+
+Basic node example
+------------------
+
+Following example implements a node using "normal" physical addressing, with
+RX ID equal to 0x18DAF142 and a TX ID equal to 0x18DA42F1. All options are left
+to their default.
+
+.. code-block:: C
+
+ int s;
+ struct sockaddr_can addr;
+ int ret;
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_ISOTP);
+ if (s < 0)
+ exit(1);
+
+ addr.can_family = AF_CAN;
+ addr.can_ifindex = if_nametoindex("can0");
+ addr.tp.tx_id = 0x18DA42F1 | CAN_EFF_FLAG;
+ addr.tp.rx_id = 0x18DAF142 | CAN_EFF_FLAG;
+
+ ret = bind(s, (struct sockaddr *)&addr, sizeof(addr));
+ if (ret < 0)
+ exit(1);
+
+ /* Data can now be received using read(s, ...) and sent using write(s, ...) */
+
+Additional examples
+-------------------
+
+More complete (and complex) examples can be found inside the ``isotp*`` userland
+tools, distributed as part of the ``can-utils`` utilities at:
+https://github.com/linux-can/can-utils
diff --git a/Documentation/networking/l2tp.rst b/Documentation/networking/l2tp.rst
index 8496b467dea4..e8cf8b3e60ac 100644
--- a/Documentation/networking/l2tp.rst
+++ b/Documentation/networking/l2tp.rst
@@ -638,9 +638,8 @@ Tunnels are identified by a unique tunnel id. The id is 16-bit for
L2TPv2 and 32-bit for L2TPv3. Internally, the id is stored as a 32-bit
value.
-Tunnels are kept in a per-net list, indexed by tunnel id. The tunnel
-id namespace is shared by L2TPv2 and L2TPv3. The tunnel context can be
-derived from the socket's sk_user_data.
+Tunnels are kept in a per-net list, indexed by tunnel id. The
+tunnel id namespace is shared by L2TPv2 and L2TPv3.
Handling tunnel socket close is perhaps the most tricky part of the
L2TP implementation. If userspace closes a tunnel socket, the L2TP
@@ -652,9 +651,7 @@ socket's encap_destroy handler is invoked, which L2TP uses to initiate
its tunnel close actions. For L2TPIP sockets, the socket's close
handler initiates the same tunnel close actions. All sessions are
first closed. Each session drops its tunnel ref. When the tunnel ref
-reaches zero, the tunnel puts its socket ref. When the socket is
-eventually destroyed, its sk_destruct finally frees the L2TP tunnel
-context.
+reaches zero, the tunnel drops its socket ref.
Sessions
--------
@@ -667,10 +664,7 @@ pseudowire) or other data types such as PPP, ATM, HDLC or Frame
Relay. Linux currently implements only Ethernet and PPP session types.
Some L2TP session types also have a socket (PPP pseudowires) while
-others do not (Ethernet pseudowires). We can't therefore use the
-socket reference count as the reference count for session
-contexts. The L2TP implementation therefore has its own internal
-reference counts on the session contexts.
+others do not (Ethernet pseudowires).
Like tunnels, L2TP sessions are identified by a unique
session id. Just as with tunnel ids, the session id is 16-bit for
@@ -680,21 +674,19 @@ value.
Sessions hold a ref on their parent tunnel to ensure that the tunnel
stays extant while one or more sessions references it.
-Sessions are kept in a per-tunnel list, indexed by session id. L2TPv3
-sessions are also kept in a per-net list indexed by session id,
-because L2TPv3 session ids are unique across all tunnels and L2TPv3
-data packets do not contain a tunnel id in the header. This list is
-therefore needed to find the session context associated with a
-received data packet when the tunnel context cannot be derived from
-the tunnel socket.
+Sessions are kept in a per-net list. L2TPv2 sessions and L2TPv3
+sessions are stored in separate lists. L2TPv2 sessions are keyed
+by a 32-bit key made up of the 16-bit tunnel ID and 16-bit
+session ID. L2TPv3 sessions are keyed by the 32-bit session ID, since
+L2TPv3 session ids are unique across all tunnels.
Although the L2TPv3 RFC specifies that L2TPv3 session ids are not
-scoped by the tunnel, the kernel does not police this for L2TPv3 UDP
-tunnels and does not add sessions of L2TPv3 UDP tunnels into the
-per-net session list. In the UDP receive code, we must trust that the
-tunnel can be identified using the tunnel socket's sk_user_data and
-lookup the session in the tunnel's session list instead of the per-net
-session list.
+scoped by the tunnel, the Linux implementation has historically
+allowed this. Such session id collisions are supported using a per-net
+hash table keyed by sk and session ID. When looking up L2TPv3
+sessions, the list entry may link to multiple sessions with that
+session ID, in which case the session matching the given sk (tunnel)
+is used.
PPP
---
@@ -714,10 +706,9 @@ The L2TP PPP implementation handles the closing of a PPPoL2TP socket
by closing its corresponding L2TP session. This is complicated because
it must consider racing with netlink session create/destroy requests
and pppol2tp_connect trying to reconnect with a session that is in the
-process of being closed. Unlike tunnels, PPP sessions do not hold a
-ref on their associated socket, so code must be careful to sock_hold
-the socket where necessary. For all the details, see commit
-3d609342cc04129ff7568e19316ce3d7451a27e8.
+process of being closed. PPP sessions hold a ref on their associated
+socket in order that the socket remains extants while the session
+references it.
Ethernet
--------
@@ -761,15 +752,10 @@ Limitations
The current implementation has a number of limitations:
- 1) Multiple UDP sockets with the same 5-tuple address cannot be
- used. The kernel's tunnel context is identified using private
- data associated with the socket so it is important that each
- socket is uniquely identified by its address.
-
- 2) Interfacing with openvswitch is not yet implemented. It may be
+ 1) Interfacing with openvswitch is not yet implemented. It may be
useful to map OVS Ethernet and VLAN ports into L2TPv3 tunnels.
- 3) VLAN pseudowires are implemented using an ``l2tpethN`` interface
+ 2) VLAN pseudowires are implemented using an ``l2tpethN`` interface
configured with a VLAN sub-interface. Since L2TPv3 VLAN
pseudowires carry one and only one VLAN, it may be better to use
a single netdevice rather than an ``l2tpethN`` and ``l2tpethN``:M
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
index 69975ce25a02..95598c21fc8e 100644
--- a/Documentation/networking/mptcp-sysctl.rst
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -7,14 +7,6 @@ MPTCP Sysfs variables
/proc/sys/net/mptcp/* Variables
===============================
-enabled - BOOLEAN
- Control whether MPTCP sockets can be created.
-
- MPTCP sockets can be created if the value is 1. This is a
- per-namespace sysctl.
-
- Default: 1 (enabled)
-
add_addr_timeout - INTEGER (seconds)
Set the timeout after which an ADD_ADDR control message will be
resent to an MPTCP peer that has not acknowledged a previous
@@ -25,16 +17,33 @@ add_addr_timeout - INTEGER (seconds)
Default: 120
-close_timeout - INTEGER (seconds)
- Set the make-after-break timeout: in absence of any close or
- shutdown syscall, MPTCP sockets will maintain the status
- unchanged for such time, after the last subflow removal, before
- moving to TCP_CLOSE.
+allow_join_initial_addr_port - BOOLEAN
+ Allow peers to send join requests to the IP address and port number used
+ by the initial subflow if the value is 1. This controls a flag that is
+ sent to the peer at connection time, and whether such join requests are
+ accepted or denied.
- The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
- sysctl.
+ Joins to addresses advertised with ADD_ADDR are not affected by this
+ value.
- Default: 60
+ This is a per-namespace sysctl.
+
+ Default: 1
+
+available_schedulers - STRING
+ Shows the available schedulers choices that are registered. More packet
+ schedulers may be available, but not loaded.
+
+blackhole_timeout - INTEGER (seconds)
+ Initial time period in second to disable MPTCP on active MPTCP sockets
+ when a MPTCP firewall blackhole issue happens. This time period will
+ grow exponentially when more blackhole issues get detected right after
+ MPTCP is re-enabled and will reset to the initial value when the
+ blackhole issue goes away.
+
+ 0 to disable the blackhole detection.
+
+ Default: 3600
checksum_enabled - BOOLEAN
Control whether DSS checksum can be enabled.
@@ -44,18 +53,24 @@ checksum_enabled - BOOLEAN
Default: 0
-allow_join_initial_addr_port - BOOLEAN
- Allow peers to send join requests to the IP address and port number used
- by the initial subflow if the value is 1. This controls a flag that is
- sent to the peer at connection time, and whether such join requests are
- accepted or denied.
+close_timeout - INTEGER (seconds)
+ Set the make-after-break timeout: in absence of any close or
+ shutdown syscall, MPTCP sockets will maintain the status
+ unchanged for such time, after the last subflow removal, before
+ moving to TCP_CLOSE.
- Joins to addresses advertised with ADD_ADDR are not affected by this
- value.
+ The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
+ sysctl.
- This is a per-namespace sysctl.
+ Default: 60
- Default: 1
+enabled - BOOLEAN
+ Control whether MPTCP sockets can be created.
+
+ MPTCP sockets can be created if the value is 1. This is a
+ per-namespace sysctl.
+
+ Default: 1 (enabled)
pm_type - INTEGER
Set the default path manager type to use for each new MPTCP
@@ -74,6 +89,14 @@ pm_type - INTEGER
Default: 0
+scheduler - STRING
+ Select the scheduler of your choice.
+
+ Support for selection of different schedulers. This is a per-namespace
+ sysctl.
+
+ Default: "default"
+
stale_loss_cnt - INTEGER
The number of MPTCP-level retransmission intervals with no traffic and
pending outstanding data on a given subflow required to declare it stale.
@@ -85,11 +108,3 @@ stale_loss_cnt - INTEGER
This is a per-namespace sysctl.
Default: 4
-
-scheduler - STRING
- Select the scheduler of your choice.
-
- Support for selection of different schedulers. This is a per-namespace
- sysctl.
-
- Default: "default"
diff --git a/Documentation/networking/mptcp.rst b/Documentation/networking/mptcp.rst
new file mode 100644
index 000000000000..17f2bab61164
--- /dev/null
+++ b/Documentation/networking/mptcp.rst
@@ -0,0 +1,156 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multipath TCP (MPTCP)
+=====================
+
+Introduction
+============
+
+Multipath TCP or MPTCP is an extension to the standard TCP and is described in
+`RFC 8684 (MPTCPv1) <https://www.rfc-editor.org/rfc/rfc8684.html>`_. It allows a
+device to make use of multiple interfaces at once to send and receive TCP
+packets over a single MPTCP connection. MPTCP can aggregate the bandwidth of
+multiple interfaces or prefer the one with the lowest latency. It also allows a
+fail-over if one path is down, and the traffic is seamlessly reinjected on other
+paths.
+
+For more details about Multipath TCP in the Linux kernel, please see the
+official website: `mptcp.dev <https://www.mptcp.dev>`_.
+
+
+Use cases
+=========
+
+Thanks to MPTCP, being able to use multiple paths in parallel or simultaneously
+brings new use-cases, compared to TCP:
+
+- Seamless handovers: switching from one path to another while preserving
+ established connections, e.g. to be used in mobility use-cases, like on
+ smartphones.
+- Best network selection: using the "best" available path depending on some
+ conditions, e.g. latency, losses, cost, bandwidth, etc.
+- Network aggregation: using multiple paths at the same time to have a higher
+ throughput, e.g. to combine fixed and mobile networks to send files faster.
+
+
+Concepts
+========
+
+Technically, when a new socket is created with the ``IPPROTO_MPTCP`` protocol
+(Linux-specific), a *subflow* (or *path*) is created. This *subflow* consists of
+a regular TCP connection that is used to transmit data through one interface.
+Additional *subflows* can be negotiated later between the hosts. For the remote
+host to be able to detect the use of MPTCP, a new field is added to the TCP
+*option* field of the underlying TCP *subflow*. This field contains, amongst
+other things, a ``MP_CAPABLE`` option that tells the other host to use MPTCP if
+it is supported. If the remote host or any middlebox in between does not support
+it, the returned ``SYN+ACK`` packet will not contain MPTCP options in the TCP
+*option* field. In that case, the connection will be "downgraded" to plain TCP,
+and it will continue with a single path.
+
+This behavior is made possible by two internal components: the path manager, and
+the packet scheduler.
+
+Path Manager
+------------
+
+The Path Manager is in charge of *subflows*, from creation to deletion, and also
+address announcements. Typically, it is the client side that initiates subflows,
+and the server side that announces additional addresses via the ``ADD_ADDR`` and
+``REMOVE_ADDR`` options.
+
+Path managers are controlled by the ``net.mptcp.pm_type`` sysctl knob -- see
+mptcp-sysctl.rst. There are two types: the in-kernel one (type ``0``) where the
+same rules are applied for all the connections (see: ``ip mptcp``) ; and the
+userspace one (type ``1``), controlled by a userspace daemon (i.e. `mptcpd
+<https://mptcpd.mptcp.dev/>`_) where different rules can be applied for each
+connection. The path managers can be controlled via a Netlink API; see
+netlink_spec/mptcp_pm.rst.
+
+To be able to use multiple IP addresses on a host to create multiple *subflows*
+(paths), the default in-kernel MPTCP path-manager needs to know which IP
+addresses can be used. This can be configured with ``ip mptcp endpoint`` for
+example.
+
+Packet Scheduler
+----------------
+
+The Packet Scheduler is in charge of selecting which available *subflow(s)* to
+use to send the next data packet. It can decide to maximize the use of the
+available bandwidth, only to pick the path with the lower latency, or any other
+policy depending on the configuration.
+
+Packet schedulers are controlled by the ``net.mptcp.scheduler`` sysctl knob --
+see mptcp-sysctl.rst.
+
+
+Sockets API
+===========
+
+Creating MPTCP sockets
+----------------------
+
+On Linux, MPTCP can be used by selecting MPTCP instead of TCP when creating the
+``socket``:
+
+.. code-block:: C
+
+ int sd = socket(AF_INET(6), SOCK_STREAM, IPPROTO_MPTCP);
+
+Note that ``IPPROTO_MPTCP`` is defined as ``262``.
+
+If MPTCP is not supported, ``errno`` will be set to:
+
+- ``EINVAL``: (*Invalid argument*): MPTCP is not available, on kernels < 5.6.
+- ``EPROTONOSUPPORT`` (*Protocol not supported*): MPTCP has not been compiled,
+ on kernels >= v5.6.
+- ``ENOPROTOOPT`` (*Protocol not available*): MPTCP has been disabled using
+ ``net.mptcp.enabled`` sysctl knob; see mptcp-sysctl.rst.
+
+MPTCP is then opt-in: applications need to explicitly request it. Note that
+applications can be forced to use MPTCP with different techniques, e.g.
+``LD_PRELOAD`` (see ``mptcpize``), eBPF (see ``mptcpify``), SystemTAP,
+``GODEBUG`` (``GODEBUG=multipathtcp=1``), etc.
+
+Switching to ``IPPROTO_MPTCP`` instead of ``IPPROTO_TCP`` should be as
+transparent as possible for the userspace applications.
+
+Socket options
+--------------
+
+MPTCP supports most socket options handled by TCP. It is possible some less
+common options are not supported, but contributions are welcome.
+
+Generally, the same value is propagated to all subflows, including the ones
+created after the calls to ``setsockopt()``. eBPF can be used to set different
+values per subflow.
+
+There are some MPTCP specific socket options at the ``SOL_MPTCP`` (284) level to
+retrieve info. They fill the ``optval`` buffer of the ``getsockopt()`` system
+call:
+
+- ``MPTCP_INFO``: Uses ``struct mptcp_info``.
+- ``MPTCP_TCPINFO``: Uses ``struct mptcp_subflow_data``, followed by an array of
+ ``struct tcp_info``.
+- ``MPTCP_SUBFLOW_ADDRS``: Uses ``struct mptcp_subflow_data``, followed by an
+ array of ``mptcp_subflow_addrs``.
+- ``MPTCP_FULL_INFO``: Uses ``struct mptcp_full_info``, with one pointer to an
+ array of ``struct mptcp_subflow_info`` (including the
+ ``struct mptcp_subflow_addrs``), and one pointer to an array of
+ ``struct tcp_info``, followed by the content of ``struct mptcp_info``.
+
+Note that at the TCP level, ``TCP_IS_MPTCP`` socket option can be used to know
+if MPTCP is currently being used: the value will be set to 1 if it is.
+
+
+Design choices
+==============
+
+A new socket type has been added for MPTCP for the userspace-facing socket. The
+kernel is in charge of creating subflow sockets: they are TCP sockets where the
+behavior is modified using TCP-ULP.
+
+MPTCP listen sockets will create "plain" *accepted* TCP sockets if the
+connection request from the client didn't ask for MPTCP, making the performance
+impact minimal when MPTCP is enabled by default.
diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
index 268819225866..2cd25d81aaa7 100644
--- a/Documentation/networking/multi-pf-netdev.rst
+++ b/Documentation/networking/multi-pf-netdev.rst
@@ -111,11 +111,11 @@ The relation between PF, irq, napi, and queue can be observed via netlink spec::
Here you can clearly observe our channels distribution policy::
$ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
- /proc/irq/36/mlx5_comp1@pci:0000:08:00.0
- /proc/irq/39/mlx5_comp1@pci:0000:09:00.0
- /proc/irq/40/mlx5_comp2@pci:0000:08:00.0
- /proc/irq/41/mlx5_comp2@pci:0000:09:00.0
- /proc/irq/42/mlx5_comp3@pci:0000:08:00.0
+ /proc/irq/36/mlx5_comp0@pci:0000:08:00.0
+ /proc/irq/39/mlx5_comp0@pci:0000:09:00.0
+ /proc/irq/40/mlx5_comp1@pci:0000:08:00.0
+ /proc/irq/41/mlx5_comp1@pci:0000:09:00.0
+ /proc/irq/42/mlx5_comp2@pci:0000:08:00.0
Steering
========
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index 7bf7b95c4f7a..dfa5d549be9c 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -144,9 +144,8 @@ IRQ should only be unmasked after a successful call to napi_complete_done():
napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
of guarantees given by being invoked in IRQ context (no need to
-mask interrupts). Note that PREEMPT_RT forces all interrupts
-to be threaded so the interrupt may need to be marked ``IRQF_NO_THREAD``
-to avoid issues on real-time kernel configurations.
+mask interrupts). napi_schedule_irqoff() will fall back to napi_schedule() if
+IRQs are threaded (such as if ``PREEMPT_RT`` is enabled).
Instance to queue mapping
-------------------------
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 70c4fb9d4e5c..22b07c814f4a 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -7,6 +7,8 @@ net_device struct fast path usage breakdown
Type Name fastpath_tx_access fastpath_rx_access Comments
..struct ..net_device
+unsigned_long:32 priv_flags read_mostly - __dev_queue_xmit(tx)
+unsigned_long:1 lltx read_mostly - HARD_TX_LOCK,HARD_TX_TRYLOCK,HARD_TX_UNLOCK(tx)
char name[16] - -
struct_netdev_name_node* name_node
struct_dev_ifalias* ifalias
@@ -23,7 +25,6 @@ struct_list_head ptype_specific
struct adj_list
unsigned_int flags read_mostly read_mostly __dev_queue_xmit,__dev_xmit_skb,ip6_output,__ip6_finish_output(tx);ip6_rcv_core(rx)
xdp_features_t xdp_features
-unsigned_long_long priv_flags read_mostly - __dev_queue_xmit(tx)
struct_net_device_ops* netdev_ops read_mostly - netdev_core_pick_tx,netdev_start_xmit(tx)
struct_xdp_metadata_ops* xdp_metadata_ops
int ifindex - read_mostly ip6_rcv_core
@@ -98,7 +99,7 @@ unsigned_int num_rx_queues
unsigned_int real_num_rx_queues - read_mostly get_rps_cpu
struct_bpf_prog* xdp_prog - read_mostly netif_elide_gro()
unsigned_long gro_flush_timeout - read_mostly napi_complete_done
-int napi_defer_hard_irqs - read_mostly napi_complete_done
+u32 napi_defer_hard_irqs - read_mostly napi_complete_done
unsigned_int gro_max_size - read_mostly skb_gro_receive
unsigned_int gro_ipv4_max_size - read_mostly skb_gro_receive
rx_handler_func_t* rx_handler read_mostly - __netif_receive_skb_core
@@ -163,6 +164,10 @@ struct_lock_class_key* qdisc_tx_busylock
bool proto_down
unsigned:1 wol_enabled
unsigned:1 threaded - - napi_poll(napi_enable,dev_set_threaded)
+unsigned_long:1 see_all_hwtstamp_requests
+unsigned_long:1 change_proto_down
+unsigned_long:1 netns_local
+unsigned_long:1 fcoe_mtu
struct_list_head net_notifier_list
struct_macsec_ops* macsec_ops
struct_udp_tunnel_nic_info* udp_tunnel_nic_info
@@ -176,3 +181,5 @@ netdevice_tracker dev_registered_tracker
struct_rtnl_hw_stats64* offload_xstats_l3
struct_devlink_port* devlink_port
struct_dpll_pin* dpll_pin
+struct hlist_head page_pools
+struct dim_irq_moder* irq_moder
diff --git a/Documentation/networking/net_dim.rst b/Documentation/networking/net_dim.rst
index 3bed9fd95336..8908fd7b0a8d 100644
--- a/Documentation/networking/net_dim.rst
+++ b/Documentation/networking/net_dim.rst
@@ -169,6 +169,48 @@ usage is not complete but it should make the outline of the usage clear.
...
}
+
+Tuning DIM
+==========
+
+Net DIM serves a range of network devices and delivers excellent acceleration
+benefits. Yet, it has been observed that some preset configurations of DIM may
+not align seamlessly with the varying specifications of network devices, and
+this discrepancy has been identified as a factor to the suboptimal performance
+outcomes of DIM-enabled network devices, related to a mismatch in profiles.
+
+To address this issue, Net DIM introduces a per-device control to modify and
+access a device's ``rx-profile`` and ``tx-profile`` parameters:
+Assume that the target network device is named ethx, and ethx only declares
+support for RX profile setting and supports modification of ``usec`` field
+and ``pkts`` field (See the data structure:
+:c:type:`struct dim_cq_moder <dim_cq_moder>`).
+
+You can use ethtool to modify the current RX DIM profile where all
+values are 64::
+
+ $ ethtool -C ethx rx-profile 1,1,n_2,2,n_3,n,n_n,4,n_n,n,n
+
+``n`` means do not modify this field, and ``_`` separates structure
+elements of the profile array.
+
+Querying the current profiles using::
+
+ $ ethtool -c ethx
+ ...
+ rx-profile:
+ {.usec = 1, .pkts = 1, .comps = n/a,},
+ {.usec = 2, .pkts = 2, .comps = n/a,},
+ {.usec = 3, .pkts = 64, .comps = n/a,},
+ {.usec = 64, .pkts = 4, .comps = n/a,},
+ {.usec = 64, .pkts = 64, .comps = n/a,}
+ tx-profile: n/a
+
+If the network device does not support specific fields of DIM profiles,
+the corresponding ``n/a`` will display. If the ``n/a`` field is being
+modified, error messages will be reported.
+
+
Dynamic Interrupt Moderation (DIM) library API
==============================================
diff --git a/Documentation/networking/netdev-features.rst b/Documentation/networking/netdev-features.rst
index d7b15bb64deb..5014f7cc1398 100644
--- a/Documentation/networking/netdev-features.rst
+++ b/Documentation/networking/netdev-features.rst
@@ -139,21 +139,6 @@ chained skbs (skb->next/prev list).
Features contained in NETIF_F_SOFT_FEATURES are features of networking
stack. Driver should not change behaviour based on them.
- * LLTX driver (deprecated for hardware drivers)
-
-NETIF_F_LLTX is meant to be used by drivers that don't need locking at all,
-e.g. software tunnels.
-
-This is also used in a few legacy drivers that implement their
-own locking, don't use it for new (hardware) drivers.
-
- * netns-local device
-
-NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between
-network namespaces (e.g. loopback).
-
-Don't use it in drivers.
-
* VLAN challenged
NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN
diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
index c2476917a6c3..857c9784f87e 100644
--- a/Documentation/networking/netdevices.rst
+++ b/Documentation/networking/netdevices.rst
@@ -258,11 +258,11 @@ ndo_get_stats:
ndo_start_xmit:
Synchronization: __netif_tx_lock spinlock.
- When the driver sets NETIF_F_LLTX in dev->features this will be
+ When the driver sets dev->lltx this will be
called without holding netif_tx_lock. In this case the driver
has to lock by itself when needed.
The locking there should also properly protect against
- set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
+ set_rx_mode. WARNING: use of dev->lltx is deprecated.
Don't use it for new drivers.
Context: Process with BHs disabled or BH (timer),
diff --git a/Documentation/networking/oa-tc6-framework.rst b/Documentation/networking/oa-tc6-framework.rst
new file mode 100644
index 000000000000..fe2aabde923a
--- /dev/null
+++ b/Documentation/networking/oa-tc6-framework.rst
@@ -0,0 +1,497 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=========================================================================
+OPEN Alliance 10BASE-T1x MAC-PHY Serial Interface (TC6) Framework Support
+=========================================================================
+
+Introduction
+------------
+
+The IEEE 802.3cg project defines two 10 Mbit/s PHYs operating over a
+single pair of conductors. The 10BASE-T1L (Clause 146) is a long reach
+PHY supporting full duplex point-to-point operation over 1 km of single
+balanced pair of conductors. The 10BASE-T1S (Clause 147) is a short reach
+PHY supporting full / half duplex point-to-point operation over 15 m of
+single balanced pair of conductors, or half duplex multidrop bus
+operation over 25 m of single balanced pair of conductors.
+
+Furthermore, the IEEE 802.3cg project defines the new Physical Layer
+Collision Avoidance (PLCA) Reconciliation Sublayer (Clause 148) meant to
+provide improved determinism to the CSMA/CD media access method. PLCA
+works in conjunction with the 10BASE-T1S PHY operating in multidrop mode.
+
+The aforementioned PHYs are intended to cover the low-speed / low-cost
+applications in industrial and automotive environment. The large number
+of pins (16) required by the MII interface, which is specified by the
+IEEE 802.3 in Clause 22, is one of the major cost factors that need to be
+addressed to fulfil this objective.
+
+The MAC-PHY solution integrates an IEEE Clause 4 MAC and a 10BASE-T1x PHY
+exposing a low pin count Serial Peripheral Interface (SPI) to the host
+microcontroller. This also enables the addition of Ethernet functionality
+to existing low-end microcontrollers which do not integrate a MAC
+controller.
+
+Overview
+--------
+
+The MAC-PHY is specified to carry both data (Ethernet frames) and control
+(register access) transactions over a single full-duplex serial peripheral
+interface.
+
+Protocol Overview
+-----------------
+
+Two types of transactions are defined in the protocol: data transactions
+for Ethernet frame transfers and control transactions for register
+read/write transfers. A chunk is the basic element of data transactions
+and is composed of 4 bytes of overhead plus 64 bytes of payload size for
+each chunk. Ethernet frames are transferred over one or more data chunks.
+Control transactions consist of one or more register read/write control
+commands.
+
+SPI transactions are initiated by the SPI host with the assertion of CSn
+low to the MAC-PHY and ends with the deassertion of CSn high. In between
+each SPI transaction, the SPI host may need time for additional
+processing and to setup the next SPI data or control transaction.
+
+SPI data transactions consist of an equal number of transmit (TX) and
+receive (RX) chunks. Chunks in both transmit and receive directions may
+or may not contain valid frame data independent from each other, allowing
+for the simultaneous transmission and reception of different length
+frames.
+
+Each transmit data chunk begins with a 32-bit data header followed by a
+data chunk payload on MOSI. The data header indicates whether transmit
+frame data is present and provides the information to determine which
+bytes of the payload contain valid frame data.
+
+In parallel, receive data chunks are received on MISO. Each receive data
+chunk consists of a data chunk payload ending with a 32-bit data footer.
+The data footer indicates if there is receive frame data present within
+the payload or not and provides the information to determine which bytes
+of the payload contain valid frame data.
+
+Reference
+---------
+
+10BASE-T1x MAC-PHY Serial Interface Specification,
+
+Link: https://opensig.org/download/document/OPEN_Alliance_10BASET1x_MAC-PHY_Serial_Interface_V1.1.pdf
+
+Hardware Architecture
+---------------------
+
+.. code-block:: none
+
+ +----------+ +-------------------------------------+
+ | | | MAC-PHY |
+ | |<---->| +-----------+ +-------+ +-------+ |
+ | SPI Host | | | SPI Slave | | MAC | | PHY | |
+ | | | +-----------+ +-------+ +-------+ |
+ +----------+ +-------------------------------------+
+
+Software Architecture
+---------------------
+
+.. code-block:: none
+
+ +----------------------------------------------------------+
+ | Networking Subsystem |
+ +----------------------------------------------------------+
+ / \ / \
+ | |
+ | |
+ \ / |
+ +----------------------+ +-----------------------------+
+ | MAC Driver |<--->| OPEN Alliance TC6 Framework |
+ +----------------------+ +-----------------------------+
+ / \ / \
+ | |
+ | |
+ | \ /
+ +----------------------------------------------------------+
+ | SPI Subsystem |
+ +----------------------------------------------------------+
+ / \
+ |
+ |
+ \ /
+ +----------------------------------------------------------+
+ | 10BASE-T1x MAC-PHY Device |
+ +----------------------------------------------------------+
+
+Implementation
+--------------
+
+MAC Driver
+~~~~~~~~~~
+
+- Probed by SPI subsystem.
+
+- Initializes OA TC6 framework for the MAC-PHY.
+
+- Registers and configures the network device.
+
+- Sends the tx ethernet frames from n/w subsystem to OA TC6 framework.
+
+OPEN Alliance TC6 Framework
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Initializes PHYLIB interface.
+
+- Registers mac-phy interrupt.
+
+- Performs mac-phy register read/write operation using the control
+ transaction protocol specified in the OPEN Alliance 10BASE-T1x MAC-PHY
+ Serial Interface specification.
+
+- Performs Ethernet frames transaction using the data transaction protocol
+ for Ethernet frames specified in the OPEN Alliance 10BASE-T1x MAC-PHY
+ Serial Interface specification.
+
+- Forwards the received Ethernet frame from 10Base-T1x MAC-PHY to n/w
+ subsystem.
+
+Data Transaction
+~~~~~~~~~~~~~~~~
+
+The Ethernet frames that are typically transferred from the SPI host to
+the MAC-PHY will be converted into multiple transmit data chunks. Each
+transmit data chunk will have a 4 bytes header which contains the
+information needed to determine the validity and the location of the
+transmit frame data within the 64 bytes data chunk payload.
+
+.. code-block:: none
+
+ +---------------------------------------------------+
+ | Tx Chunk |
+ | +---------------------------+ +----------------+ | MOSI
+ | | 64 bytes chunk payload | | 4 bytes header | |------------>
+ | +---------------------------+ +----------------+ |
+ +---------------------------------------------------+
+
+4 bytes header contains the below fields,
+
+DNC (Bit 31) - Data-Not-Control flag. This flag specifies the type of SPI
+ transaction. For TX data chunks, this bit shall be ’1’.
+ 0 - Control command
+ 1 - Data chunk
+
+SEQ (Bit 30) - Data Chunk Sequence. This bit is used to indicate an
+ even/odd transmit data chunk sequence to the MAC-PHY.
+
+NORX (Bit 29) - No Receive flag. The SPI host may set this bit to prevent
+ the MAC-PHY from conveying RX data on the MISO for the
+ current chunk (DV = 0 in the footer), indicating that the
+ host would not process it. Typically, the SPI host should
+ set NORX = 0 indicating that it will accept and process
+ any receive frame data within the current chunk.
+
+RSVD (Bit 28..24) - Reserved: All reserved bits shall be ‘0’.
+
+VS (Bit 23..22) - Vendor Specific. These bits are implementation specific.
+ If the MAC-PHY does not implement these bits, the host
+ shall set them to ‘0’.
+
+DV (Bit 21) - Data Valid flag. The SPI host uses this bit to indicate
+ whether the current chunk contains valid transmit frame data
+ (DV = 1) or not (DV = 0). When ‘0’, the MAC-PHY ignores the
+ chunk payload. Note that the receive path is unaffected by
+ the setting of the DV bit in the data header.
+
+SV (Bit 20) - Start Valid flag. The SPI host shall set this bit when the
+ beginning of an Ethernet frame is present in the current
+ transmit data chunk payload. Otherwise, this bit shall be
+ zero. This bit is not to be confused with the Start-of-Frame
+ Delimiter (SFD) byte described in IEEE 802.3 [2].
+
+SWO (Bit 19..16) - Start Word Offset. When SV = 1, this field shall
+ contain the 32-bit word offset into the transmit data
+ chunk payload that points to the start of a new
+ Ethernet frame to be transmitted. The host shall write
+ this field as zero when SV = 0.
+
+RSVD (Bit 15) - Reserved: All reserved bits shall be ‘0’.
+
+EV (Bit 14) - End Valid flag. The SPI host shall set this bit when the end
+ of an Ethernet frame is present in the current transmit data
+ chunk payload. Otherwise, this bit shall be zero.
+
+EBO (Bit 13..8) - End Byte Offset. When EV = 1, this field shall contain
+ the byte offset into the transmit data chunk payload
+ that points to the last byte of the Ethernet frame to
+ transmit. This field shall be zero when EV = 0.
+
+TSC (Bit 7..6) - Timestamp Capture. Request a timestamp capture when the
+ frame is transmitted onto the network.
+ 00 - Do not capture a timestamp
+ 01 - Capture timestamp into timestamp capture register A
+ 10 - Capture timestamp into timestamp capture register B
+ 11 - Capture timestamp into timestamp capture register C
+
+RSVD (Bit 5..1) - Reserved: All reserved bits shall be ‘0’.
+
+P (Bit 0) - Parity. Parity bit calculated over the transmit data header.
+ Method used is odd parity.
+
+The number of buffers available in the MAC-PHY to store the incoming
+transmit data chunk payloads is represented as transmit credits. The
+available transmit credits in the MAC-PHY can be read either from the
+Buffer Status Register or footer (Refer below for the footer info)
+received from the MAC-PHY. The SPI host should not write more data chunks
+than the available transmit credits as this will lead to transmit buffer
+overflow error.
+
+In case the previous data footer had no transmit credits available and
+once the transmit credits become available for transmitting transmit data
+chunks, the MAC-PHY interrupt is asserted to SPI host. On reception of the
+first data header this interrupt will be deasserted and the received
+footer for the first data chunk will have the transmit credits available
+information.
+
+The Ethernet frames that are typically transferred from MAC-PHY to SPI
+host will be sent as multiple receive data chunks. Each receive data
+chunk will have 64 bytes of data chunk payload followed by 4 bytes footer
+which contains the information needed to determine the validity and the
+location of the receive frame data within the 64 bytes data chunk payload.
+
+.. code-block:: none
+
+ +---------------------------------------------------+
+ | Rx Chunk |
+ | +----------------+ +---------------------------+ | MISO
+ | | 4 bytes footer | | 64 bytes chunk payload | |------------>
+ | +----------------+ +---------------------------+ |
+ +---------------------------------------------------+
+
+4 bytes footer contains the below fields,
+
+EXST (Bit 31) - Extended Status. This bit is set when any bit in the
+ STATUS0 or STATUS1 registers are set and not masked.
+
+HDRB (Bit 30) - Received Header Bad. When set, indicates that the MAC-PHY
+ received a control or data header with a parity error.
+
+SYNC (Bit 29) - Configuration Synchronized flag. This bit reflects the
+ state of the SYNC bit in the CONFIG0 configuration
+ register (see Table 12). A zero indicates that the MAC-PHY
+ configuration may not be as expected by the SPI host.
+ Following configuration, the SPI host sets the
+ corresponding bitin the configuration register which is
+ reflected in this field.
+
+RCA (Bit 28..24) - Receive Chunks Available. The RCA field indicates to
+ the SPI host the minimum number of additional receive
+ data chunks of frame data that are available for
+ reading beyond the current receive data chunk. This
+ field is zero when there is no receive frame data
+ pending in the MAC-PHY’s buffer for reading.
+
+VS (Bit 23..22) - Vendor Specific. These bits are implementation specific.
+ If not implemented, the MAC-PHY shall set these bits to
+ ‘0’.
+
+DV (Bit 21) - Data Valid flag. The MAC-PHY uses this bit to indicate
+ whether the current receive data chunk contains valid
+ receive frame data (DV = 1) or not (DV = 0). When ‘0’, the
+ SPI host shall ignore the chunk payload.
+
+SV (Bit 20) - Start Valid flag. The MAC-PHY sets this bit when the current
+ chunk payload contains the start of an Ethernet frame.
+ Otherwise, this bit is zero. The SV bit is not to be
+ confused with the Start-of-Frame Delimiter (SFD) byte
+ described in IEEE 802.3 [2].
+
+SWO (Bit 19..16) - Start Word Offset. When SV = 1, this field contains the
+ 32-bit word offset into the receive data chunk payload
+ containing the first byte of a new received Ethernet
+ frame. When a receive timestamp has been added to the
+ beginning of the received Ethernet frame (RTSA = 1)
+ then SWO points to the most significant byte of the
+ timestamp. This field will be zero when SV = 0.
+
+FD (Bit 15) - Frame Drop. When set, this bit indicates that the MAC has
+ detected a condition for which the SPI host should drop the
+ received Ethernet frame. This bit is only valid at the end
+ of a received Ethernet frame (EV = 1) and shall be zero at
+ all other times.
+
+EV (Bit 14) - End Valid flag. The MAC-PHY sets this bit when the end of a
+ received Ethernet frame is present in this receive data
+ chunk payload.
+
+EBO (Bit 13..8) - End Byte Offset: When EV = 1, this field contains the
+ byte offset into the receive data chunk payload that
+ locates the last byte of the received Ethernet frame.
+ This field is zero when EV = 0.
+
+RTSA (Bit 7) - Receive Timestamp Added. This bit is set when a 32-bit or
+ 64-bit timestamp has been added to the beginning of the
+ received Ethernet frame. The MAC-PHY shall set this bit to
+ zero when SV = 0.
+
+RTSP (Bit 6) - Receive Timestamp Parity. Parity bit calculated over the
+ 32-bit/64-bit timestamp added to the beginning of the
+ received Ethernet frame. Method used is odd parity. The
+ MAC-PHY shall set this bit to zero when RTSA = 0.
+
+TXC (Bit 5..1) - Transmit Credits. This field contains the minimum number
+ of transmit data chunks of frame data that the SPI host
+ can write in a single transaction without incurring a
+ transmit buffer overflow error.
+
+P (Bit 0) - Parity. Parity bit calculated over the receive data footer.
+ Method used is odd parity.
+
+SPI host will initiate the data receive transaction based on the receive
+chunks available in the MAC-PHY which is provided in the receive chunk
+footer (RCA - Receive Chunks Available). SPI host will create data invalid
+transmit data chunks (empty chunks) or data valid transmit data chunks in
+case there are valid Ethernet frames to transmit to the MAC-PHY. The
+receive chunks available in MAC-PHY can be read either from the Buffer
+Status Register or footer.
+
+In case the previous data footer had no receive data chunks available and
+once the receive data chunks become available again for reading, the
+MAC-PHY interrupt is asserted to SPI host. On reception of the first data
+header this interrupt will be deasserted and the received footer for the
+first data chunk will have the receive chunks available information.
+
+MAC-PHY Interrupt
+~~~~~~~~~~~~~~~~~
+
+The MAC-PHY interrupt is asserted when the following conditions are met.
+
+Receive chunks available - This interrupt is asserted when the previous
+data footer had no receive data chunks available and once the receive
+data chunks become available for reading. On reception of the first data
+header this interrupt will be deasserted.
+
+Transmit chunk credits available - This interrupt is asserted when the
+previous data footer indicated no transmit credits available and once the
+transmit credits become available for transmitting transmit data chunks.
+On reception of the first data header this interrupt will be deasserted.
+
+Extended status event - This interrupt is asserted when the previous data
+footer indicated no extended status and once the extended event become
+available. In this case the host should read status #0 register to know
+the corresponding error/event. On reception of the first data header this
+interrupt will be deasserted.
+
+Control Transaction
+~~~~~~~~~~~~~~~~~~~
+
+4 bytes control header contains the below fields,
+
+DNC (Bit 31) - Data-Not-Control flag. This flag specifies the type of SPI
+ transaction. For control commands, this bit shall be ‘0’.
+ 0 - Control command
+ 1 - Data chunk
+
+HDRB (Bit 30) - Received Header Bad. When set by the MAC-PHY, indicates
+ that a header was received with a parity error. The SPI
+ host should always clear this bit. The MAC-PHY ignores the
+ HDRB value sent by the SPI host on MOSI.
+
+WNR (Bit 29) - Write-Not-Read. This bit indicates if data is to be written
+ to registers (when set) or read from registers
+ (when clear).
+
+AID (Bit 28) - Address Increment Disable. When clear, the address will be
+ automatically post-incremented by one following each
+ register read or write. When set, address auto increment is
+ disabled allowing successive reads and writes to occur at
+ the same register address.
+
+MMS (Bit 27..24) - Memory Map Selector. This field selects the specific
+ register memory map to access.
+
+ADDR (Bit 23..8) - Address. Address of the first register within the
+ selected memory map to access.
+
+LEN (Bit 7..1) - Length. Specifies the number of registers to read/write.
+ This field is interpreted as the number of registers
+ minus 1 allowing for up to 128 consecutive registers read
+ or written starting at the address specified in ADDR. A
+ length of zero shall read or write a single register.
+
+P (Bit 0) - Parity. Parity bit calculated over the control command header.
+ Method used is odd parity.
+
+Control transactions consist of one or more control commands. Control
+commands are used by the SPI host to read and write registers within the
+MAC-PHY. Each control commands are composed of a 4 bytes control command
+header followed by register write data in case of control write command.
+
+The MAC-PHY ignores the final 4 bytes of data from the SPI host at the end
+of the control write command. The control write command is also echoed
+from the MAC-PHY back to the SPI host to identify which register write
+failed in case of any bus errors. The echoed Control write command will
+have the first 4 bytes unused value to be ignored by the SPI host
+followed by 4 bytes echoed control header followed by echoed register
+write data. Control write commands can write either a single register or
+multiple consecutive registers. When multiple consecutive registers are
+written, the address is automatically post-incremented by the MAC-PHY.
+Writing to any unimplemented or undefined registers shall be ignored and
+yield no effect.
+
+The MAC-PHY ignores all data from the SPI host following the control
+header for the remainder of the control read command. The control read
+command is also echoed from the MAC-PHY back to the SPI host to identify
+which register read is failed in case of any bus errors. The echoed
+Control read command will have the first 4 bytes of unused value to be
+ignored by the SPI host followed by 4 bytes echoed control header followed
+by register read data. Control read commands can read either a single
+register or multiple consecutive registers. When multiple consecutive
+registers are read, the address is automatically post-incremented by the
+MAC-PHY. Reading any unimplemented or undefined registers shall return
+zero.
+
+Device drivers API
+==================
+
+The include/linux/oa_tc6.h defines the following functions:
+
+.. c:function:: struct oa_tc6 *oa_tc6_init(struct spi_device *spi, \
+ struct net_device *netdev)
+
+Initialize OA TC6 lib.
+
+.. c:function:: void oa_tc6_exit(struct oa_tc6 *tc6)
+
+Free allocated OA TC6 lib.
+
+.. c:function:: int oa_tc6_write_register(struct oa_tc6 *tc6, u32 address, \
+ u32 value)
+
+Write a single register in the MAC-PHY.
+
+.. c:function:: int oa_tc6_write_registers(struct oa_tc6 *tc6, u32 address, \
+ u32 value[], u8 length)
+
+Writing multiple consecutive registers starting from @address in the MAC-PHY.
+Maximum of 128 consecutive registers can be written starting at @address.
+
+.. c:function:: int oa_tc6_read_register(struct oa_tc6 *tc6, u32 address, \
+ u32 *value)
+
+Read a single register in the MAC-PHY.
+
+.. c:function:: int oa_tc6_read_registers(struct oa_tc6 *tc6, u32 address, \
+ u32 value[], u8 length)
+
+Reading multiple consecutive registers starting from @address in the MAC-PHY.
+Maximum of 128 consecutive registers can be read starting at @address.
+
+.. c:function:: netdev_tx_t oa_tc6_start_xmit(struct oa_tc6 *tc6, \
+ struct sk_buff *skb);
+
+The transmit Ethernet frame in the skb is or going to be transmitted through
+the MAC-PHY.
+
+.. c:function:: int oa_tc6_zero_align_receive_frame_enable(struct oa_tc6 *tc6);
+
+Zero align receive frame feature can be enabled to align all receive ethernet
+frames data to start at the beginning of any receive data chunk payload with a
+start word offset (SWO) of zero.
diff --git a/Documentation/networking/phy-link-topology.rst b/Documentation/networking/phy-link-topology.rst
new file mode 100644
index 000000000000..4dec5d7d6513
--- /dev/null
+++ b/Documentation/networking/phy-link-topology.rst
@@ -0,0 +1,121 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _phy_link_topology:
+
+=================
+PHY link topology
+=================
+
+Overview
+========
+
+The PHY link topology representation in the networking stack aims at representing
+the hardware layout for any given Ethernet link.
+
+An Ethernet interface from userspace's point of view is nothing but a
+:c:type:`struct net_device <net_device>`, which exposes configuration options
+through the legacy ioctls and the ethtool netlink commands. The base assumption
+when designing these configuration APIs were that the link looks something like ::
+
+ +-----------------------+ +----------+ +--------------+
+ | Ethernet Controller / | | Ethernet | | Connector / |
+ | MAC | ------ | PHY | ---- | Port | ---... to LP
+ +-----------------------+ +----------+ +--------------+
+ struct net_device struct phy_device
+
+Commands that needs to configure the PHY will go through the net_device.phydev
+field to reach the PHY and perform the relevant configuration.
+
+This assumption falls apart in more complex topologies that can arise when,
+for example, using SFP transceivers (although that's not the only specific case).
+
+Here, we have 2 basic scenarios. Either the MAC is able to output a serialized
+interface, that can directly be fed to an SFP cage, such as SGMII, 1000BaseX,
+10GBaseR, etc.
+
+The link topology then looks like this (when an SFP module is inserted) ::
+
+ +-----+ SGMII +------------+
+ | MAC | ------- | SFP Module |
+ +-----+ +------------+
+
+Knowing that some modules embed a PHY, the actual link is more like ::
+
+ +-----+ SGMII +--------------+
+ | MAC | -------- | PHY (on SFP) |
+ +-----+ +--------------+
+
+In this case, the SFP PHY is handled by phylib, and registered by phylink through
+its SFP upstream ops.
+
+Now some Ethernet controllers aren't able to output a serialized interface, so
+we can't directly connect them to an SFP cage. However, some PHYs can be used
+as media-converters, to translate the non-serialized MAC MII interface to a
+serialized MII interface fed to the SFP ::
+
+ +-----+ RGMII +-----------------------+ SGMII +--------------+
+ | MAC | ------- | PHY (media converter) | ------- | PHY (on SFP) |
+ +-----+ +-----------------------+ +--------------+
+
+This is where the model of having a single net_device.phydev pointer shows its
+limitations, as we now have 2 PHYs on the link.
+
+The phy_link topology framework aims at providing a way to keep track of every
+PHY on the link, for use by both kernel drivers and subsystems, but also to
+report the topology to userspace, allowing to target individual PHYs in configuration
+commands.
+
+API
+===
+
+The :c:type:`struct phy_link_topology <phy_link_topology>` is a per-netdevice
+resource, that gets initialized at netdevice creation. Once it's initialized,
+it is then possible to register PHYs to the topology through :
+
+:c:func:`phy_link_topo_add_phy`
+
+Besides registering the PHY to the topology, this call will also assign a unique
+index to the PHY, which can then be reported to userspace to refer to this PHY
+(akin to the ifindex). This index is a u32, ranging from 1 to U32_MAX. The value
+0 is reserved to indicate the PHY doesn't belong to any topology yet.
+
+The PHY can then be removed from the topology through
+
+:c:func:`phy_link_topo_del_phy`
+
+These function are already hooked into the phylib subsystem, so all PHYs that
+are linked to a net_device through :c:func:`phy_attach_direct` will automatically
+join the netdev's topology.
+
+PHYs that are on a SFP module will also be automatically registered IF the SFP
+upstream is phylink (so, no media-converter).
+
+PHY drivers that can be used as SFP upstream need to call :c:func:`phy_sfp_attach_phy`
+and :c:func:`phy_sfp_detach_phy`, which can be used as a
+.attach_phy / .detach_phy implementation for the
+:c:type:`struct sfp_upstream_ops <sfp_upstream_ops>`.
+
+UAPI
+====
+
+There exist a set of netlink commands to query the link topology from userspace,
+see ``Documentation/networking/ethtool-netlink.rst``.
+
+The whole point of having a topology representation is to assign the phyindex
+field in :c:type:`struct phy_device <phy_device>`. This index is reported to
+userspace using the ``ETHTOOL_MSG_PHY_GET`` ethtnl command. Performing a DUMP operation
+will result in all PHYs from all net_device being listed. The DUMP command
+accepts either a ``ETHTOOL_A_HEADER_DEV_INDEX`` or ``ETHTOOL_A_HEADER_DEV_NAME``
+to be passed in the request to filter the DUMP to a single net_device.
+
+The retrieved index can then be passed as a request parameter using the
+``ETHTOOL_A_HEADER_PHY_INDEX`` field in the following ethnl commands :
+
+* ``ETHTOOL_MSG_STRSET_GET`` to get the stats string set from a given PHY
+* ``ETHTOOL_MSG_CABLE_TEST_ACT`` and ``ETHTOOL_MSG_CABLE_TEST_ACT``, to perform
+ cable testing on a given PHY on the link (most likely the outermost PHY)
+* ``ETHTOOL_MSG_PSE_SET`` and ``ETHTOOL_MSG_PSE_GET`` for PHY-controlled PoE and PSE settings
+* ``ETHTOOL_MSG_PLCA_GET_CFG``, ``ETHTOOL_MSG_PLCA_SET_CFG`` and ``ETHTOOL_MSG_PLCA_GET_STATUS``
+ to set the PLCA (Physical Layer Collision Avoidance) parameters
+
+Note that the PHY index can be passed to other requests, which will silently
+ignore it if present and irrelevant.
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index 1283240d7620..f64641417c54 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -327,6 +327,12 @@ Some of the interface modes are described below:
This is the Penta SGMII mode, it is similar to QSGMII but it combines 5
SGMII lines into a single link compared to 4 on QSGMII.
+``PHY_INTERFACE_MODE_10G_QXGMII``
+ Represents the 10G-QXGMII PHY-MAC interface as defined by the Cisco USXGMII
+ Multiport Copper Interface document. It supports 4 ports over a 10.3125 GHz
+ SerDes lane, each port having speeds of 2.5G / 1G / 100M / 10M achieved
+ through symbol replication. The PCS expects the standard USXGMII code word.
+
Pause frames / flow control
===========================
diff --git a/Documentation/networking/sriov.rst b/Documentation/networking/sriov.rst
new file mode 100644
index 000000000000..5deb4ff3154f
--- /dev/null
+++ b/Documentation/networking/sriov.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+NIC SR-IOV APIs
+===============
+
+Modern NICs are strongly encouraged to focus on implementing the ``switchdev``
+model (see :ref:`switchdev`) to configure forwarding and security of SR-IOV
+functionality.
+
+Legacy API
+==========
+
+The old SR-IOV API is implemented in ``rtnetlink`` Netlink family as part of
+the ``RTM_GETLINK`` and ``RTM_SETLINK`` commands. On the driver side
+it consists of a number of ``ndo_set_vf_*`` and ``ndo_get_vf_*`` callbacks.
+
+Since the legacy APIs do not integrate well with the rest of the stack
+the API is considered frozen; no new functionality or extensions
+will be accepted. New drivers should not implement the uncommon callbacks;
+namely the following callbacks are off limits:
+
+ - ``ndo_get_vf_port``
+ - ``ndo_set_vf_port``
+ - ``ndo_set_vf_rss_query_en``
diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst
index 758f1dae3fce..f355f0166f1b 100644
--- a/Documentation/networking/switchdev.rst
+++ b/Documentation/networking/switchdev.rst
@@ -137,10 +137,10 @@ would be sub-port 0 on port 1 on switch 1.
Port Features
^^^^^^^^^^^^^
-NETIF_F_NETNS_LOCAL
+dev->netns_local
If the switchdev driver (and device) only supports offloading of the default
-network namespace (netns), the driver should set this feature flag to prevent
+network namespace (netns), the driver should set this private flag to prevent
the port netdev from being moved out of the default netns. A netns-aware
driver/device would not set this flag and be responsible for partitioning
hardware to preserve netns containment. This means hardware cannot forward
diff --git a/Documentation/networking/tcp_ao.rst b/Documentation/networking/tcp_ao.rst
index 8a58321acce7..e96e62d1dab3 100644
--- a/Documentation/networking/tcp_ao.rst
+++ b/Documentation/networking/tcp_ao.rst
@@ -337,6 +337,15 @@ TCP-AO per-socket counters are also duplicated with per-netns counters,
exposed with SNMP. Those are ``TCPAOGood``, ``TCPAOBad``, ``TCPAOKeyNotFound``,
``TCPAORequired`` and ``TCPAODroppedIcmps``.
+For monitoring purposes, there are following TCP-AO trace events:
+``tcp_hash_bad_header``, ``tcp_hash_ao_required``, ``tcp_ao_handshake_failure``,
+``tcp_ao_wrong_maclen``, ``tcp_ao_wrong_maclen``, ``tcp_ao_key_not_found``,
+``tcp_ao_rnext_request``, ``tcp_ao_synack_no_key``, ``tcp_ao_snd_sne_update``,
+``tcp_ao_rcv_sne_update``. It's possible to separately enable any of them and
+one can filter them by net-namespace, 4-tuple, family, L3 index, and TCP header
+flags. If a segment has a TCP-AO header, the filters may also include
+keyid, rnext, and maclen. SNE updates include the rolled-over numbers.
+
RFC 5925 very permissively specifies how TCP port matching can be done for
MKTs::
diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index 5e93cd71f99f..8199e6917671 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -158,7 +158,8 @@ SOF_TIMESTAMPING_SYS_HARDWARE:
SOF_TIMESTAMPING_RAW_HARDWARE:
Report hardware timestamps as generated by
- SOF_TIMESTAMPING_TX_HARDWARE when available.
+ SOF_TIMESTAMPING_TX_HARDWARE or SOF_TIMESTAMPING_RX_HARDWARE
+ when available.
1.3.3 Timestamp Options
@@ -266,6 +267,23 @@ SOF_TIMESTAMPING_OPT_TX_SWHW:
two separate messages will be looped to the socket's error queue,
each containing just one timestamp.
+SOF_TIMESTAMPING_OPT_RX_FILTER:
+ Filter out spurious receive timestamps: report a receive timestamp
+ only if the matching timestamp generation flag is enabled.
+
+ Receive timestamps are generated early in the ingress path, before a
+ packet's destination socket is known. If any socket enables receive
+ timestamps, packets for all socket will receive timestamped packets.
+ Including those that request timestamp reporting with
+ SOF_TIMESTAMPING_SOFTWARE and/or SOF_TIMESTAMPING_RAW_HARDWARE, but
+ do not request receive timestamp generation. This can happen when
+ requesting transmit timestamps only.
+
+ Receiving spurious timestamps is generally benign. A process can
+ ignore the unexpected non-zero value. But it makes behavior subtly
+ dependent on other sockets. This flag isolates the socket for more
+ deterministic behavior.
+
New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to
disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate
regardless of the setting of sysctl net.core.tstamp_allow_data.
diff --git a/Documentation/networking/tproxy.rst b/Documentation/networking/tproxy.rst
index 00dc3a1a66b4..7f7c1ff6f159 100644
--- a/Documentation/networking/tproxy.rst
+++ b/Documentation/networking/tproxy.rst
@@ -17,7 +17,7 @@ The idea is that you identify packets with destination address matching a local
socket on your box, set the packet mark to a certain value::
# iptables -t mangle -N DIVERT
- # iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
+ # iptables -t mangle -A PREROUTING -p tcp -m socket --transparent -j DIVERT
# iptables -t mangle -A DIVERT -j MARK --set-mark 1
# iptables -t mangle -A DIVERT -j ACCEPT
diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
index bd033fe95cca..e76b0cfc32f7 100644
--- a/Documentation/networking/xsk-tx-metadata.rst
+++ b/Documentation/networking/xsk-tx-metadata.rst
@@ -11,12 +11,16 @@ metadata on the receive side.
General Design
==============
-The headroom for the metadata is reserved via ``tx_metadata_len`` in
-``struct xdp_umem_reg``. The metadata length is therefore the same for
-every socket that shares the same umem. The metadata layout is a fixed UAPI,
-refer to ``union xsk_tx_metadata`` in ``include/uapi/linux/if_xdp.h``.
-Thus, generally, the ``tx_metadata_len`` field above should contain
-``sizeof(union xsk_tx_metadata)``.
+The headroom for the metadata is reserved via ``tx_metadata_len`` and
+``XDP_UMEM_TX_METADATA_LEN`` flag in ``struct xdp_umem_reg``. The metadata
+length is therefore the same for every socket that shares the same umem.
+The metadata layout is a fixed UAPI, refer to ``union xsk_tx_metadata`` in
+``include/uapi/linux/if_xdp.h``. Thus, generally, the ``tx_metadata_len``
+field above should contain ``sizeof(union xsk_tx_metadata)``.
+
+Note that in the original implementation the ``XDP_UMEM_TX_METADATA_LEN``
+flag was not required. Applications might attempt to create a umem
+with a flag first and if it fails, do another attempt without a flag.
The headroom and the metadata itself should be located right before
``xdp_desc->addr`` in the umem frame. Within a frame, the metadata