Diffstat (limited to 'Documentation/networking')
20 files changed, 619 insertions, 432 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index 1cc35de336a4..dceeb0d763aa 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -462,8 +462,92 @@ XDP_OPTIONS getsockopt Gets options from an XDP socket. The only one supported so far is XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not. +Multi-Buffer Support +==================== + +With multi-buffer support, programs using AF_XDP sockets can receive +and transmit packets consisting of multiple buffers both in copy and +zero-copy mode. For example, a packet can consist of two +frames/buffers, one with the header and the other one with the data, +or a 9K Ethernet jumbo frame can be constructed by chaining together +three 4K frames. + +Some definitions: + +* A packet consists of one or more frames + +* A descriptor in one of the AF_XDP rings always refers to a single + frame. In the case the packet consists of a single frame, the + descriptor refers to the whole packet. + +To enable multi-buffer support for an AF_XDP socket, use the new bind +flag XDP_USE_SG. If this is not provided, all multi-buffer packets +will be dropped just as before. Note that the XDP program loaded also +needs to be in multi-buffer mode. This can be accomplished by using +"xdp.frags" as the section name of the XDP program used. + +To represent a packet consisting of multiple frames, a new flag called +XDP_PKT_CONTD is introduced in the options field of the Rx and Tx +descriptors. If it is true (1) the packet continues with the next +descriptor and if it is false (0) it means this is the last descriptor +of the packet. Why the reverse logic of end-of-packet (eop) flag found +in many NICs? Just to preserve compatibility with non-multi-buffer +applications that have this bit set to false for all packets on Rx, +and the apps set the options field to zero for Tx, as anything else +will be treated as an invalid descriptor. 
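The XDP_PKT_CONTD semantics described in the hunk above can be sketched in plain C. This is a minimal illustration, not part of the patch: `sketch_desc` and `SKETCH_PKT_CONTD` are hypothetical local stand-ins assumed to mirror `struct xdp_desc` and `XDP_PKT_CONTD` from linux/if_xdp.h, and real code should include that header instead. The helper counts complete packets in a batch of descriptors and reports how many trailing descriptors belong to a packet that has not ended yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, assumed to mirror linux/if_xdp.h. */
#define SKETCH_PKT_CONTD (1u << 0)

struct sketch_desc {
	uint64_t addr;    /* offset of the frame in the umem */
	uint32_t len;     /* bytes used in this frame */
	uint32_t options; /* SKETCH_PKT_CONTD set: packet continues */
};

/* A descriptor ends its packet when the continuation bit is clear. */
static int sketch_is_last_frag(const struct sketch_desc *d)
{
	return !(d->options & SKETCH_PKT_CONTD);
}

/*
 * Count the complete packets in descs[0..n). *partial receives the
 * number of trailing descriptors whose packet is still unfinished.
 */
static size_t sketch_count_packets(const struct sketch_desc *descs,
				   size_t n, size_t *partial)
{
	size_t pkts = 0, open = 0;

	assert(descs != NULL || n == 0);
	assert(partial != NULL);

	for (size_t i = 0; i < n; i++) {
		open++;
		if (sketch_is_last_frag(&descs[i])) {
			pkts++;
			open = 0;
		}
	}
	*partial = open;
	return pkts;
}
```

An application that never sets the bit sees every descriptor as a one-frame packet, which is the compatibility property the text gives as the reason for the inverted flag.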
+ +These are the semantics for producing packets onto AF_XDP Tx ring +consisting of multiple frames: + +* When an invalid descriptor is found, all the other + descriptors/frames of this packet are marked as invalid and not + completed. The next descriptor is treated as the start of a new + packet, even if this was not the intent (because we cannot guess + the intent). As before, if your program is producing invalid + descriptors you have a bug that must be fixed. + +* Zero length descriptors are treated as invalid descriptors. + +* For copy mode, the maximum supported number of frames in a packet is + equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all + descriptors accumulated so far are dropped and treated as + invalid. To produce an application that will work on any system + regardless of this config setting, limit the number of frags to 18, + as the minimum value of the config is 17. + +* For zero-copy mode, the limit is up to what the NIC HW + supports. Usually at least five on the NICs we have checked. We + consciously chose to not enforce a rigid limit (such as + CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have + resulted in copy actions under the hood to fit into what limit the + NIC supports. Kind of defeats the purpose of zero-copy mode. How to + probe for this limit is explained in the "probe for multi-buffer + support" section. + +On the Rx path in copy-mode, the xsk core copies the XDP data into +multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as +detailed before. Zero-copy mode works the same, though the data is not +copied. When the application gets a descriptor with the XDP_PKT_CONTD +flag set to one, it means that the packet consists of multiple buffers +and it continues with the next buffer in the following +descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it +means that this is the last buffer of the packet. 
AF_XDP guarantees +that only a complete packet (all frames in the packet) is sent to the +application. If there is not enough space in the AF_XDP Rx ring, all +frames of the packet will be dropped. + +If application reads a batch of descriptors, using for example the libxdp +interfaces, it is not guaranteed that the batch will end with a full +packet. It might end in the middle of a packet and the rest of the +buffers of that packet will arrive at the beginning of the next batch, +since the libxdp interface does not read the whole ring (unless you +have an enormous batch size or a very small ring size). + +An example program each for Rx and Tx multi-buffer support can be found +later in this document. + Usage -===== +----- In order to use AF_XDP sockets two parts are needed. The user-space application and the XDP program. For a complete setup and @@ -541,6 +625,131 @@ like this: But please use the libbpf functions as they are optimized and ready to use. Will make your life easier. +Usage Multi-Buffer Rx +--------------------- + +Here is a simple Rx path pseudo-code example (using libxdp interfaces +for simplicity). Error paths have been excluded to keep it short: + +.. 
code-block:: c + + void rx_packets(struct xsk_socket_info *xsk) + { + static bool new_packet = true; + u32 idx_rx = 0, idx_fq = 0; + static char *pkt; + + int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx); + + xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq); + + for (int i = 0; i < rcvd; i++) { + struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++); + char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr); + bool eop = !(desc->options & XDP_PKT_CONTD); + + if (new_packet) + pkt = frag; + else + add_frag_to_pkt(pkt, frag); + + if (eop) + process_pkt(pkt); + + new_packet = eop; + + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr; + } + + xsk_ring_prod__submit(&xsk->umem->fq, rcvd); + xsk_ring_cons__release(&xsk->rx, rcvd); + } + +Usage Multi-Buffer Tx +--------------------- + +Here is an example Tx path pseudo-code (using libxdp interfaces for +simplicity) ignoring that the umem is finite in size, and that we +eventually will run out of packets to send. Also assumes pkts.addr +points to a valid location in the umem. + +.. code-block:: c + + void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts, + int batch_size) + { + u32 idx, i, pkt_nb = 0; + + xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx); + + for (i = 0; i < batch_size;) { + u64 addr = pkts[pkt_nb].addr; + u32 len = pkts[pkt_nb].size; + + do { + struct xdp_desc *tx_desc; + + tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++); + tx_desc->addr = addr; + + if (len > xsk_frame_size) { + tx_desc->len = xsk_frame_size; + tx_desc->options = XDP_PKT_CONTD; + } else { + tx_desc->len = len; + tx_desc->options = 0; + pkt_nb++; + } + len -= tx_desc->len; + addr += xsk_frame_size; + + if (i == batch_size) { + /* Remember len, addr, pkt_nb for next iteration. + * Skipped for simplicity. 
+ */ + break; + } + } while (len); + } + + xsk_ring_prod__submit(&xsk->tx, i); + } + +Probing for Multi-Buffer Support +-------------------------------- + +To discover if a driver supports multi-buffer AF_XDP in SKB or DRV +mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to +query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for +querying for XDP multi-buffer support. If XDP supports multi-buffer in +a driver, then AF_XDP will also support that in SKB and DRV mode. + +To discover if a driver supports multi-buffer AF_XDP in zero-copy +mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY +flag. If it is set, it means that at least zero-copy is supported and +you should go and check the netlink attribute +NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer +value will be returned stating the max number of frags that are +supported by this device in zero-copy mode. These are the possible +return values: + +1: Multi-buffer for zero-copy is not supported by this device, as max + one fragment supported means that multi-buffer is not possible. + +>=2: Multi-buffer is supported in zero-copy mode for this device. The + returned number signifies the max number of frags supported. + +For an example on how these are used through libbpf, please take a +look at tools/testing/selftests/bpf/xskxceiver.c. + +Multi-Buffer Support for Zero-Copy Drivers +------------------------------------------ + +Zero-copy drivers usually use the batched APIs for Rx and Tx +processing. Note that the Tx batch API guarantees that it will provide +a batch of Tx descriptors that ends with full packet at the end. This +to facilitate extending a zero-copy driver with multi-buffer support. 
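The frame limits discussed above (18 frags as a portable copy-mode bound, and the probed NETDEV_A_DEV_XDP_ZC_MAX_SEGS value for zero-copy) can be checked before a packet is queued on the Tx ring. The following is a rough sketch, not part of the patch: the helper names and the `SKETCH_COPY_MODE_MAX_FRAGS` constant are hypothetical, and the caller is assumed to have obtained `max_frags` by other means, such as the netlink query the text describes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Portable copy-mode bound taken from the text: the minimum
 * CONFIG_MAX_SKB_FRAGS is 17, so 17 + 1 frames always fit.
 */
#define SKETCH_COPY_MODE_MAX_FRAGS 18u

/* Descriptors needed to carry len bytes in frames of frame_size bytes. */
static uint32_t sketch_frags_needed(uint32_t len, uint32_t frame_size)
{
	assert(frame_size > 0);
	if (len == 0)
		return 0; /* zero-length descriptors are invalid */
	return (len - 1) / frame_size + 1;
}

/*
 * Nonzero if the packet can be sent with at most max_frags frames.
 * For zero-copy, pass the probed max segs value; a value of 1 means
 * only single-frame packets are possible.
 */
static int sketch_packet_fits(uint32_t len, uint32_t frame_size,
			      uint32_t max_frags)
{
	uint32_t n = sketch_frags_needed(len, frame_size);

	return n > 0 && n <= max_frags;
}
```

With 4K frames, a 9000-byte jumbo frame needs three descriptors, so it fits under the copy-mode bound but not on a device that reports a max segs value of 1.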
+ Sample application ================== diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst index 28925e19622d..f7a73421eb76 100644 --- a/Documentation/networking/bonding.rst +++ b/Documentation/networking/bonding.rst @@ -1636,7 +1636,7 @@ your init script:: ----------------------------------------- This section applies to distros which use /etc/network/interfaces file -to describe network interface configuration, most notably Debian and it's +to describe network interface configuration, most notably Debian and its derivatives. The ifup and ifdown commands on Debian don't support bonding out of diff --git a/Documentation/networking/device_drivers/ethernet/google/gve.rst b/Documentation/networking/device_drivers/ethernet/google/gve.rst index 6d73ee78f3d7..31d621bca82e 100644 --- a/Documentation/networking/device_drivers/ethernet/google/gve.rst +++ b/Documentation/networking/device_drivers/ethernet/google/gve.rst @@ -52,6 +52,15 @@ Descriptor Formats GVE supports two descriptor formats: GQI and DQO. These two formats have entirely different descriptors, which will be described below. +Addressing Mode +------------------ +GVE supports two addressing modes: QPL and RDA. +QPL ("queue-page-list") mode communicates data through a set of +pre-registered pages. + +For RDA ("raw DMA addressing") mode, the set of pages is dynamic. +Therefore, the packet buffers can be anywhere in guest memory. + Registers --------- All registers are MMIO. 
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst index bfd233cfac35..1e196cb9ce25 100644 --- a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst +++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst @@ -332,3 +332,11 @@ Setup HTB offload # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1 # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7 + +4. Create tc classes with same priorities and different quantum:: + + # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 2 quantum 409600 + + # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 2 quantum 188416 + + # tc class add dev <interface> parent 1: classid 1:3 htb rate 10Gbit prio 2 quantum 32768 diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst index a395df9c2751..f69ee1ebee01 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst @@ -346,6 +346,24 @@ the software port. - The number of receive packets with CQE compression on ring i [#accel]_. - Acceleration + * - `rx[i]_arfs_add` + - The number of aRFS flow rules added to the device for direct RQ steering + on ring i [#accel]_. + - Acceleration + + * - `rx[i]_arfs_request_in` + - Number of flow rules that have been requested to move into ring i for + direct RQ steering [#accel]_. + - Acceleration + + * - `rx[i]_arfs_request_out` + - Number of flow rules that have been requested to move out of ring i [#accel]_. + - Acceleration + + * - `rx[i]_arfs_expired` + - Number of flow rules that have been expired and removed [#accel]_. 
+ - Acceleration + * - `rx[i]_arfs_err` - Number of flow rules that failed to be added to the flow table. - Error @@ -445,11 +463,6 @@ the software port. context. - Error - * - `rx[i]_xsk_arfs_err` - - aRFS (accelerated Receive Flow Steering) does not occur in the XSK RQ - context, so this counter should never increment. - - Error - * - `rx[i]_xdp_tx_xmit` - The number of packets forwarded back to the port due to XDP program `XDP_TX` action (bouncing). these packets are not counted by other @@ -683,6 +696,12 @@ the software port. time protocol. - Error + * - `ptp_cq[i]_late_cqe` + - Number of times a CQE has been delivered on the PTP timestamping CQ when + the CQE was not expected since a certain amount of time had elapsed where + the device typically ensures not posting the CQE. + - Error + .. [#ring_global] The corresponding ring and global counters do not share the same name (i.e. do not follow the common naming scheme). diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst deleted file mode 100644 index a4edf908b707..000000000000 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst +++ /dev/null @@ -1,313 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB -.. include:: <isonum.txt> - -======= -Devlink -======= - -:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. - -Contents -======== - -- `Info`_ -- `Parameters`_ -- `Health reporters`_ - -Info -==== - -The devlink info reports the running and stored firmware versions on device. -It also prints the device PSID which represents the HCA board type ID. 
- -User command example:: - - $ devlink dev info pci/0000:00:06.0 - pci/0000:00:06.0: - driver mlx5_core - versions: - fixed: - fw.psid MT_0000000009 - running: - fw.version 16.26.0100 - stored: - fw.version 16.26.0100 - -Parameters -========== - -flow_steering_mode: Device flow steering mode ---------------------------------------------- -The flow steering mode parameter controls the flow steering mode of the driver. -Two modes are supported: - -1. 'dmfs' - Device managed flow steering. -2. 'smfs' - Software/Driver managed flow steering. - -In DMFS mode, the HW steering entities are created and managed through the -Firmware. -In SMFS mode, the HW steering entities are created and managed though by -the driver directly into hardware without firmware intervention. - -SMFS mode is faster and provides better rule insertion rate compared to default DMFS mode. - -User command examples: - -- Set SMFS flow steering mode:: - - $ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime - -- Read device flow steering mode:: - - $ devlink dev param show pci/0000:06:00.0 name flow_steering_mode - pci/0000:06:00.0: - name flow_steering_mode type driver-specific - values: - cmode runtime value smfs - -enable_roce: RoCE enablement state ----------------------------------- -If the device supports RoCE disablement, RoCE enablement state controls device -support for RoCE capability. Otherwise, the control occurs in the driver stack. -When RoCE is disabled at the driver level, only raw ethernet QPs are supported. - -To change RoCE enablement state, a user must change the driverinit cmode value -and run devlink reload. 
- -User command examples: - -- Disable RoCE:: - - $ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit - $ devlink dev reload pci/0000:06:00.0 - -- Read RoCE enablement state:: - - $ devlink dev param show pci/0000:06:00.0 name enable_roce - pci/0000:06:00.0: - name enable_roce type generic - values: - cmode driverinit value true - -esw_port_metadata: Eswitch port metadata state ----------------------------------------------- -When applicable, disabling eswitch metadata can increase packet rate -up to 20% depending on the use case and packet sizes. - -Eswitch port metadata state controls whether to internally tag packets with -metadata. Metadata tagging must be enabled for multi-port RoCE, failover -between representors and stacked devices. -By default metadata is enabled on the supported devices in E-switch. -Metadata is applicable only for E-switch in switchdev mode and -users may disable it when NONE of the below use cases will be in use: - -1. HCA is in Dual/multi-port RoCE mode. -2. VF/SF representor bonding (Usually used for Live migration) -3. Stacked devices - -When metadata is disabled, the above use cases will fail to initialize if -users try to enable them. - -- Show eswitch port metadata:: - - $ devlink dev param show pci/0000:06:00.0 name esw_port_metadata - pci/0000:06:00.0: - name esw_port_metadata type driver-specific - values: - cmode runtime value true - -- Disable eswitch port metadata:: - - $ devlink dev param set pci/0000:06:00.0 name esw_port_metadata value false cmode runtime - -- Change eswitch mode to switchdev mode where after choosing the metadata value:: - - $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev - -hairpin_num_queues: Number of hairpin queues --------------------------------------------- -We refer to a TC NIC rule that involves forwarding as "hairpin". - -Hairpin queues are mlx5 hardware specific implementation for hardware -forwarding of such packets. 
- -- Show the number of hairpin queues:: - - $ devlink dev param show pci/0000:06:00.0 name hairpin_num_queues - pci/0000:06:00.0: - name hairpin_num_queues type driver-specific - values: - cmode driverinit value 2 - -- Change the number of hairpin queues:: - - $ devlink dev param set pci/0000:06:00.0 name hairpin_num_queues value 4 cmode driverinit - -hairpin_queue_size: Size of the hairpin queues ----------------------------------------------- -Control the size of the hairpin queues. - -- Show the size of the hairpin queues:: - - $ devlink dev param show pci/0000:06:00.0 name hairpin_queue_size - pci/0000:06:00.0: - name hairpin_queue_size type driver-specific - values: - cmode driverinit value 1024 - -- Change the size (in packets) of the hairpin queues:: - - $ devlink dev param set pci/0000:06:00.0 name hairpin_queue_size value 512 cmode driverinit - -Health reporters -================ - -tx reporter ------------ -The tx reporter is responsible for reporting and recovering of the following two error scenarios: - -- tx timeout - Report on kernel tx timeout detection. - Recover by searching lost interrupts. -- tx error completion - Report on error tx completion. - Recover by flushing the tx queue and reset it. - -tx reporter also support on demand diagnose callback, on which it provides -real time information of its send queues status. - -User commands examples: - -- Diagnose send queues status:: - - $ devlink health diagnose pci/0000:82:00.0 reporter tx - -.. note:: - This command has valid output only when interface is up, otherwise the command has empty output. 
- -- Show number of tx errors indicated, number of recover flows ended successfully, - is autorecover enabled and graceful period from last recover:: - - $ devlink health show pci/0000:82:00.0 reporter tx - -rx reporter ------------ -The rx reporter is responsible for reporting and recovering of the following two error scenarios: - -- rx queues' initialization (population) timeout - Population of rx queues' descriptors on ring initialization is done - in napi context via triggering an irq. In case of a failure to get - the minimum amount of descriptors, a timeout would occur, and - descriptors could be recovered by polling the EQ (Event Queue). -- rx completions with errors (reported by HW on interrupt context) - Report on rx completion error. - Recover (if needed) by flushing the related queue and reset it. - -rx reporter also supports on demand diagnose callback, on which it -provides real time information of its receive queues' status. - -- Diagnose rx queues' status and corresponding completion queue:: - - $ devlink health diagnose pci/0000:82:00.0 reporter rx - -NOTE: This command has valid output only when interface is up. Otherwise, the command has empty output. - -- Show number of rx errors indicated, number of recover flows ended successfully, - is autorecover enabled, and graceful period from last recover:: - - $ devlink health show pci/0000:82:00.0 reporter rx - -fw reporter ------------ -The fw reporter implements `diagnose` and `dump` callbacks. -It follows symptoms of fw error such as fw syndrome by triggering -fw core dump and storing it into the dump buffer. -The fw reporter diagnose command can be triggered any time by the user to check -current fw status. - -User commands examples: - -- Check fw heath status:: - - $ devlink health diagnose pci/0000:82:00.0 reporter fw - -- Read FW core dump if already stored or trigger new one:: - - $ devlink health dump show pci/0000:82:00.0 reporter fw - -.. 
note:: - This command can run only on the PF which has fw tracer ownership, - running it on other PF or any VF will return "Operation not permitted". - -fw fatal reporter ------------------ -The fw fatal reporter implements `dump` and `recover` callbacks. -It follows fatal errors indications by CR-space dump and recover flow. -The CR-space dump uses vsc interface which is valid even if the FW command -interface is not functional, which is the case in most FW fatal errors. -The recover function runs recover flow which reloads the driver and triggers fw -reset if needed. -On firmware error, the health buffer is dumped into the dmesg. The log -level is derived from the error's severity (given in health buffer). - -User commands examples: - -- Run fw recover flow manually:: - - $ devlink health recover pci/0000:82:00.0 reporter fw_fatal - -- Read FW CR-space dump if already stored or trigger new one:: - - $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal - -.. note:: - This command can run only on PF. - -vnic reporter -------------- -The vnic reporter implements only the `diagnose` callback. -It is responsible for querying the vnic diagnostic counters from fw and displaying -them in realtime. - -Description of the vnic counters: - -- total_q_under_processor_handle - number of queues in an error state due to - an async error or errored command. -- send_queue_priority_update_flow - number of QP/SQ priority/SL update events. -- cq_overrun - number of times CQ entered an error state due to an overflow. -- async_eq_overrun - number of times an EQ mapped to async events was overrun. - comp_eq_overrun number of times an EQ mapped to completion events was - overrun. -- quota_exceeded_command - number of commands issued and failed due to quota exceeded. -- invalid_command - number of commands issued and failed dues to any reason other than quota - exceeded. 
-- nic_receive_steering_discard - number of packets that completed RX flow - steering but were discarded due to a mismatch in flow table. -- generated_pkt_steering_fail - number of packets generated by the VNIC experiencing unexpected steering - failure (at any point in steering flow). -- handled_pkt_steering_fail - number of packets handled by the VNIC experiencing unexpected steering - failure (at any point in steering flow owned by the VNIC, including the FDB - for the eswitch owner). - -User commands examples: - -- Diagnose PF/VF vnic counters:: - - $ devlink health diagnose pci/0000:82:00.1 reporter vnic - -- Diagnose representor vnic counters (performed by supplying devlink port of the - representor, which can be obtained via devlink port command):: - - $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic - -.. note:: - This command can run over all interfaces such as PF/VF and representor ports. diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst index 3fdcd6b61ccf..581a91caa579 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst @@ -13,7 +13,6 @@ Contents: :maxdepth: 2 kconfig - devlink switchdev tracepoints counters diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst index 43b1f7e87ec4..0a42c3395ffa 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst @@ -36,7 +36,7 @@ Enabling the driver and kconfig options **CONFIG_MLX5_CORE_EN_DCB=(y/n)**: -| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_. 
+| Enables `Data Center Bridging (DCB) Support <https://enterprise-support.nvidia.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_. **CONFIG_MLX5_CORE_IPOIB=(y/n)** @@ -59,12 +59,12 @@ Enabling the driver and kconfig options **CONFIG_MLX5_EN_ARFS=(y/n)** | Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering. -| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4 +| https://enterprise-support.nvidia.com/s/article/howto-configure-arfs-on-connectx-4 **CONFIG_MLX5_EN_IPSEC=(y/n)** -| Enables `IPSec XFRM cryptography-offload acceleration <https://support.mellanox.com/s/article/ConnectX-6DX-Bluefield-2-IPsec-HW-Full-Offload-Configuration-Guide>`_. +| Enables :ref:`IPSec XFRM cryptography-offload acceleration <xfrm_device>`. **CONFIG_MLX5_EN_MACSEC=(y/n)** @@ -87,8 +87,8 @@ Enabling the driver and kconfig options | Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering | and switching for the enabled VFs and PF in two available modes: -| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_. -| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_. +| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://enterprise-support.nvidia.com/s/article/HowTo-Configure-SR-IOV-for-ConnectX-4-ConnectX-5-ConnectX-6-with-KVM-Ethernet>`_. +| 2) :ref:`Switchdev mode (eswitch offloads) <switchdev>`. **CONFIG_MLX5_FPGA=(y/n)** @@ -101,13 +101,13 @@ Enabling the driver and kconfig options **CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko) -| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support. 
+| Provides low-level InfiniBand/RDMA and `RoCE <https://enterprise-support.nvidia.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support. **CONFIG_MLX5_MPFS=(y/n)** | Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC. -| MPFs is required for when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing +| MPFs is required for when `Multi-Host <https://www.nvidia.com/en-us/networking/multi-host/>`_ configuration is enabled to allow passing | user configured unicast MAC addresses to the requesting PF. diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst index 6e3f5ee8b0d0..b617e93d7c2c 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst @@ -190,6 +190,26 @@ explicitly enable the VF migratable capability. mlx5 driver support devlink port function attr mechanism to setup migratable capability. (refer to Documentation/networking/devlink/devlink-port.rst) +IPsec crypto capability setup +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +User who wants mlx5 PCI VFs to be able to perform IPsec crypto offloading need +to explicitly enable the VF ipsec_crypto capability. Enabling IPsec capability +for VFs is supported starting with ConnectX6dx devices and above. When a VF has +IPsec capability enabled, any IPsec offloading is blocked on the PF. + +mlx5 driver support devlink port function attr mechanism to setup ipsec_crypto +capability. (refer to Documentation/networking/devlink/devlink-port.rst) + +IPsec packet capability setup +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +User who wants mlx5 PCI VFs to be able to perform IPsec packet offloading need +to explicitly enable the VF ipsec_packet capability. 
Enabling IPsec capability +for VFs is supported starting with ConnectX6dx devices and above. When a VF has +IPsec capability enabled, any IPsec offloading is blocked on the PF. + +mlx5 driver support devlink port function attr mechanism to setup ipsec_packet +capability. (refer to Documentation/networking/devlink/devlink-port.rst) + SF state setup -------------- diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst index 3da590953ce8..e33ad2401ad7 100644 --- a/Documentation/networking/devlink/devlink-port.rst +++ b/Documentation/networking/devlink/devlink-port.rst @@ -128,6 +128,12 @@ Users may also set the RoCE capability of the function using Users may also set the function as migratable using 'devlink port function set migratable' command. +Users may also set the IPsec crypto capability of the function using +`devlink port function set ipsec_crypto` command. + +Users may also set the IPsec packet capability of the function using +`devlink port function set ipsec_packet` command. + Function attributes =================== @@ -240,6 +246,55 @@ Attach VF to the VM. Start the VM. Perform live migration. +IPsec crypto capability setup +----------------------------- +When user enables IPsec crypto capability for a VF, user application can offload +XFRM state crypto operation (Encrypt/Decrypt) to this VF. + +When IPsec crypto capability is disabled (default) for a VF, the XFRM state is +processed in software by the kernel. 
+ +- Get IPsec crypto capability of the VF device:: + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 ipsec_crypto disabled + +- Set IPsec crypto capability of the VF device:: + + $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 ipsec_crypto enabled + +IPsec packet capability setup +----------------------------- +When user enables IPsec packet capability for a VF, user application can offload +XFRM state and policy crypto operation (Encrypt/Decrypt) to this VF, as well as +IPsec encapsulation. + +When IPsec packet capability is disabled (default) for a VF, the XFRM state and +policy is processed in software by the kernel. + +- Get IPsec packet capability of the VF device:: + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 ipsec_packet disabled + +- Set IPsec packet capability of the VF device:: + + $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 ipsec_packet enabled + Subfunction ============ @@ -321,9 +376,9 @@ API allows to configure following rate object's parameters: Allows for usage of Weighted Fair Queuing arbitration scheme among siblings. This arbitration scheme can be used simultaneously with the strict priority. As a node is configured with a higher rate it gets more - BW relative to it's siblings. Values are relative like a percentage + BW relative to its siblings. 
Values are relative like a percentage points, they basically tell how much BW should node take relative to - it's siblings. + its siblings. ``parent`` Parent node name. Parent node rate limits are considered as additional limits @@ -343,7 +398,7 @@ Arbitration flow from the high level: #. If group of nodes have the same priority perform WFQ arbitration on that subgroup. Use ``tx_weight`` as a parameter for this arbitration. -#. Select the winner node, and continue arbitration flow among it's children, +#. Select the winner node, and continue arbitration flow among its children, until leaf node is reached, and the winner is established. #. If all the nodes from the highest priority sub-group are satisfied, or diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 202798d6501e..702f204a3dbd 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -18,6 +18,11 @@ Parameters * - ``enable_roce`` - driverinit - Type: Boolean + + If the device supports RoCE disablement, RoCE enablement state controls + device support for RoCE capability. Otherwise, the control occurs in the + driver stack. When RoCE is disabled at the driver level, only raw + ethernet QPs are supported. * - ``io_eq_size`` - driverinit - The range is between 64 and 4096. @@ -48,6 +53,9 @@ parameters. * ``smfs`` Software managed flow steering. In SMFS mode, the HW steering entities are created and manage through the driver without firmware intervention. + + SMFS mode is faster and provides better rule insertion rate compared to + default DMFS mode. * - ``fdb_large_groups`` - u32 - driverinit @@ -71,7 +79,24 @@ parameters. deprecated. Default: disabled + * - ``esw_port_metadata`` + - Boolean + - runtime + - When applicable, disabling eswitch metadata can increase packet rate up + to 20% depending on the use case and packet sizes. 
+ + Eswitch port metadata state controls whether to internally tag packets + with metadata. Metadata tagging must be enabled for multi-port RoCE, + failover between representors and stacked devices. By default metadata is + enabled on the supported devices in E-switch. Metadata is applicable only + for E-switch in switchdev mode and users may disable it when NONE of the + below use cases will be in use: + 1. HCA is in Dual/multi-port RoCE mode. + 2. VF/SF representor bonding (Usually used for Live migration) + 3. Stacked devices + When metadata is disabled, the above use cases will fail to initialize if + users try to enable them. * - ``hairpin_num_queues`` - u32 - driverinit @@ -104,3 +129,160 @@ The ``mlx5`` driver reports the following versions * - ``fw.version`` - stored, running - Three digit major.minor.subminor firmware version number. + +Health reporters +================ + +tx reporter +----------- +The tx reporter is responsible for reporting and recovering of the following three error scenarios: + +- tx timeout + Report on kernel tx timeout detection. + Recover by searching lost interrupts. +- tx error completion + Report on error tx completion. + Recover by flushing the tx queue and resetting it. +- tx PTP port timestamping CQ unhealthy + Report too many CQEs never delivered on port ts CQ. + Recover by flushing and re-creating all PTP channels. + +tx reporter also supports on demand diagnose callback, on which it provides +real time information of its send queues status. + +User commands examples: + +- Diagnose send queues status:: + + $ devlink health diagnose pci/0000:82:00.0 reporter tx + +.. note:: + This command has valid output only when interface is up, otherwise the command has empty output.
+ +- Show number of tx errors indicated, number of recover flows ended successfully, + is autorecover enabled and graceful period from last recover:: + + $ devlink health show pci/0000:82:00.0 reporter tx + +rx reporter +----------- +The rx reporter is responsible for reporting and recovering of the following two error scenarios: + +- rx queues' initialization (population) timeout + Population of rx queues' descriptors on ring initialization is done + in napi context via triggering an irq. In case of a failure to get + the minimum amount of descriptors, a timeout would occur, and + descriptors could be recovered by polling the EQ (Event Queue). +- rx completions with errors (reported by HW on interrupt context) + Report on rx completion error. + Recover (if needed) by flushing the related queue and resetting it. + +rx reporter also supports on demand diagnose callback, on which it +provides real time information of its receive queues' status. + +- Diagnose rx queues' status and corresponding completion queue:: + + $ devlink health diagnose pci/0000:82:00.0 reporter rx + +.. note:: + This command has valid output only when interface is up. Otherwise, the command has empty output. + +- Show number of rx errors indicated, number of recover flows ended successfully, + is autorecover enabled, and graceful period from last recover:: + + $ devlink health show pci/0000:82:00.0 reporter rx + +fw reporter +----------- +The fw reporter implements `diagnose` and `dump` callbacks. +It follows symptoms of fw error such as fw syndrome by triggering +fw core dump and storing it into the dump buffer. +The fw reporter diagnose command can be triggered any time by the user to check +current fw status. + +User commands examples: + +- Check fw health status:: + + $ devlink health diagnose pci/0000:82:00.0 reporter fw + +- Read FW core dump if already stored or trigger new one:: + + $ devlink health dump show pci/0000:82:00.0 reporter fw + +..
note:: + This command can run only on the PF which has fw tracer ownership, + running it on other PF or any VF will return "Operation not permitted". + +fw fatal reporter +----------------- +The fw fatal reporter implements `dump` and `recover` callbacks. +It follows fatal error indications by CR-space dump and recover flow. +The CR-space dump uses vsc interface which is valid even if the FW command +interface is not functional, which is the case in most FW fatal errors. +The recover function runs recover flow which reloads the driver and triggers fw +reset if needed. +On firmware error, the health buffer is dumped into the dmesg. The log +level is derived from the error's severity (given in health buffer). + +User commands examples: + +- Run fw recover flow manually:: + + $ devlink health recover pci/0000:82:00.0 reporter fw_fatal + +- Read FW CR-space dump if already stored or trigger new one:: + + $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal + +.. note:: + This command can run only on PF. + +vnic reporter +------------- +The vnic reporter implements only the `diagnose` callback. +It is responsible for querying the vnic diagnostic counters from fw and displaying +them in realtime. + +Description of the vnic counters: + +- total_q_under_processor_handle + number of queues in an error state due to + an async error or errored command. +- send_queue_priority_update_flow + number of QP/SQ priority/SL update events. +- cq_overrun + number of times CQ entered an error state due to an overflow. +- async_eq_overrun + number of times an EQ mapped to async events was overrun. +- comp_eq_overrun + number of times an EQ mapped to completion events was overrun. +- quota_exceeded_command + number of commands issued and failed due to quota exceeded. +- invalid_command + number of commands issued and failed due to any reason other than quota + exceeded.
+- nic_receive_steering_discard + number of packets that completed RX flow + steering but were discarded due to a mismatch in flow table. +- generated_pkt_steering_fail + number of packets generated by the VNIC experiencing unexpected steering + failure (at any point in steering flow). +- handled_pkt_steering_fail + number of packets handled by the VNIC experiencing unexpected steering + failure (at any point in steering flow owned by the VNIC, including the FDB + for the eswitch owner). + +User commands examples: + +- Diagnose PF/VF vnic counters:: + + $ devlink health diagnose pci/0000:82:00.1 reporter vnic + +- Diagnose representor vnic counters (performed by supplying devlink port of the + representor, which can be obtained via devlink port command):: + + $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic + +.. note:: + This command can run over all interfaces such as PF/VF and representor ports. diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 4a010a7cde7f..a66054d0763a 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -321,6 +321,7 @@ tcp_abort_on_overflow - BOOLEAN option can harm clients of your server. tcp_adv_win_scale - INTEGER + Obsolete since linux-6.6 Count buffering overhead as bytes/2^tcp_adv_win_scale (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), if it is <= 0. @@ -2287,6 +2288,14 @@ accept_ra_min_hop_limit - INTEGER Default: 1 +accept_ra_min_lft - INTEGER + Minimum acceptable lifetime value in Router Advertisement. + + RA sections with a lifetime less than this value shall be + ignored. Zero lifetimes stay unaffected. + + Default: 0 + accept_ra_pinfo - BOOLEAN Learn Prefix Information in Router Advertisement. 
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst index 213510698014..15f1919d640c 100644 --- a/Documentation/networking/mptcp-sysctl.rst +++ b/Documentation/networking/mptcp-sysctl.rst @@ -74,3 +74,11 @@ stale_loss_cnt - INTEGER This is a per-namespace sysctl. Default: 4 + +scheduler - STRING + Select the scheduler of your choice. + + Support for selection of different schedulers. This is a per-namespace + sysctl. + + Default: "default" diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst index a7a047742e93..7bf7b95c4f7a 100644 --- a/Documentation/networking/napi.rst +++ b/Documentation/networking/napi.rst @@ -65,15 +65,16 @@ argument - drivers can process completions for any number of Tx packets but should only process up to ``budget`` number of Rx packets. Rx processing is usually much more expensive. -In other words, it is recommended to ignore the budget argument when -performing TX buffer reclamation to ensure that the reclamation is not -arbitrarily bounded; however, it is required to honor the budget argument -for RX processing. +In other words for Rx processing the ``budget`` argument limits how many +packets driver can process in a single poll. Rx specific APIs like page +pool or XDP cannot be used at all when ``budget`` is 0. +skb Tx processing should happen regardless of the ``budget``, but if +the argument is 0 driver cannot call any XDP (or page pool) APIs. .. warning:: - The ``budget`` argument may be 0 if core tries to only process Tx completions - and no Rx packets. + The ``budget`` argument may be 0 if core tries to only process + skb Tx completions and no Rx or XDP packets. The poll method returns the amount of work done. If the driver still has outstanding work to do (e.g. 
``budget`` was exhausted) diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst index dd0518e002f6..7a9de0568e84 100644 --- a/Documentation/networking/netconsole.rst +++ b/Documentation/networking/netconsole.rst @@ -13,6 +13,8 @@ IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013 Extended console support by Tejun Heo <tj@kernel.org>, May 1 2015 +Release prepend support by Breno Leitao <leitao@debian.org>, Jul 7 2023 + Please send bug reports to Matt Mackall <mpm@selenic.com> Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com> @@ -34,10 +36,11 @@ Sender and receiver configuration: It takes a string configuration parameter "netconsole" in the following format:: - netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr] + netconsole=[+][r][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr] where + if present, enable extended console support + r if present, prepend kernel version (release) to the message src-port source for UDP packets (defaults to 6665) src-ip source IP to use (interface address) dev network interface (eth0) @@ -125,6 +128,7 @@ The interface exposes these parameters of a netconsole target to userspace: ============== ================================= ============ enabled Is this target currently enabled? (read-write) extended Extended mode enabled (read-write) + release Prepend kernel release to message (read-write) dev_name Local network interface name (read-write) local_port Source UDP port to use (read-write) remote_port Remote agent's UDP port (read-write) @@ -165,6 +169,11 @@ following format which is the same as /dev/kmsg:: <level>,<sequnum>,<timestamp>,<contflag>;<message text> +If 'r' (release) feature is enabled, the kernel release version is +prepended to the start of the message. 
Example:: + + 6.4.0,6,444,501151268,-;netconsole: network logging started + Non printable characters in <message text> are escaped using "\xff" notation. If the message contains optional dictionary, verbatim newline is used as the delimiter. diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst index 8b1045c3b59e..c383a394c665 100644 --- a/Documentation/networking/nf_conntrack-sysctl.rst +++ b/Documentation/networking/nf_conntrack-sysctl.rst @@ -178,10 +178,10 @@ nf_conntrack_sctp_timeout_established - INTEGER (seconds) Default is set to (hb_interval * path_max_retrans + rto_max) nf_conntrack_sctp_timeout_shutdown_sent - INTEGER (seconds) - default 0.3 + default 3 nf_conntrack_sctp_timeout_shutdown_recd - INTEGER (seconds) - default 0.3 + default 3 nf_conntrack_sctp_timeout_shutdown_ack_sent - INTEGER (seconds) default 3 diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst index c5da1a5d93de..30a3be3c48f3 100644 --- a/Documentation/networking/packet_mmap.rst +++ b/Documentation/networking/packet_mmap.rst @@ -755,7 +755,7 @@ AF_PACKET TPACKET_V3 example ============================ AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame -sizes by doing it's own memory management. It is based on blocks where polling +sizes by doing its own memory management. It is based on blocks where polling works on a per block basis instead of per ring as in TPACKET_V2 and predecessor. It is said that TPACKET_V3 brings the following benefits: diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst index 873efd97f822..215ebc92752c 100644 --- a/Documentation/networking/page_pool.rst +++ b/Documentation/networking/page_pool.rst @@ -4,22 +4,8 @@ Page Pool API ============= -The page_pool allocator is optimized for the XDP mode that uses one frame -per-page, but it can fallback on the regular page allocator APIs. 
- -Basic use involves replacing alloc_pages() calls with the -page_pool_alloc_pages() call. Drivers should use page_pool_dev_alloc_pages() -replacing dev_alloc_pages(). - -API keeps track of in-flight pages, in order to let API user know -when it is safe to free a page_pool object. Thus, API users -must run page_pool_release_page() when a page is leaving the page_pool or -call page_pool_put_page() where appropriate in order to maintain correct -accounting. - -API user must call page_pool_put_page() once on a page, as it -will either recycle the page, or in case of refcnt > 1, it will -release the DMA mapping and in-flight state accounting. +.. kernel-doc:: include/net/page_pool/helpers.h + :doc: page_pool allocator Architecture overview ===================== @@ -64,87 +50,68 @@ This lockless guarantee naturally comes from running under a NAPI softirq. The protection doesn't strictly have to be NAPI, any guarantee that allocating a page will cause no race conditions is enough. -* page_pool_create(): Create a pool. - * flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV - * order: 2^order pages on allocation - * pool_size: size of the ptr_ring - * nid: preferred NUMA node for allocation - * dev: struct device. Used on DMA operations - * dma_dir: DMA direction - * max_len: max DMA sync memory size - * offset: DMA address offset - -* page_pool_put_page(): The outcome of this depends on the page refcnt. If the - driver bumps the refcnt > 1 this will unmap the page. If the page refcnt is 1 - the allocator owns the page and will try to recycle it in one of the pool - caches. If PP_FLAG_DMA_SYNC_DEV is set, the page will be synced for_device - using dma_sync_single_range_for_device(). - -* page_pool_put_full_page(): Similar to page_pool_put_page(), but will DMA sync - for the entire memory area configured in area pool->max_len. 
- -* page_pool_recycle_direct(): Similar to page_pool_put_full_page() but caller - must guarantee safe context (e.g NAPI), since it will recycle the page - directly into the pool fast cache. - -* page_pool_release_page(): Unmap the page (if mapped) and account for it on - in-flight counters. - -* page_pool_dev_alloc_pages(): Get a page from the page allocator or page_pool - caches. - -* page_pool_get_dma_addr(): Retrieve the stored DMA address. - -* page_pool_get_dma_dir(): Retrieve the stored DMA direction. - -* page_pool_put_page_bulk(): Tries to refill a number of pages into the - ptr_ring cache holding ptr_ring producer lock. If the ptr_ring is full, - page_pool_put_page_bulk() will release leftover pages to the page allocator. - page_pool_put_page_bulk() is suitable to be run inside the driver NAPI tx - completion loop for the XDP_REDIRECT use case. - Please note the caller must not use data area after running - page_pool_put_page_bulk(), as this function overwrites it. - -* page_pool_get_stats(): Retrieve statistics about the page_pool. This API - is only available if the kernel has been configured with - ``CONFIG_PAGE_POOL_STATS=y``. A pointer to a caller allocated ``struct - page_pool_stats`` structure is passed to this API which is filled in. The - caller can then report those stats to the user (perhaps via ethtool, - debugfs, etc.). See below for an example usage of this API. +.. kernel-doc:: net/core/page_pool.c + :identifiers: page_pool_create + +.. kernel-doc:: include/net/page_pool/types.h + :identifiers: struct page_pool_params + +.. kernel-doc:: include/net/page_pool/helpers.h + :identifiers: page_pool_put_page page_pool_put_full_page + page_pool_recycle_direct page_pool_dev_alloc_pages + page_pool_get_dma_addr page_pool_get_dma_dir + +.. kernel-doc:: net/core/page_pool.c + :identifiers: page_pool_put_page_bulk page_pool_get_stats + +DMA sync +-------- +Driver is always responsible for syncing the pages for the CPU. 
+Drivers may choose to take care of syncing for the device as well +or set the ``PP_FLAG_DMA_SYNC_DEV`` flag to request that pages +allocated from the page pool are already synced for the device. + +If ``PP_FLAG_DMA_SYNC_DEV`` is set, the driver must inform the core what portion +of the buffer has to be synced. This allows the core to avoid syncing the entire +page when the driver knows that the device only accessed a portion of the page. + +Most drivers will reserve headroom in front of the frame. This part +of the buffer is not touched by the device, so to avoid syncing +it drivers can set the ``offset`` field in struct page_pool_params +appropriately. + +For pages recycled on the XDP xmit and skb paths the page pool will +use the ``max_len`` member of struct page_pool_params to decide how +much of the page needs to be synced (starting at ``offset``). +When directly freeing pages in the driver (page_pool_put_page()) +the ``dma_sync_size`` argument specifies how much of the buffer needs +to be synced. + +If in doubt set ``offset`` to 0, ``max_len`` to ``PAGE_SIZE`` and +pass -1 as ``dma_sync_size``. That combination of arguments is always +correct. + +Note that the syncing parameters are for the entire page. +This is important to remember when using fragments (``PP_FLAG_PAGE_FRAG``), +where allocated buffers may be smaller than a full page. +Unless the driver author really understands page pool internals +it's recommended to always use ``offset = 0``, ``max_len = PAGE_SIZE`` +with fragmented page pools. Stats API and structures ------------------------ If the kernel is configured with ``CONFIG_PAGE_POOL_STATS=y``, the API -``page_pool_get_stats()`` and structures described below are available. It -takes a pointer to a ``struct page_pool`` and a pointer to a ``struct -page_pool_stats`` allocated by the caller. +page_pool_get_stats() and structures described below are available.
+It takes a pointer to a ``struct page_pool`` and a pointer to a struct +page_pool_stats allocated by the caller. -The API will fill in the provided ``struct page_pool_stats`` with +The API will fill in the provided struct page_pool_stats with statistics about the page_pool. -The stats structure has the following fields:: - - struct page_pool_stats { - struct page_pool_alloc_stats alloc_stats; - struct page_pool_recycle_stats recycle_stats; - }; - - -The ``struct page_pool_alloc_stats`` has the following fields: - * ``fast``: successful fast path allocations - * ``slow``: slow path order-0 allocations - * ``slow_high_order``: slow path high order allocations - * ``empty``: ptr ring is empty, so a slow path allocation was forced. - * ``refill``: an allocation which triggered a refill of the cache - * ``waive``: pages obtained from the ptr ring that cannot be added to - the cache due to a NUMA mismatch. - -The ``struct page_pool_recycle_stats`` has the following fields: - * ``cached``: recycling placed page in the page pool cache - * ``cache_full``: page pool cache was full - * ``ring``: page placed into the ptr ring - * ``ring_full``: page released from page pool because the ptr ring was full - * ``released_refcnt``: page released (and not recycled) because refcnt > 1 +.. 
kernel-doc:: include/net/page_pool/types.h + :identifiers: struct page_pool_recycle_stats + struct page_pool_alloc_stats + struct page_pool_stats Coding examples =============== @@ -194,7 +161,7 @@ NAPI poller if XDP_DROP: page_pool_recycle_direct(page_pool, page); } else (packet_is_skb) { - page_pool_release_page(page_pool, page); + skb_mark_for_recycle(skb); new_page = page_pool_dev_alloc_pages(page_pool); } } diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index b7ac4c64cf67..1283240d7620 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -323,6 +323,10 @@ Some of the interface modes are described below: contrast with the 1000BASE-X phy mode used for Clause 38 and 39 PMDs, this interface mode has different autonegotiation and only supports full duplex. +``PHY_INTERFACE_MODE_PSGMII`` + This is the Penta SGMII mode, it is similar to QSGMII but it combines 5 + SGMII lines into a single link compared to 4 on QSGMII. + Pause frames / flow control =========================== diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst index 83abdfef4ec3..535077cbeb07 100644 --- a/Documentation/networking/xfrm_device.rst +++ b/Documentation/networking/xfrm_device.rst @@ -1,4 +1,5 @@ .. SPDX-License-Identifier: GPL-2.0 +.. _xfrm_device: =============================================== XFRM device - offloading the IPsec computations |