summaryrefslogtreecommitdiff
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/device_drivers/ethernet/amazon/ena.rst34
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ice.rst18
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst45
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst10
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst67
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst22
-rw-r--r--Documentation/networking/ip-sysctl.rst36
-rw-r--r--Documentation/networking/scaling.rst4
-rw-r--r--Documentation/networking/tls-handshake.rst5
9 files changed, 193 insertions, 48 deletions
diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index 8bcb173e0353..5eaa3ab6c73e 100644
--- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -38,6 +38,7 @@ debug logs.
Some of the ENA devices support a working mode called Low-latency
Queue (LLQ), which saves several more microseconds.
+
ENA Source Code Directory Structure
===================================
@@ -205,6 +206,8 @@ Adaptive coalescing can be switched on/off through `ethtool(8)`'s
More information about Adaptive Interrupt Moderation (DIM) can be found in
Documentation/networking/net_dim.rst
+.. _`RX copybreak`:
+
RX copybreak
============
The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
@@ -315,3 +318,34 @@ Rx
- The new SKB is updated with the necessary information (protocol,
checksum hw verify result, etc), and then passed to the network
stack, using the NAPI interface function :code:`napi_gro_receive()`.
+
+Dynamic RX Buffers (DRB)
+------------------------
+
+Each RX descriptor in the RX ring is a single memory page (which is either 4KB
+or 16KB long depending on system's configurations).
+To reduce the memory allocations required when dealing with a high rate of small
+packets, the driver tries to reuse the remaining RX descriptor's space if more
+than 2KB of this page remain unused.
+
+A simple example of this mechanism is the following sequence of events:
+
+::
+
+ 1. Driver allocates page-sized RX buffer and passes it to hardware
+ +----------------------+
+ |4KB RX Buffer |
+ +----------------------+
+
+ 2. A 300Bytes packet is received on this buffer
+
+ 3. The driver increases the ref count on this page and returns it back to
+ HW as an RX buffer of size 4KB - 300Bytes = 3796 Bytes
+ +----+--------------------+
+ |****|3796 Bytes RX Buffer|
+ +----+--------------------+
+
+This mechanism isn't used when an XDP program is loaded, or when the
+RX packet is less than rx_copybreak bytes (in which case the packet is
+copied out of the RX buffer into the linear part of a new skb allocated
+for it and the RX buffer remains the same size, see `RX copybreak`_).
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
index 69695e5511f4..e4d065c55ea8 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
@@ -84,24 +84,6 @@ Once the VM shuts down, or otherwise releases the VF, the command will
complete.
-Important notes for SR-IOV and Link Aggregation
------------------------------------------------
-Link Aggregation is mutually exclusive with SR-IOV.
-
-- If Link Aggregation is active, SR-IOV VFs cannot be created on the PF.
-- If SR-IOV is active, you cannot set up Link Aggregation on the interface.
-
-Bridging and MACVLAN are also affected by this. If you wish to use bridging or
-MACVLAN with SR-IOV, you must set up bridging or MACVLAN before enabling
-SR-IOV. If you are using bridging or MACVLAN in conjunction with SR-IOV, and
-you want to remove the interface from the bridge or MACVLAN, you must follow
-these steps:
-
-1. Destroy SR-IOV VFs if they exist
-2. Remove the interface from the bridge or MACVLAN
-3. Recreate SRIOV VFs as needed
-
-
Additional Features and Configurations
======================================
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
index 5ba9015336e2..bfd233cfac35 100644
--- a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
@@ -13,6 +13,7 @@ Contents
- `Drivers`_
- `Basic packet flow`_
- `Devlink health reporters`_
+- `Quality of service`_
Overview
========
@@ -287,3 +288,47 @@ For example::
NIX_AF_ERR:
NIX Error Interrupt Reg : 64
Rx on unmapped PF_FUNC
+
+
+Quality of service
+==================
+
+
+Hardware algorithms used in scheduling
+--------------------------------------
+
+octeontx2 silicon and CN10K transmit interface consists of five transmit levels
+starting from SMQ/MDQ, TL4 to TL1. Each packet will traverse MDQ, TL4 to TL1
+levels. Each level contains an array of queues to support scheduling and shaping.
+The hardware uses the below algorithms depending on the priority of scheduler queues.
+once the usercreates tc classes with different priorities, the driver configures
+schedulers allocated to the class with specified priority along with rate-limiting
+configuration.
+
+1. Strict Priority
+
+ - Once packets are submitted to MDQ, hardware picks all active MDQs having different priority
+ using strict priority.
+
+2. Round Robin
+
+ - Active MDQs having the same priority level are chosen using round robin.
+
+
+Setup HTB offload
+-----------------
+
+1. Enable HW TC offload on the interface::
+
+ # ethtool -K <interface> hw-tc-offload on
+
+2. Crate htb root::
+
+ # tc qdisc add dev <interface> clsact
+ # tc qdisc replace dev <interface> root handle 1: htb offload
+
+3. Create tc classes with different priorities::
+
+ # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1
+
+ # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
index 6b2d1fe74ecf..a395df9c2751 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -797,6 +797,16 @@ Counters on the NIC port that is connected to a eSwitch.
RoCE/UD/RC traffic) [#accel]_.
- Acceleration
+ * - `vport_loopback_packets`
+ - Unicast, multicast and broadcast packets that were loop-back (received
+ and transmitted), IB/Eth [#accel]_.
+ - Acceleration
+
+ * - `vport_loopback_bytes`
+ - Unicast, multicast and broadcast bytes that were loop-back (received
+ and transmitted), IB/Eth [#accel]_.
+ - Acceleration
+
* - `rx_steer_missed_packets`
- Number of packets that was received by the NIC, however was discarded
because it did not match any flow in the NIC flow table.
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst
index 3a7a714cc08f..a4edf908b707 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst
@@ -40,6 +40,7 @@ flow_steering_mode: Device flow steering mode
---------------------------------------------
The flow steering mode parameter controls the flow steering mode of the driver.
Two modes are supported:
+
1. 'dmfs' - Device managed flow steering.
2. 'smfs' - Software/Driver managed flow steering.
@@ -99,6 +100,7 @@ between representors and stacked devices.
By default metadata is enabled on the supported devices in E-switch.
Metadata is applicable only for E-switch in switchdev mode and
users may disable it when NONE of the below use cases will be in use:
+
1. HCA is in Dual/multi-port RoCE mode.
2. VF/SF representor bonding (Usually used for Live migration)
3. Stacked devices
@@ -180,7 +182,8 @@ User commands examples:
$ devlink health diagnose pci/0000:82:00.0 reporter tx
-NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
+.. note::
+ This command has valid output only when interface is up, otherwise the command has empty output.
- Show number of tx errors indicated, number of recover flows ended successfully,
is autorecover enabled and graceful period from last recover::
@@ -232,8 +235,9 @@ User commands examples:
$ devlink health dump show pci/0000:82:00.0 reporter fw
-NOTE: This command can run only on the PF which has fw tracer ownership,
-running it on other PF or any VF will return "Operation not permitted".
+.. note::
+ This command can run only on the PF which has fw tracer ownership,
+ running it on other PF or any VF will return "Operation not permitted".
fw fatal reporter
-----------------
@@ -256,7 +260,8 @@ User commands examples:
$ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
-NOTE: This command can run only on PF.
+.. note::
+ This command can run only on PF.
vnic reporter
-------------
@@ -265,28 +270,44 @@ It is responsible for querying the vnic diagnostic counters from fw and displayi
them in realtime.
Description of the vnic counters:
-total_q_under_processor_handle: number of queues in an error state due to
-an async error or errored command.
-send_queue_priority_update_flow: number of QP/SQ priority/SL update
-events.
-cq_overrun: number of times CQ entered an error state due to an
-overflow.
-async_eq_overrun: number of times an EQ mapped to async events was
-overrun.
-comp_eq_overrun: number of times an EQ mapped to completion events was
-overrun.
-quota_exceeded_command: number of commands issued and failed due to quota
-exceeded.
-invalid_command: number of commands issued and failed dues to any reason
-other than quota exceeded.
-nic_receive_steering_discard: number of packets that completed RX flow
-steering but were discarded due to a mismatch in flow table.
+
+- total_q_under_processor_handle
+ number of queues in an error state due to
+ an async error or errored command.
+- send_queue_priority_update_flow
+ number of QP/SQ priority/SL update events.
+- cq_overrun
+ number of times CQ entered an error state due to an overflow.
+- async_eq_overrun
+ number of times an EQ mapped to async events was overrun.
+ comp_eq_overrun number of times an EQ mapped to completion events was
+ overrun.
+- quota_exceeded_command
+ number of commands issued and failed due to quota exceeded.
+- invalid_command
+ number of commands issued and failed dues to any reason other than quota
+ exceeded.
+- nic_receive_steering_discard
+ number of packets that completed RX flow
+ steering but were discarded due to a mismatch in flow table.
+- generated_pkt_steering_fail
+ number of packets generated by the VNIC experiencing unexpected steering
+ failure (at any point in steering flow).
+- handled_pkt_steering_fail
+ number of packets handled by the VNIC experiencing unexpected steering
+ failure (at any point in steering flow owned by the VNIC, including the FDB
+ for the eswitch owner).
User commands examples:
-- Diagnose PF/VF vnic counters
+
+- Diagnose PF/VF vnic counters::
+
$ devlink health diagnose pci/0000:82:00.1 reporter vnic
+
- Diagnose representor vnic counters (performed by supplying devlink port of the
- representor, which can be obtained via devlink port command)
+ representor, which can be obtained via devlink port command)::
+
$ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic
-NOTE: This command can run over all interfaces such as PF/VF and representor ports.
+.. note::
+ This command can run over all interfaces such as PF/VF and representor ports.
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
index 01deedb71597..6e3f5ee8b0d0 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
@@ -45,6 +45,28 @@ Following bridge VLAN functions are supported by mlx5:
Subfunction
===========
+Subfunction which are spawned over the E-switch are created only with devlink
+device, and by default all the SF auxiliary devices are disabled.
+This will allow user to configure the SF before the SF have been fully probed,
+which will save time.
+
+Usage example:
+
+- Create SF::
+
+ $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11
+ $ devlink port function set pci/0000:08:00.0/32768 hw_addr 00:00:00:00:00:11 state active
+
+- Enable ETH auxiliary device::
+
+ $ devlink dev param set auxiliary/mlx5_core.sf.1 name enable_eth value true cmode driverinit
+
+- Now, in order to fully probe the SF, use devlink reload::
+
+ $ devlink dev reload auxiliary/mlx5_core.sf.1
+
+mlx5 supports ETH,rdma and vdpa (vnet) auxiliary devices devlink params (see :ref:`Documentation/networking/devlink/devlink-params.rst <devlink_params_generic>`).
+
mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
A subfunction has its own function capabilities and its own resources. This
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 6ec06a33688a..4a010a7cde7f 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -881,9 +881,10 @@ tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
tcp_syn_retries - INTEGER
Number of times initial SYNs for an active TCP connection attempt
will be retransmitted. Should not be higher than 127. Default value
- is 6, which corresponds to 63seconds till the last retransmission
- with the current initial RTO of 1second. With this the final timeout
- for an active TCP connection attempt will happen after 127seconds.
+ is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4)
+ till the last retransmission with the current initial RTO of 1second.
+ With this the final timeout for an active TCP connection attempt
+ will happen after 131seconds.
tcp_timestamps - INTEGER
Enable timestamps as defined in RFC1323.
@@ -946,6 +947,16 @@ tcp_pacing_ca_ratio - INTEGER
Default: 120
+tcp_syn_linear_timeouts - INTEGER
+ The number of times for an active TCP connection to retransmit SYNs with
+ a linear backoff timeout before defaulting to an exponential backoff
+ timeout. This has no effect on SYNACK at the passive TCP side.
+
+ With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would
+ expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts,
+ and the first exponential backoff using 2^0 * initial_RTO).
+ Default: 4
+
tcp_tso_win_divisor - INTEGER
This allows control over what percentage of the congestion window
can be consumed by a single TSO frame.
@@ -970,6 +981,21 @@ tcp_tw_reuse - INTEGER
tcp_window_scaling - BOOLEAN
Enable window scaling as defined in RFC1323.
+tcp_shrink_window - BOOLEAN
+ This changes how the TCP receive window is calculated.
+
+ RFC 7323, section 2.4, says there are instances when a retracted
+ window can be offered, and that TCP implementations MUST ensure
+ that they handle a shrinking window, as specified in RFC 1122.
+
+ - 0 - Disabled. The window is never shrunk.
+ - 1 - Enabled. The window is shrunk when necessary to remain within
+ the memory limit set by autotuning (sk_rcvbuf).
+ This only occurs if a non-zero receive window
+ scaling factor is also in effect.
+
+ Default: 0
+
tcp_wmem - vector of 3 INTEGERs: min, default, max
min: Amount of memory reserved for send buffers for TCP sockets.
Each TCP socket has rights to use it due to fact of its birth.
@@ -1352,8 +1378,8 @@ ping_group_range - 2 INTEGERS
Restrict ICMP_PROTO datagram sockets to users in the group range.
The default is "1 0", meaning, that nobody (not even root) may
create ping sockets. Setting it to "100 100" would grant permissions
- to the single group. "0 4294967295" would enable it for the world, "100
- 4294967295" would enable it for the users, but not daemons.
+ to the single group. "0 4294967294" would enable it for the world, "100
+ 4294967294" would enable it for the users, but not daemons.
tcp_early_demux - BOOLEAN
Enable early demux for established TCP sockets.
diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
index 3d435caa3ef2..92c9fb46d6a2 100644
--- a/Documentation/networking/scaling.rst
+++ b/Documentation/networking/scaling.rst
@@ -269,8 +269,8 @@ a single application thread handles flows with many different flow hashes.
rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
-and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
-and tcp_splice_read()).
+and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
+tcp_splice_read()).
When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
diff --git a/Documentation/networking/tls-handshake.rst b/Documentation/networking/tls-handshake.rst
index a2817a88e905..6f5ea1646a47 100644
--- a/Documentation/networking/tls-handshake.rst
+++ b/Documentation/networking/tls-handshake.rst
@@ -53,6 +53,7 @@ fills in a structure that contains the parameters of the request:
struct socket *ta_sock;
tls_done_func_t ta_done;
void *ta_data;
+ const char *ta_peername;
unsigned int ta_timeout_ms;
key_serial_t ta_keyring;
key_serial_t ta_my_cert;
@@ -71,6 +72,10 @@ instantiated a struct file in sock->file.
has completed. Further explanation of this function is in the "Handshake
Completion" sesction below.
+The consumer can provide a NUL-terminated hostname in the @ta_peername
+field that is sent as part of ClientHello. If no peername is provided,
+the DNS hostname associated with the server's IP address is used instead.
+
The consumer can fill in the @ta_timeout_ms field to force the servicing
handshake agent to exit after a number of milliseconds. This enables the
socket to be fully closed once both the kernel and the handshake agent