summaryrefslogtreecommitdiff
path: root/Documentation/networking/ip-sysctl.rst
AgeCommit message (Collapse)AuthorFilesLines
2022-09-20tcp: Introduce optional per-netns ehash.Kuniyuki Iwashima1-0/+29
The more sockets we have in the hash table, the longer we spend looking up the socket. While running a number of small workloads on the same host, they penalise each other and cause performance degradation. The root cause might be a single workload that consumes much more resources than the others. It often happens on a cloud service where different workloads share the same computing resource. On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash entries), after running iperf3 in different netns, creating 24Mi sockets without data transfer in the root netns causes about 10% performance regression for the iperf3's connection. thash_entries sockets length Gbps 524288 1 1 50.7 24Mi 48 45.1 It is basically related to the length of the list of each hash bucket. For testing purposes to see how performance drops along the length, I set 131072 (1Mi / 8) to thash_entries, and here's the result. thash_entries sockets length Gbps 131072 1 1 50.7 1Mi 8 49.9 2Mi 16 48.9 4Mi 32 47.3 8Mi 64 44.6 16Mi 128 40.6 24Mi 192 36.3 32Mi 256 32.5 40Mi 320 27.0 48Mi 384 25.0 To resolve the socket lookup degradation, we introduce an optional per-netns hash table for TCP, but it's just ehash, and we still share the global bhash, bhash2 and lhash2. With a smaller ehash, we can look up non-listener sockets faster and isolate such noisy neighbours. In addition, we can reduce lock contention. We can control the ehash size by a new sysctl knob. However, depending on workloads, it will require very sensitive tuning, so we disable the feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover, we can fall back to using the global ehash in case we fail to allocate enough memory for a new ehash. The maximum size is 16Mi, which is large enough that even if we have 48Mi sockets, the average list length is 3, and regression would be less than 1%. We can check the current ehash size by another read-only sysctl knob, net.ipv4.tcp_ehash_entries. A negative value means the netns shares the global ehash (per-netns ehash is disabled or failed to allocate memory). # dmesg | cut -d ' ' -f 5- | grep "established hash" TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage) # sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries # sysctl net.ipv4.tcp_child_ehash_entries net.ipv4.tcp_child_ehash_entries = 0 # disabled by default # ip netns add test1 # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = -524288 # share the global ehash # sysctl -w net.ipv4.tcp_child_ehash_entries=100 net.ipv4.tcp_child_ehash_entries = 100 # ip netns add test2 # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets When more than two processes in the same netns create per-netns ehash concurrently with different sizes, we need to guarantee the size in one of the following ways: 1) Share the global ehash and create per-netns ehash First, unshare() with tcp_child_ehash_entries==0. It creates dedicated netns sysctl knobs where we can safely change tcp_child_ehash_entries and clone()/unshare() to create a per-netns ehash. 2) Control write on sysctl by BPF We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on sysctl knobs. Note that the global ehash allocated at the boot time is spread over available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate pages for each per-netns ehash depending on the current process's NUMA policy. By default, the allocation is done in the local node only, so the per-netns hash table could fully reside on a random node. Thus, depending on the NUMA policy the netns is created with and the CPU the current thread is running on, we could see some performance differences for highly optimised networking applications. Note also that the default values of two sysctl knobs depend on the ehash size and should be tuned carefully: tcp_max_tw_buckets : tcp_child_ehash_entries / 2 tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128) As a bonus, we can dismantle netns faster. Currently, while destroying netns, we call inet_twsk_purge(), which walks through the global ehash. It can be potentially big because it can have many sockets other than TIME_WAIT in all netns. Splitting ehash changes that situation, where it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets in each netns. With regard to this, we do not free the per-netns ehash in inet_twsk_kill() to avoid UAF while iterating the per-netns ehash in inet_twsk_purge(). Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to keep it protocol-family-independent. In the future, we could optimise ehash lookup/iteration further by removing netns comparison for the per-netns ehash. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-01tcp: make global challenge ack rate limitation per net-ns and default disabledEric Dumazet1-1/+4
Because per host rate limiting has been proven problematic (side channel attacks can be based on it), per host rate limiting of challenge acks ideally should be per netns and turned off by default. This is a long due followup of following commits: 083ae308280d ("tcp: enable per-socket rate limiting of all 'challenge acks'") f2b2c582e824 ("tcp: mitigate ACK loops for connections as tcp_sock") 75ff39ccc1bd ("tcp: make challenge acks less predictable") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jason Baron <jbaron@akamai.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-29Documentation: Describe net.ipv4.tcp_reflect_tos.Guillaume Nault1-0/+10
The tcp_reflect_tos option was introduced in Linux 5.10 but was still undocumented. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+8
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-24Documentation: fix sctp_wmem in ip-sysctl.rstXin Long1-1/+8
Since commit 1033990ac5b2 ("sctp: implement memory accounting on tx path"), SCTP has supported memory accounting on tx path where 'sctp_wmem' is used by sk_wmem_schedule(). So we should fix the description for this option in ip-sysctl.rst accordingly. v1->v2: - Improve the description as Marcelo suggested. Fixes: 1033990ac5b2 ("sctp: implement memory accounting on tx path") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-5/+1
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-20Documentation: fix udp_wmem_min in ip-sysctl.rstXin Long1-5/+1
UDP doesn't support tx memory accounting, and sysctl udp_wmem_min is not really used anywhere. So we should fix the description in ip-sysctl.rst accordingly. Fixes: 95766fff6b9a ("[UDP]: Add memory accounting.") Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/c880a963d9b1fb5f442ae3c9e4dfa70d45296a16.1658167019.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-16net: ipv6: new accept_untracked_na option to accept na only if in-networkJaehee Park1-21/+30
This patch adds a third knob, '2', which extends the accept_untracked_na option to learn a neighbor only if the src ip is in the same subnet as an address configured on the interface that received the neighbor advertisement. This is similar to the arp_accept configuration for ipv4. Signed-off-by: Jaehee Park <jhpark1013@gmail.com> Suggested-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-16net: ipv4: new arp_accept option to accept garp only if in-networkJaehee Park1-3/+6
In many deployments, we want the option to not learn a neighbor from garp if the src ip is not in the same subnet as an address configured on the interface that received the garp message. net.ipv4.arp_accept sysctl is currently used to control creation of a neigh from a received garp packet. This patch adds a new option '2' to net.ipv4.arp_accept which extends option '1' by including the subnet check. Signed-off-by: Jaehee Park <jhpark1013@gmail.com> Suggested-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+2
include/net/sock.h 310731e2f161 ("net: Fix data-races around sysctl_mem.") e70f3c701276 ("Revert "net: set SK_MEM_QUANTUM to 4096"") https://lore.kernel.org/all/20220711120211.7c8b7cba@canb.auug.org.au/ net/ipv4/fib_semantics.c 747c14307214 ("ip: fix dflt addr selection for connected nexthop") d62607c3fe45 ("net: rename reference+tracking helpers") net/tls/tls.h include/net/tls.h 3d8c51b25a23 ("net/tls: Check for errors in tls_device_init") 587903142308 ("tls: create an internal header") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-13ipv4: Fix data-races around sysctl_ip_dynaddr.Kuniyuki Iwashima1-1/+1
While reading sysctl_ip_dynaddr, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08cipso: Fix data-races around sysctl.Kuniyuki Iwashima1-1/+1
While reading cipso sysctl variables, they can be changed concurrently. So, we need to add READ_ONCE() to avoid data-races. Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-30net, neigh: introduce interval_probe_time_ms for periodic probeYuwei Wang1-0/+6
commit ed6cd6a17896 ("net, neigh: Set lower cap for neigh_managed_work rearming") fixed a case when DELAY_PROBE_TIME is configured to 0, the processing of the system work queue hog CPU to 100%, and further more we should introduce a new option used by periodic probe Signed-off-by: Yuwei Wang <wangyuweihx@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-06-11Documentation: add description for net.sctp.ecn_enableXin Long1-0/+12
Describe it in networking/ip-sysctl.rst like other SCTP options. Fixes: 2f5268a9249b ("sctp: allow users to set netns ecn flag with sysctl") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-11Documentation: add description for net.sctp.intl_enableXin Long1-0/+14
Describe it in networking/ip-sysctl.rst like other SCTP options. We need to document this especially as when using the feature of User Message Interleaving, some socket options also needs to be set. Fixes: 463118c34a35 ("sctp: support sysctl to allow users to use stream interleave") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-11Documentation: add description for net.sctp.reconf_enableXin Long1-0/+11
Describe it in networking/ip-sysctl.rst like other SCTP options. Fixes: c0d8bab6ae51 ("sctp: add get and set sockopt for reconf_enable") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-31net/ipv6: Expand and rename accept_unsolicited_na to accept_untracked_naArun Ajith S1-14/+9
RFC 9131 changes default behaviour of handling RX of NA messages when the corresponding entry is absent in the neighbour cache. The current implementation is limited to accept just unsolicited NAs. However, the RFC is more generic where it also accepts solicited NAs. Both types should result in adding a STALE entry for this case. Expand accept_untracked_na behaviour to also accept solicited NAs to be compliant with the RFC and rename the sysctl knob to accept_untracked_na. Fixes: f9a2fb73318e ("net/ipv6: Introduce accept_unsolicited_na knob to implement router-side changes for RFC9131") Signed-off-by: Arun Ajith S <aajith@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220530101414.65439-1-aajith@arista.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni1-0/+7
drivers/net/ethernet/microchip/lan966x/lan966x_main.c d08ed852560e ("net: lan966x: Make sure to release ptp interrupt") c8349639324a ("net: lan966x: Add FDMA functionality") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-20doc/ip-sysctl: add bc_forwardingNicolas Dichtel1-0/+7
Let's describe this sysctl. Fixes: 5cbf777cfdf6 ("route: add support for directed broadcast forwarding") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17net/ipv6: Introduce accept_unsolicited_na knob to implement router-side ↵Arun Ajith S1-0/+27
changes for RFC9131 Add a new neighbour cache entry in STALE state for routers on receiving an unsolicited (gratuitous) neighbour advertisement with target link-layer-address option specified. This is similar to the arp_accept configuration for IPv4. A new sysctl endpoint is created to turn on this behaviour: /proc/sys/net/ipv6/conf/interface/accept_unsolicited_na. Signed-off-by: Arun Ajith S <aajith@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-10tcp: adjust TSO packet sizes based on min_rttEric Dumazet1-0/+23
Back when tcp_tso_autosize() and TCP pacing were introduced, our focus was really to reduce burst sizes for long distance flows. The simple heuristic of using sk_pacing_rate/1024 has worked well, but can lead to too small packets for hosts in the same rack/cluster, when thousands of flows compete for the bottleneck. Neal Cardwell had the idea of making the TSO burst size a function of both sk_pacing_rate and tcp_min_rtt() Indeed, for local flows, sending bigger bursts is better to reduce cpu costs, as occasional losses can be repaired quite fast. This patch is based on Neal Cardwell implementation done more than two years ago. bbr is adjusting max_pacing_rate based on measured bandwidth, while cubic would over estimate max_pacing_rate. /proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable this new feature, in logarithmic steps. Tested: 100Gbit NIC, two hosts in the same rack, 4K MTU. 600 flows rate-limited to 20000000 bytes per second. Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO) ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered" 96005 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000': 65,945.29 msec task-clock # 2.845 CPUs utilized 1,314,632 context-switches # 19935.279 M/sec 5,292 cpu-migrations # 80.249 M/sec 940,641 page-faults # 14264.023 M/sec 201,117,030,926 cycles # 3049769.216 GHz (83.45%) 17,699,435,405 stalled-cycles-frontend # 8.80% frontend cycles idle (83.48%) 136,584,015,071 stalled-cycles-backend # 67.91% backend cycles idle (83.44%) 53,809,530,436 instructions # 0.27 insn per cycle # 2.54 stalled cycles per insn (83.36%) 9,062,315,523 branches # 137422329.563 M/sec (83.22%) 153,008,621 branch-misses # 1.69% of all branches (83.32%) 23.182970846 seconds time elapsed TcpInSegs 15648792 0.0 TcpOutSegs 58659110 0.0 # Average of 3.7 4K segments per TSO packet TcpExtTCPDelivered 58654791 0.0 TcpExtTCPDeliveredCE 19 0.0 After patch: ~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered" 96046 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000': 48,982.58 msec task-clock # 2.104 CPUs utilized 186,014 context-switches # 3797.599 M/sec 3,109 cpu-migrations # 63.472 M/sec 941,180 page-faults # 19214.814 M/sec 153,459,763,868 cycles # 3132982.807 GHz (83.56%) 12,069,861,356 stalled-cycles-frontend # 7.87% frontend cycles idle (83.32%) 120,485,917,953 stalled-cycles-backend # 78.51% backend cycles idle (83.24%) 36,803,672,106 instructions # 0.24 insn per cycle # 3.27 stalled cycles per insn (83.18%) 5,947,266,275 branches # 121417383.427 M/sec (83.64%) 87,984,616 branch-misses # 1.48% of all branches (83.43%) 23.281200256 seconds time elapsed TcpInSegs 1434706 0.0 TcpOutSegs 58883378 0.0 # Average of 41 4K segments per TSO packet TcpExtTCPDelivered 58878971 0.0 TcpExtTCPDeliveredCE 9664 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-30Documentation: fix outdated interpretation of ip_no_pmtu_discxu xin1-2/+4
The updating way of pmtu has changed, but documentation is still in the old way. So this patch updates the interpretation of ip_no_pmtu_disc and min_pmtu. See commit 28d35bcdd3925 ("net: ipv4: don't let PMTU updates increase route MTU") Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-05net: udp: correct the document for udp_memMenglong Dong1-4/+2
udp_mem is a vector of 3 INTEGERs, which is used to limit the number of pages allowed for queueing by all UDP sockets. However, sk_has_memory_pressure() in __sk_mem_raise_allocated() always return false for udp, as memory pressure is not supported by udp, which means that __sk_mem_raise_allocated() will fail once pages allocated for udp socket exceeds udp_mem[0]. Therefor, udp_mem[0] is the only one that limit the number of pages. However, the document of udp_mem just express that udp_mem[2] is the limitation. So, just fix it. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-02net: ndisc: introduce ndisc_evict_nocarrier sysctl parameterJames Prestwood1-0/+9
In most situations the neighbor discovery cache should be cleared on a NOCARRIER event which is currently done unconditionally. But for wireless roams the neighbor discovery cache can and should remain intact since the underlying network has not changed. This patch introduces a sysctl option ndisc_evict_nocarrier which can be disabled by a wireless supplicant during a roam. This allows packets to be sent after a roam immediately without having to wait for neighbor discovery. A user reported roughly a 1 second delay after a roam before packets could be sent out (note, on IPv4). This delay was due to the ARP cache being cleared. During testing of this same scenario using IPv6 no delay was noticed, but regardless there is no reason to clear the ndisc cache for wireless roams. Signed-off-by: James Prestwood <prestwoj@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-02net: arp: introduce arp_evict_nocarrier sysctl parameterJames Prestwood1-0/+9
This change introduces a new sysctl parameter, arp_evict_nocarrier. When set (default) the ARP cache will be cleared on a NOCARRIER event. This new option has been defaulted to '1' which maintains existing behavior. Clearing the ARP cache on NOCARRIER is relatively new, introduced by: commit 859bd2ef1fc1110a8031b967ee656c53a6260a76 Author: David Ahern <dsahern@gmail.com> Date: Thu Oct 11 20:33:49 2018 -0700 net: Evict neighbor entries on carrier down The reason for this changes is to prevent the ARP cache from being cleared when a wireless device roams. Specifically for wireless roams the ARP cache should not be cleared because the underlying network has not changed. Clearing the ARP cache in this case can introduce significant delays sending out packets after a roam. A user reported such a situation here: https://lore.kernel.org/linux-wireless/CACsRnHWa47zpx3D1oDq9JYnZWniS8yBwW1h0WAVZ6vrbwL_S0w@mail.gmail.com/ After some investigation it was found that the kernel was holding onto packets until ARP finished which resulted in this 1 second delay. It was also found that the first ARP who-has was never responded to, which is actually what caues the delay. This change is more or less working around this behavior, but again, there is no reason to clear the cache on a roam anyways. As for the unanswered who-has, we know the packet made it OTA since it was seen while monitoring. Why it never received a response is unknown. In any case, since this is a problem on the AP side of things all that can be done is to work around it until it is solved. Some background on testing/reproducing the packet delay: Hardware: - 2 access points configured for Fast BSS Transition (Though I don't see why regular reassociation wouldn't have the same behavior) - Wireless station running IWD as supplicant - A device on network able to respond to pings (I used one of the APs) Procedure: - Connect to first AP - Ping once to establish an ARP entry - Start a tcpdump - Roam to second AP - Wait for operstate UP event, and note the timestamp - Start pinging Results: Below is the tcpdump after UP. It was recorded the interface went UP at 10:42:01.432875. 10:42:01.461871 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28 10:42:02.497976 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28 10:42:02.507162 ARP, Reply 192.168.254.1 is-at ac:86:74:55:b0:20, length 46 10:42:02.507185 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 1, length 64 10:42:02.507205 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 2, length 64 10:42:02.507212 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 3, length 64 10:42:02.507219 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 4, length 64 10:42:02.507225 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 5, length 64 10:42:02.507232 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 6, length 64 10:42:02.515373 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 1, length 64 10:42:02.521399 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 2, length 64 10:42:02.521612 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 3, length 64 10:42:02.521941 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 4, length 64 10:42:02.522419 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 5, length 64 10:42:02.523085 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 6, length 64 You can see the first ARP who-has went out very quickly after UP, but was never responded to. Nearly a second later the kernel retries and gets a response. Only then do the ping packets go out. If an ARP entry is manually added prior to UP (after the cache is cleared) it is seen that the first ping is never responded to, so its not only an issue with ARP but with data packets in general. As mentioned prior, the wireless interface was also monitored to verify the ping/ARP packet made it OTA which was observed to be true. Signed-off-by: James Prestwood <prestwoj@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-23tcp: remove sk_{tr}x_skb_cacheEric Dumazet1-8/+0
This reverts the following patches : - commit 2e05fcae83c4 ("tcp: fix compile error if !CONFIG_SYSCTL") - commit 4f661542a402 ("tcp: fix zerocopy and notsent_lowat issues") - commit 472c2e07eef0 ("tcp: add one skb cache for tx") - commit 8b27dae5a2e8 ("tcp: add one skb cache for rx") Having a cache of one skb (in each direction) per TCP socket is fragile, since it can cause a significant increase of memory needs, and not good enough for high speed flows anyway where more than one skb is needed. We want instead to add a generic infrastructure, with more flexible per-cpu caches, for alien NUMA nodes. Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-1/+1
Conflicts are simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22tcp: disable TFO blackhole logic by defaultWei Wang1-1/+1
Multiple complaints have been raised from the TFO users on the internet stating that the TFO blackhole logic is too aggressive and gets falsely triggered too often. (e.g. https://blog.apnic.net/2021/07/05/tcp-fast-open-not-so-fast/) Considering that most middleboxes no longer drop TFO packets, we decide to disable the blackhole logic by setting /proc/sys/net/ipv4/tcp_fastopen_blackhole_timeout_set to 0 by default. Fixes: cf1ef3f0719b4 ("net/tcp_fastopen: Disable active side TFO in certain scenarios") Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-21ipv6: ioam: Documentation for new IOAM sysctlsJustin Iurman1-0/+17
Add documentation for new IOAM sysctls: - ioam6_id and ioam6_id_wide: two per-namespace sysctls - ioam6_enabled, ioam6_id and ioam6_id_wide: three per-interface sysctls Example of IOAM configuration based on the following simple topology: _____ _____ _____ | | eth0 eth0 | | eth1 eth0 | | | A |.----------.| B |.----------.| C | |_____| |_____| |_____| 1) Node and interface IDs can be configured for IOAM: # IOAM ID of A = 1, IOAM ID of A.eth0 = 11 (A) sysctl -w net.ipv6.ioam6_id=1 (A) sysctl -w net.ipv6.conf.eth0.ioam6_id=11 # IOAM ID of B = 2, IOAM ID of B.eth0 = 21, IOAM ID of B.eth1 = 22 (B) sysctl -w net.ipv6.ioam6_id=2 (B) sysctl -w net.ipv6.conf.eth0.ioam6_id=21 (B) sysctl -w net.ipv6.conf.eth1.ioam6_id=22 # IOAM ID of C = 3, IOAM ID of C.eth0 = 31 (C) sysctl -w net.ipv6.ioam6_id=3 (C) sysctl -w net.ipv6.conf.eth0.ioam6_id=31 Note that "_wide" IDs equivalents can be configured the same way. 2) Each node can be configured to form an IOAM domain. For instance, we allow IOAM from A to C only (not the reverse path), i.e. enable IOAM on ingress for B.eth0 and C.eth0: (B) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1 (C) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1 3) An IOAM domain (e.g. ID=123) is defined and made known to each node: (A) ip ioam namespace add 123 (B) ip ioam namespace add 123 (C) ip ioam namespace add 123 4) Finally, an IOAM Pre-allocated Trace can be inserted in traffic sent by A when C (e.g. db02::2) is the destination: (A) ip -6 route add db02::2/128 encap ioam6 trace type 0x800000 ns 123 size 12 dev eth0 Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24sctp: send the next probe immediately once the last one is ackedXin Long1-4/+8
These is no need to wait for 'interval' period for the next probe if the last probe is already acked in search state. The 'interval' period waiting should be only for probe failure timeout and the current pmtu check when it's in search complete state. This change will shorten the probe time a lot in search state, and also fix the document accordingly. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22sctp: add probe_interval in sysctl and sock/asoc/transportXin Long1-0/+8
PLPMTUD can be enabled by doing 'sysctl -w net.sctp.probe_interval=n'. 'n' is the interval for PLPMTUD probe timer in milliseconds, and it can't be less than 5000 if it's not 0. All asoc/transport's PLPMTUD in a new socket will be enabled by default. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-15net: Introduce net.ipv4.tcp_migrate_req.Kuniyuki Iwashima1-0/+25
This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this option is enabled or eBPF program is attached, we will be able to migrate child sockets from a listener to another in the same reuseport group after close() or shutdown() syscalls. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-2-kuniyu@amazon.co.jp
2021-05-18ipv6: Add custom multipath hash policyIdo Schimmel1-0/+2
Add a new multipath hash policy where the packet fields used for hash calculation are determined by user space via the fib_multipath_hash_fields sysctl that was introduced in the previous patch. The current set of available packet fields includes both outer and inner fields, which requires two invocations of the flow dissector. Avoid unnecessary dissection of the outer or inner flows by skipping dissection if none of the outer or inner fields are required. In accordance with the existing policies, when an skb is not available, packet fields are extracted from the provided flow key. In which case, only outer fields are considered. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18ipv6: Add a sysctl to control multipath hash fieldsIdo Schimmel1-0/+27
A subsequent patch will add a new multipath hash policy where the packet fields used for multipath hash calculation are determined by user space. This patch adds a sysctl that allows user space to set these fields. The packet fields are represented using a bitmask and are common between IPv4 and IPv6 to allow user space to use the same numbering across both protocols. For example, to hash based on standard 5-tuple: # sysctl -w net.ipv6.fib_multipath_hash_fields=0x0037 net.ipv6.fib_multipath_hash_fields = 0x0037 To avoid introducing holes in 'struct netns_sysctl_ipv6', move the 'bindv6only' field after the multipath hash fields. The kernel rejects unknown fields, for example: # sysctl -w net.ipv6.fib_multipath_hash_fields=0x1000 sysctl: setting key "net.ipv6.fib_multipath_hash_fields": Invalid argument Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18ipv4: Add custom multipath hash policyIdo Schimmel1-0/+2
Add a new multipath hash policy where the packet fields used for hash calculation are determined by user space via the fib_multipath_hash_fields sysctl that was introduced in the previous patch. The current set of available packet fields includes both outer and inner fields, which requires two invocations of the flow dissector. Avoid unnecessary dissection of the outer or inner flows by skipping dissection if none of the outer or inner fields are required. In accordance with the existing policies, when an skb is not available, packet fields are extracted from the provided flow key. In which case, only outer fields are considered. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18ipv4: Add a sysctl to control multipath hash fieldsIdo Schimmel1-0/+27
A subsequent patch will add a new multipath hash policy where the packet fields used for multipath hash calculation are determined by user space. This patch adds a sysctl that allows user space to set these fields. The packet fields are represented using a bitmask and are common between IPv4 and IPv6 to allow user space to use the same numbering across both protocols. For example, to hash based on standard 5-tuple: # sysctl -w net.ipv4.fib_multipath_hash_fields=0x0037 net.ipv4.fib_multipath_hash_fields = 0x0037 The kernel rejects unknown fields, for example: # sysctl -w net.ipv4.fib_multipath_hash_fields=0x1000 sysctl: setting key "net.ipv4.fib_multipath_hash_fields": Invalid argument More fields can be added in the future, if needed. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-15/+0
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c - keep the ZC code, drop the code related to reinit net/bridge/netfilter/ebtables.c - fix build after move to net_generic Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-14doc: move seg6_flowlabel to seg6-sysctl.rstNicolas Dichtel1-15/+0
Let's have all seg6 sysctl at the same place. Fixes: a6dc6670cd7e ("ipv6: sr: Add documentation for seg_flowlabel sysctl") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-02net: document a side effect of ip_local_reserved_portsOtto Hollmann1-1/+3
If there is overlapp between ip_local_port_range and ip_local_reserved_ports with a huge reserved block, it will affect probability of selecting ephemeral ports, see file net/ipv4/inet_hashtables.c:723 int __inet_hash_connect( ... for (i = 0; i < remaining; i += 2, port += 2) { if (unlikely(port >= high)) port -= remaining; if (inet_is_local_reserved_port(net, port)) continue; E.g. if there is reserved block of 10000 ports, two ports right after this block will be 5000 more likely selected than others. If this was intended, we can/should add note into documentation as proposed in this commit, otherwise we should think about different solution. One option could be mapping table of continuous port ranges. Second option could be letting user to modify step (port+=2) in above loop, e.g. using new sysctl parameter. Signed-off-by: Otto Hollmann <otto.hollmann@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30net: add sysctl for enabling RFC 8335 PROBE messagesAndreas Roeseler1-0/+6
Section 8 of RFC 8335 specifies potential security concerns of responding to PROBE requests, and states that nodes that support PROBE functionality MUST be able to enable/disable responses and that responses MUST be disabled by default Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-4/+3
2021-02-12tcp: fix tcp_rmem documentationEric Dumazet1-4/+3
tcp_rmem[1] has been changed to 131072, we should update the documentation to reflect this. Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Zhibin Liu <zhibinliu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-09Documentation: networking: ip-sysctl: Document src_valid_mark sysctlJay Vosburgh1-0/+19
Provide documentation for src_valid_mark sysctl, which was added in commit 28f6aeea3f12 ("net: restore ip source validation"). Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-09IPv6: Extend 'fib_notify_on_flag_change' sysctlAmit Cohen1-1/+2
Add the value '2' to 'fib_notify_on_flag_change' to allow sending notifications only for failed route installation. Separate value is added for such notifications because there are less of them, so they do not impact performance and some users will find them more important. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-09IPv4: Extend 'fib_notify_on_flag_change' sysctlAmit Cohen1-1/+2
Add the value '2' to 'fib_notify_on_flag_change' to allow sending notifications only for failed route installation. Separate value is added for such notifications because there are less of them, so they do not impact performance and some users will find them more important. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-03net: ipv6: Emit notification when fib hardware flags are changedAmit Cohen1-0/+20
After installing a route to the kernel, user space receives an acknowledgment, which means the route was installed in the kernel, but not necessarily in hardware. The asynchronous nature of route installation in hardware can lead to a routing daemon advertising a route before it was actually installed in hardware. This can result in packet loss or mis-routed packets until the route is installed in hardware. It is also possible for a route already installed in hardware to change its action and therefore its flags. For example, a host route that is trapping packets can be "promoted" to perform decapsulation following the installation of an IPinIP/VXLAN tunnel. Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags are changed. The aim is to provide an indication to user-space (e.g., routing daemons) about the state of the route in hardware. Introduce a sysctl that controls this behavior. Keep the default value at 0 (i.e., do not emit notifications) for several reasons: - Multiple RTM_NEWROUTE notification per-route might confuse existing routing daemons. - Convergence reasons in routing daemons. - The extra notifications will negatively impact the insertion rate. - Not all users are interested in these notifications. Move fib6_info_hw_flags_set() to C file because it is no longer a short function. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03net: ipv4: Emit notification when fib hardware flags are changedAmit Cohen1-0/+20
After installing a route to the kernel, user space receives an acknowledgment, which means the route was installed in the kernel, but not necessarily in hardware. The asynchronous nature of route installation in hardware can lead to a routing daemon advertising a route before it was actually installed in hardware. This can result in packet loss or mis-routed packets until the route is installed in hardware. It is also possible for a route already installed in hardware to change its action and therefore its flags. For example, a host route that is trapping packets can be "promoted" to perform decapsulation following the installation of an IPinIP/VXLAN tunnel. Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags are changed. The aim is to provide an indication to user-space (e.g., routing daemons) about the state of the route in hardware. Introduce a sysctl that controls this behavior. Keep the default value at 0 (i.e., do not emit notifications) for several reasons: - Multiple RTM_NEWROUTE notification per-route might confuse existing routing daemons. - Convergence reasons in routing daemons. - The extra notifications will negatively impact the insertion rate. - Not all users are interested in these notifications. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Acked-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02docs: networking: swap words in icmp_errors_use_inbound_ifaddr docVincent Bernat1-1/+1
Signed-off-by: Vincent Bernat <vincent@bernat.ch> Link: https://lore.kernel.org/r/20210130190518.854806-1-vincent@bernat.ch Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+12
drivers/net/can/dev.c b552766c872f ("can: dev: prevent potential information leak in can_fill_info()") 3e77f70e7345 ("can: dev: move driver related infrastructure into separate subdir") 0a042c6ec991 ("can: dev: move netlink related code into seperate file") Code move. drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 57ac4a31c483 ("net/mlx5e: Correctly handle changing the number of queues when the interface is down") 214baf22870c ("net/mlx5e: Support HTB offload") Adjacent code changes net/switchdev/switchdev.c 20776b465c0c ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP") ffb68fc58e96 ("net: switchdev: remove the transaction structure from port object notifiers") bae33f2b5afe ("net: switchdev: remove the transaction structure from port attributes") Transaction parameter gets dropped otherwise keep the fix. Signed-off-by: Jakub Kicinski <kuba@kernel.org>