path: root/net/ipv4
2023-10-18  netfilter: xt_mangle: only check verdict part of return value  (Florian Westphal; 1 file, -4/+5)
These checks assume that the caller only returns NF_DROP without any errno embedded in the upper bits. This is fine right now, but followup patches will start to propagate such errors to allow kfree_skb_drop_reason() in the called functions; those would then indicate 'errno << 8 | NF_STOLEN'. To avoid breaking things, we have to mask those parts out. Signed-off-by: Florian Westphal <fw@strlen.de>
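A minimal sketch of the masking this implies (the helper name is invented for illustration; NF_VERDICT_MASK and NF_STOLEN are the standard uapi netfilter constants):

    #include <linux/netfilter.h>

    /* Once callees may return (errno << 8) | NF_STOLEN, compare only
     * the verdict byte, never the full return value. */
    static bool verdict_is_stolen(unsigned int ret)
    {
            return (ret & NF_VERDICT_MASK) == NF_STOLEN;
    }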
2023-10-17  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski; 4 files, -9/+13)
Daniel Borkmann says:

====================
pull-request: bpf-next 2023-10-16

We've added 90 non-merge commits during the last 25 day(s) which contain a total of 120 files changed, 3519 insertions(+), 895 deletions(-).

The main changes are:

1) Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs, from Jiri Olsa.

2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is for systemd to reimplement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services, from Daan De Meyer.

3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich.

4) Improve BPF verifier log output for scalar registers to better disambiguate their internal state wrt defaults vs min/max values matching, from Andrii Nakryiko.

5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving the source IP address with a new BPF_FIB_LOOKUP_SRC flag, from Martynas Pumputis.

6) Add support for open-coded task_vma iterator to help with symbolization for BPF-collected user stacks, from Dave Marchevsky.

7) Add libbpf getters for accessing individual BPF ring buffers which is useful for polling them individually, for example, from Martin Kelly.

8) Extend AF_XDP selftests to validate the SHARED_UMEM feature, from Tushar Vyavahare.

9) Improve BPF selftests cross-building support for riscv arch, from Björn Töpel.

10) Add the ability to pin a BPF timer to the same calling CPU, from David Vernet.

11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments, from Alexandre Ghiti.

12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen.

13) Fix bpftool's skeleton code generation to guarantee that ELF data is 8 byte aligned, from Ian Rogers.

14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4 security mitigations in BPF verifier, from Yafang Shao.

15) Annotate struct bpf_stack_map with __counted_by attribute to prepare BPF side for upcoming __counted_by compiler support, from Kees Cook.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits)
  bpf: Ensure proper register state printing for cond jumps
  bpf: Disambiguate SCALAR register state output in verifier logs
  selftests/bpf: Make align selftests more robust
  selftests/bpf: Improve missed_kprobe_recursion test robustness
  selftests/bpf: Improve percpu_alloc test robustness
  selftests/bpf: Add tests for open-coded task_vma iter
  bpf: Introduce task_vma open-coded iterator kfuncs
  selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c
  bpf: Don't explicitly emit BTF for struct btf_iter_num
  bpf: Change syscall_nr type to int in struct syscall_tp_t
  net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set
  bpf: Avoid unnecessary audit log for CPU security mitigations
  selftests/bpf: Add tests for cgroup unix socket address hooks
  selftests/bpf: Make sure mount directory exists
  documentation/bpf: Document cgroup unix socket address hooks
  bpftool: Add support for cgroup unix socket address hooks
  libbpf: Add support for cgroup unix socket address hooks
  bpf: Implement cgroup sockaddr hooks for unix sockets
  bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
  bpf: Propagate modified uaddrlen from cgroup sockaddr programs
  ...
====================

Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17  tcp: Set pingpong threshold via sysctl  (Haiyang Zhang; 3 files, -2/+12)
The TCP pingpong threshold is 1 by default. But some applications, like a SQL DB, may prefer a higher pingpong threshold to activate delayed acks in quick ack mode for better performance. The pingpong threshold and related code were changed to 3 in 2019 in commit 4a41f453bedf ("tcp: change pingpong threshold to 3") and reverted to 1 in 2022 in commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3""). There is no single value that fits all applications. Add a net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for optimal performance based on the application's needs. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/1697056244-21888-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
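A sketch of what the tunable changes, assuming the per-netns field named after the sysctl (the helper body is approximate, not the literal diff):

    /* Tune with: sysctl -w net.ipv4.tcp_pingpong_thresh=3 */
    static inline bool inet_csk_in_pingpong_mode(struct sock *sk)
    {
            /* Threshold now read from the per-netns sysctl instead of
             * the previously hard-coded value of 1. */
            return inet_csk(sk)->icsk_ack.pingpong >=
                   READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_pingpong_thresh);
    }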
2023-10-16  ipv4: use tunnel flow flags for tunnel route lookups  (Beniamino Galvani; 1 file, -0/+1)
Commit 451ef36bd229 ("ip_tunnels: Add new flow flags field to ip_tunnel_key") added a new field to struct ip_tunnel_key to control route lookups. Currently the flag is used by vxlan and geneve tunnels; use it also in udp_tunnel_dst_lookup() so that it affects all tunnel types relying on this function. Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16  ipv4: add new arguments to udp_tunnel_dst_lookup()  (Beniamino Galvani; 1 file, -13/+13)
We want to make the function more generic so that it can be used by other UDP tunnel implementations such as geneve and vxlan. To do that, add the following arguments:

- source and destination UDP port;
- ifindex of the output interface, needed by vxlan;
- the tos, because in some cases it is not taken from struct ip_tunnel_info (for example, when it's inherited from the inner packet);
- the dst cache, because not all tunnel types (e.g. vxlan) want to use the one from struct ip_tunnel_info.

With these parameters, the function no longer needs the full struct ip_tunnel_info as argument and we can pass only the relevant part of it (struct ip_tunnel_key). Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
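A hedged sketch of the resulting prototype (the exact argument order is an assumption; see net/ipv4/udp_tunnel_core.c for the authoritative version):

    struct rtable *udp_tunnel_dst_lookup(struct sk_buff *skb,
                                         struct net_device *dev,
                                         struct net *net, int oif,
                                         __be32 *saddr,
                                         const struct ip_tunnel_key *key,
                                         __be16 sport, __be16 dport, u8 tos,
                                         struct dst_cache *dst_cache);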
2023-10-16  ipv4: remove "proto" argument from udp_tunnel_dst_lookup()  (Beniamino Galvani; 1 file, -2/+2)
The function is now UDP-specific, the protocol is always IPPROTO_UDP. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16  ipv4: rename and move ip_route_output_tunnel()  (Beniamino Galvani; 2 files, -48/+48)
At the moment ip_route_output_tunnel() is used only by bareudp. Ideally, other UDP tunnel implementations should use it, but to do so the function needs to accept new parameters that are specific for UDP tunnels, such as the ports. Prepare for these changes by renaming the function to udp_tunnel_dst_lookup() and move it to file net/ipv4/udp_tunnel_core.c. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-13  net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set  (Martin KaFai Lau; 1 file, -1/+1)
It was reported that there is a compiler warning on the unused variable "sin_addr_len" in af_inet.c when CONFIG_CGROUP_BPF is not set. This patch is to address it similar to the ipv6 counterpart in inet6_getname(). It is to "return sin_addr_len;" instead of "return sizeof(*sin);". Fixes: fefba7d1ae19 ("bpf: Propagate modified uaddrlen from cgroup sockaddr programs") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/bpf/20231013185702.3993710-1-martin.lau@linux.dev Closes: https://lore.kernel.org/bpf/20231013114007.2fb09691@canb.auug.org.au/
2023-10-13  net: fix IPSTATS_MIB_OUTFORWDATAGRAMS increment after fragment check  (Heng Guo; 1 file, -2/+2)
Reproduce environment: a network of 3 Linux VMs connected as below:

VM1 <----> VM2 (latest kernel 6.5.0-rc7) <----> VM3
VM1: eth0 ip: 192.168.122.207, MTU 1800
VM2: eth0 ip: 192.168.122.208, eth1 ip: 192.168.123.224, MTU 1500
VM3: eth0 ip: 192.168.123.240, MTU 1800

Reproduce: VM1 sends 1600 bytes of UDP data to VM3 using scapy with flags='DF'. scapy command:

send(IP(dst="192.168.123.240",flags='DF')/UDP()/str('0'*1600),count=1, inter=1.000000)

Result: before the IP data is sent:
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 64 6 0 2 2 0 0 2 4 0 0 0 0 0 0 0 0 0
......
----------------------------------------------------------------------
After the IP data is sent:
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 64 7 0 2 2 0 0 2 5 0 0 0 0 0 0 0 1 0
......
----------------------------------------------------------------------
ForwDatagrams stays at 2 without incrementing.

Issue description and patch: ip_exceeds_mtu() in ip_forward() drops this IP datagram because the skb len (1600, sent by scapy) exceeds the MTU (1500 in VM2) and "DF" is set. According to RFC 4293, Section 3.2.3 "IP Statistics Tables" (whose flow diagram routes forwarded packets through InForwDatagrams and OutForwDatagrams before OutFragReqds/OutFragFails), the IPSTATS_MIB_OUTFORWDATAGRAMS counter should be counted before the fragment check. The existing implementation instead increases the counter after the fragment check: ip_exceeds_mtu() in ipv4 and ip6_pkt_too_big() in ipv6. So move the IPSTATS_MIB_OUTFORWDATAGRAMS counter to ip_forward() for ipv4 and ip6_forward() for ipv6.

Test result with the patch: before the IP data is sent:
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 64 6 0 2 2 0 0 2 4 0 0 0 0 0 0 0 0 0
......
----------------------------------------------------------------------
After the IP data is sent:
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 64 7 0 2 3 0 0 2 5 0 0 0 0 0 0 0 1 0
......
----------------------------------------------------------------------
ForwDatagrams is updated from 2 to 3.

Reviewed-by: Filip Pudak <filip.pudak@windriver.com>
Signed-off-by: Heng Guo <heng.guo@windriver.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231011015137.27262-1-heng.guo@windriver.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
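A simplified sketch of the reordering in ip_forward() (kernel context; the wrapper function is invented for illustration and the real function carries more checks):

    static int ip_forward_mtu_check(struct net *net, struct sk_buff *skb,
                                    unsigned int mtu)
    {
            /* Count the datagram as forwarded first ... */
            __IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);

            /* ... and only then apply the DF/MTU check, so datagrams
             * dropped here still appear in ForwDatagrams, per RFC 4293. */
            if (ip_exceeds_mtu(skb, mtu)) {
                    __IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
                    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                              htonl(mtu));
                    return -EMSGSIZE;
            }
            return 0;
    }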
2023-10-13  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski; 1 file, -0/+1)
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: kernel/bpf/verifier.c 829955981c55 ("bpf: Fix verifier log for async callback return values") a923819fb2c5 ("bpf: Treat first argument as return value for bpf_throw") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-12  net: tcp: fix crashes trying to free half-baked MTU probes  (Jakub Kicinski; 1 file, -0/+1)
tcp_stream_alloc_skb() initializes the skb to use tcp_tsorted_anchor which is a union with the destructor. We need to clean that TCP-iness up before freeing. Fixes: 736013292e3c ("tcp: let tcp_mtu_probe() build headless packets") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231010173651.3990234-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-12  bpf: Propagate modified uaddrlen from cgroup sockaddr programs  (Daan De Meyer; 4 files, -8/+12)
As prep for adding unix socket support to the cgroup sockaddr hooks, let's propagate the sockaddr length back to the caller after running a bpf cgroup sockaddr hook program. While not important for AF_INET or AF_INET6, the sockaddr length is important when working with AF_UNIX sockaddrs as the size of the sockaddr cannot be determined just from the address family or the sockaddr's contents. __cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as an input/output argument. After running the program, the modified sockaddr length is stored in the uaddrlen pointer. Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com> Link: https://lore.kernel.org/r/20231011185113.140426-3-daan.j.demeyer@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-10  tcp: change data receiver flowlabel after one dup  (David Morley; 1 file, -3/+11)
This commit changes the data receiver repath behavior to occur after receiving a single duplicate. This can help recover ACK connectivity quicker if a TLP was sent along a nonworking path. For instance, consider the case where we have an initially nonworking forward path and reverse path and subsequently switch to only working forward paths. Before this patch we would have the following behavior:

+---------+--------+--------+----------+----------+----------+
| Event   | For FL | Rev FL | FP Works | RP Works | Data Del |
+---------+--------+--------+----------+----------+----------+
| Initial | A      | 1      | N        | N        | 0        |
+---------+--------+--------+----------+----------+----------+
| TLP     | A      | 1      | N        | N        | 0        |
+---------+--------+--------+----------+----------+----------+
| RTO 1   | B      | 1      | Y        | N        | 1        |
+---------+--------+--------+----------+----------+----------+
| RTO 2   | C      | 1      | Y        | N        | 2        |
+---------+--------+--------+----------+----------+----------+
| RTO 3   | D      | 2      | Y        | Y        | 3        |
+---------+--------+--------+----------+----------+----------+

This patch gets rid of at least RTO 3, avoiding additional unnecessary repaths of a working forward path to a (potentially) nonworking one. In addition, this commit changes the behavior to avoid repathing upon rx of duplicate data if the local endpoint is in CA_Loss (in which case the RTOs will already be changing the outgoing flowlabel). Signed-off-by: David Morley <morleyd@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Tested-by: David Morley <morleyd@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-10  tcp: record last received ipv6 flowlabel  (David Morley; 3 files, -3/+18)
In order to better estimate whether a data packet has been retransmitted or is the result of a TLP, we save the last received ipv6 flowlabel. To make space for this field we resize the "ato" field in inet_connection_sock as the current value of TCP_DELACK_MAX can be fully contained in 8 bits and add a compile_time_assert ensuring this field is the required size. v2: addressed kernel bot feedback about dccp_delack_timer() v3: addressed build error introduced by commit bbf80d713fe7 ("tcp: derive delack_max from rto_min") Signed-off-by: David Morley <morleyd@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Tested-by: David Morley <morleyd@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-05  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski; 7 files, -13/+28)
Cross-merge networking fixes after downstream PR. No conflicts (or adjacent changes of note). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-05  tcp: fix delayed ACKs for MSS boundary condition  (Neal Cardwell; 1 file, -0/+13)
This commit fixes poor delayed ACK behavior that can cause poor TCP latency in a particular boundary condition: when an application makes a TCP socket write that is an exact multiple of the MSS size.

The problem is that there is a painful boundary discontinuity in the current delayed ACK behavior. With the current delayed ACK behavior, we have:

(1) If an app reads data when > 1*MSS is unacknowledged, then tcp_cleanup_rbuf() ACKs immediately because of:

    tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||

(2) If an app reads all received data, and the packets were < 1*MSS, and either (a) the app is not ping-pong or (b) we received two packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately because of:

    ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
     ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
      !inet_csk_in_pingpong_mode(sk))) &&

(3) *However*: if an app reads exactly 1*MSS of data, tcp_cleanup_rbuf() does not send an immediate ACK. This is true even if the app is not ping-pong and the 1*MSS of data had the PSH bit set, suggesting the sending application completed an application write.

Thus, if the app is not ping-pong, we have this painful case where >1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a write whose last skb is an exact multiple of 1*MSS can get a 40ms delayed ACK. This means that any app that transfers data in one direction and takes care to align write size or packet size with MSS can suffer this problem. With receive zero copy making 4KB MSS values more common, it is becoming more common to have application writes naturally align with MSS, and more applications are likely to encounter this delayed ACK problem.

The fix in this commit is to refine the delayed ACK heuristics with a simple check: immediately ACK a received 1*MSS skb with the PSH bit set if the app reads all data. Why? If an skb has a len of exactly 1*MSS and has the PSH bit set, then it is likely the end of an application write. So more data may not be arriving soon, and yet the data sender may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send an ACK immediately if the app reads all of the data and is not ping-pong. Note that this logic is also executed for the case where len > MSS, but in that case this logic does not matter (and does not hurt) because tcp_cleanup_rbuf() will always ACK immediately if the app reads data and there is more than an MSS of unACKed data.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Xin Guo <guoxin0309@gmail.com> Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
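A hedged sketch of the added heuristic (the real check sits in the receive path around tcp_event_data_recv(); the wrapper here is illustrative only):

    static void tcp_note_pushed_mss_skb(struct sock *sk, struct sk_buff *skb)
    {
            struct inet_connection_sock *icsk = inet_csk(sk);

            /* 1*MSS with PSH set is likely the tail of an application
             * write: arm ICSK_ACK_PUSHED so tcp_cleanup_rbuf() ACKs
             * immediately once the app has read all the data. */
            if (skb->len >= icsk->icsk_ack.rcv_mss &&
                (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_PSH))
                    icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
    }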
2023-10-05  tcp: fix quick-ack counting to count actual ACKs of new data  (Neal Cardwell; 1 file, -4/+3)
This commit fixes quick-ack counting so that it only considers that a quick-ack has been provided if we are sending an ACK that newly acknowledges data. The code was erroneously using the number of data segments in outgoing skbs when deciding how many quick-ack credits to remove. This logic does not make sense, and could cause poor performance in request-response workloads, like RPC traffic, where requests or responses can be multi-segment skbs. When a TCP connection decides to send N quick-acks, that is to accelerate the cwnd growth of the congestion control module controlling the remote endpoint of the TCP connection. That quick-ack decision is purely about the incoming data and outgoing ACKs. It has nothing to do with the outgoing data or the size of outgoing data. And in particular, an ACK only serves the intended purpose of allowing the remote congestion control to grow the congestion window quickly if the ACK is ACKing or SACKing new data. The fix is simple: only count packets as serving the goal of the quickack mechanism if they are ACKing/SACKing new data. We can tell whether this is the case by checking inet_csk_ack_scheduled(), since we schedule an ACK exactly when we are ACKing/SACKing new data. Fixes: fc6415bcb0f5 ("[TCP]: Fix quick-ack decrementing with TSO.") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
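Roughly, the fix makes the decrement conditional on an ACK actually being scheduled (a sketch; tcp_dec_quickack_mode() is the static helper in tcp_output.c, which per the commit no longer takes a segment count):

    static void tcp_event_ack_sent_sketch(struct sock *sk)
    {
            /* Spend a quick-ack credit only when the outgoing ACK really
             * ACKs/SACKs new data; inet_csk_ack_scheduled() is true
             * exactly in that case. */
            if (inet_csk_ack_scheduled(sk))
                    tcp_dec_quickack_mode(sk);
    }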
2023-10-04  ipv4: Set offload_failed flag in fibmatch results  (Benjamin Poirier; 1 file, -0/+2)
Due to a small omission, the offload_failed flag is missing from ipv4 fibmatch results. Make sure it is set correctly.

The issue can be witnessed using the following commands:

echo "1 1" > /sys/bus/netdevsim/new_device
ip link add dummy1 up type dummy
ip route add 192.0.2.0/24 dev dummy1
echo 1 > /sys/kernel/debug/netdevsim/netdevsim1/fib/fail_route_offload
ip route add 198.51.100.0/24 dev dummy1
ip route
# 192.168.15.0/24 has rt_trap
# 198.51.100.0/24 has rt_offload_failed
ip route get 192.168.15.1 fibmatch
# Result has rt_trap
ip route get 198.51.100.1 fibmatch
# Result differs from the route shown by `ip route`, it is missing
# rt_offload_failed
ip link del dev dummy1
echo 1 > /sys/bus/netdevsim/del_device

Fixes: 36c5100e859d ("IPv4: Add "offload failed" indication to routes") Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230926182730.231208-1-bpoirier@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Jakub Kicinski; 2 files, -9/+5)
Daniel Borkmann says:

====================
pull-request: bpf 2023-10-02

We've added 11 non-merge commits during the last 12 day(s) which contain a total of 12 files changed, 176 insertions(+), 41 deletions(-).

The main changes are:

1) Fix BPF verifier to reset backtrack_state masks on global function exit as otherwise subsequent precision tracking would reuse them, from Andrii Nakryiko.

2) Several sockmap fixes for available bytes accounting, from John Fastabend.

3) Reject sk_msg egress redirects to non-TCP sockets given this is only supported for TCP sockets today, from Jakub Sitnicki.

4) Fix a syzkaller splat in bpf_mprog when hitting maximum program limits with BPF_F_BEFORE directive, from Daniel Borkmann and Nikolay Aleksandrov.

5) Fix BPF memory allocator to use kmalloc_size_roundup() to adjust size_index for selecting a bpf_mem_cache, from Hou Tao.

6) Fix arch_prepare_bpf_trampoline return code for s390 JIT, from Song Liu.

7) Fix bpf_trampoline_get when CONFIG_BPF_JIT is turned off, from Leon Hwang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Use kmalloc_size_roundup() to adjust size_index
  selftest/bpf: Add various selftests for program limits
  bpf, mprog: Fix maximum program check on mprog attachment
  bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
  bpf, sockmap: Add tests for MSG_F_PEEK
  bpf, sockmap: Do not inc copied_seq when PEEK flag set
  bpf: tcp_read_skb needs to pop skb regardless of seq
  bpf: unconditionally reset backtrack_state masks on global func exit
  bpf: Fix tr dereferencing
  selftests/bpf: Check bpf_cubic_acked() is called via struct_ops
  s390/bpf: Let arch_prepare_bpf_trampoline return program size
====================

Link: https://lore.kernel.org/r/20231002113417.2309-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-03  udp_tunnel: Use flex array to simplify code  (Christophe JAILLET; 1 file, -9/+2)
'n_tables' is small, with UDP_TUNNEL_NIC_MAX_TABLES = 4 as the maximum. So there is no real point to allocate the 'entries' pointer array with a dedicated memory allocation. Using a flexible array for struct udp_tunnel_nic->entries avoids the overhead of an additional memory allocation. This also saves an indirection when the array is accessed. Finally, __counted_by() can be used for run-time bounds checking if configured and supported by the compiler. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/4a096ba9cf981a588aa87235bb91e933ee162b3d.1695542544.git.christophe.jaillet@wanadoo.fr Signed-off-by: Paolo Abeni <pabeni@redhat.com>
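A sketch of the struct change (other members of struct udp_tunnel_nic are omitted, and the struct names here are invented to show before/after):

    /* before: entry-pointer array in its own allocation */
    struct udp_tunnel_nic_before {
            unsigned int n_tables;
            struct udp_tunnel_nic_table_entry **entries;
    };

    /* after: one allocation, plus run-time bounds checking where
     * __counted_by() is supported by the compiler */
    struct udp_tunnel_nic_after {
            unsigned int n_tables;
            struct udp_tunnel_nic_table_entry *entries[] __counted_by(n_tables);
    };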
2023-10-03  tcp_metrics: optimize tcp_metrics_flush_all()  (Eric Dumazet; 1 file, -2/+5)
This is inspired by several syzbot reports where tcp_metrics_flush_all() was seen in the traces. We can avoid acquiring tcp_metrics_lock for empty buckets, and we should add one cond_resched() to break potential long loops. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
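A sketch of the resulting loop shape (bucket and lock names follow net/ipv4/tcp_metrics.c; the per-bucket walk body is elided and the wrapper name is invented):

    static void tcp_metrics_flush_all_sketch(struct net *net)
    {
            unsigned int row;

            for (row = 0; row < (1U << tcp_metrics_hash_log); row++) {
                    struct tcpm_hash_bucket *hb = tcp_metrics_hash + row;

                    if (!rcu_access_pointer(hb->chain))
                            continue;       /* empty bucket: no lock needed */
                    spin_lock_bh(&tcp_metrics_lock);
                    /* unlink entries belonging to this netns here */
                    spin_unlock_bh(&tcp_metrics_lock);
                    cond_resched();         /* break up long flushes */
            }
    }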
2023-10-03  tcp_metrics: do not create an entry from tcp_init_metrics()  (Eric Dumazet; 1 file, -1/+1)
tcp_init_metrics() only wants to get metrics if they were previously stored in the cache. Creating an entry is adding useless costs, especially when tcp_no_metrics_save is set. Fixes: 51c5d0c4b169 ("tcp: Maintain dynamic metrics in local cache.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-03  tcp_metrics: properly set tp->snd_ssthresh in tcp_init_metrics()  (Eric Dumazet; 1 file, -5/+4)
We need to set tp->snd_ssthresh to TCP_INFINITE_SSTHRESH in the case tcp_get_metrics() fails for some reason. Fixes: 9ad7c049f0f7 ("tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-03  tcp_metrics: add missing barriers on delete  (Eric Dumazet; 1 file, -2/+2)
When removing an item from an RCU-protected list, we must prevent store-tearing, using rcu_assign_pointer() or WRITE_ONCE(). Fixes: 04f721c671656 ("tcp_metrics: Rewrite tcp_metrics_flush_all") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
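In sketch form (pp points at the chain pointer currently referencing tm; the helper is invented for illustration):

    static void tcpm_unlink_sketch(struct tcp_metrics_block __rcu **pp,
                                   struct tcp_metrics_block *tm)
    {
            /* Lockless RCU readers traverse this chain, so the unlink
             * store must not be torn by the compiler: a plain
             * '*pp = tm->tcpm_next;' becomes WRITE_ONCE(). */
            WRITE_ONCE(*pp, tm->tcpm_next);
    }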
2023-10-03  ipv4/fib: send notify when deleting source address routes  (Hangbin Liu; 2 files, -0/+5)
After deleting an interface address in fib_del_ifaddr(), the function scans the fib_info list for stray entries and calls fib_flush() and fib_table_flush(). Then the stray entries will be deleted silently and no RTM_DELROUTE notification will be sent. This lack of notification can make routing daemons, or monitors like `ip monitor route`, miss the routing changes. e.g.:

+ ip link add dummy1 type dummy
+ ip link add dummy2 type dummy
+ ip link set dummy1 up
+ ip link set dummy2 up
+ ip addr add 192.168.5.5/24 dev dummy1
+ ip route add 7.7.7.0/24 dev dummy2 src 192.168.5.5
+ ip -4 route
7.7.7.0/24 dev dummy2 scope link src 192.168.5.5
192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
+ ip monitor route
+ ip addr del 192.168.5.5/24 dev dummy1
Deleted 192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
Deleted broadcast 192.168.5.255 dev dummy1 table local proto kernel scope link src 192.168.5.5
Deleted local 192.168.5.5 dev dummy1 table local proto kernel scope host src 192.168.5.5

As Ido reminded, fib_table_flush() isn't only called when an address is deleted, but also when an interface is deleted or put down. The lack of notification in these cases is deliberate. And commit 7c6bb7d2faaf ("net/ipv6: Add knob to skip DELROUTE message on device down") introduced a sysctl to make IPv6 behave like IPv4 in this regard. So we can't send the route delete notify blindly in fib_table_flush().

To fix this issue, let's add a new flag in "struct fib_info" to track the deleted preferred source address routes, and only send notifications for them.

After the update:

+ ip monitor route
+ ip addr del 192.168.5.5/24 dev dummy1
Deleted 192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
Deleted broadcast 192.168.5.255 dev dummy1 table local proto kernel scope link src 192.168.5.5
Deleted local 192.168.5.5 dev dummy1 table local proto kernel scope host src 192.168.5.5
Deleted 7.7.7.0/24 dev dummy2 scope link src 192.168.5.5

Suggested-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230922075508.848925-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-01  inet: implement lockless getsockopt(IP_MULTICAST_IF)  (Eric Dumazet; 5 files, -21/+20)
Add missing annotations to inet->mc_index and inet->mc_addr to fix data-races. getsockopt(IP_MULTICAST_IF) can be lockless. setsockopt() side is left for later. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01  inet: lockless IP_PKTOPTIONS implementation  (Eric Dumazet; 1 file, -39/+37)
The current implementation is already lockless, because the socket lock is released before reading socket fields. Add missing READ_ONCE() annotations. Note that corresponding WRITE_ONCE() annotations are needed; the order of the patches does not really matter. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01  inet: implement lockless getsockopt(IP_UNICAST_IF)  (Eric Dumazet; 5 files, -18/+21)
Add missing READ_ONCE() annotations when reading inet->uc_index. Implementing getsockopt(IP_UNICAST_IF) locklessly seems possible; the setsockopt() part might not be possible at the moment. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01  inet: lockless getsockopt(IP_MTU)  (Eric Dumazet; 1 file, -11/+9)
sk_dst_get() does not require socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01  inet: lockless getsockopt(IP_OPTIONS)  (Eric Dumazet; 1 file, -10/+10)
inet->inet_opt being RCU protected, we can use RCU instead of locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01  inet: implement lockless IP_TOS  (Eric Dumazet; 4 files, -24/+20)
Some reads of inet->tos are racy. Add the needed READ_ONCE() annotations and convert the IP_TOS option to be lockless. v2: missing changes in include/net/route.h (David Ahern) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01  inet: implement lockless IP_MTU_DISCOVER  (Eric Dumazet; 5 files, -17/+13)
inet->pmtudisc can be read locklessly. Implement proper lockless reads and writes to inet->pmtudisc. ip_sock_set_mtu_discover() can now be called from arbitrary contexts. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01  inet: implement lockless IP_MULTICAST_TTL  (Eric Dumazet; 2 files, -16/+17)
inet->mc_ttl can be read locklessly. Implement proper lockless reads and writes to inet->mc_ttl. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01  net: annotate data-races around sk->sk_dst_pending_confirm  (Eric Dumazet; 1 file, -1/+1)
This field can be read or written without socket lock being held. Add annotations to avoid load-store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01  net: implement lockless SO_MAX_PACING_RATE  (Eric Dumazet; 3 files, -12/+14)
SO_MAX_PACING_RATE setsockopt() does not need to hold the socket lock, because sk->sk_pacing_rate readers can run fine if the value is changed by other threads, after adding READ_ONCE() accessors. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01  net: implement lockless SO_PRIORITY  (Eric Dumazet; 4 files, -4/+4)
This is a followup of 8bf43be799d4 ("net: annotate data-races around sk->sk_priority"). sk->sk_priority can be read and written without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01  tcp: derive delack_max from rto_min  (Eric Dumazet; 2 files, -2/+17)
While BPF allows to set icsk->icsk_delack_max and/or icsk->icsk_rto_min, we have an ip route attribute (RTAX_RTO_MIN) to be able to tune rto_min, but nothing to consequently adjust max delayed ack, which varies from 40 ms to 200 ms (TCP_DELACK_{MIN|MAX}). This makes RTAX_RTO_MIN of almost no practical use, unless customers are in big trouble.

Modern-day datacenter communications want to set rto_min to ~5 ms, and the max delayed ack one jiffy smaller, to avoid spurious retransmits. After this patch, an "rto_min 5" route attribute will effectively lower max delayed ack timers to 4 ms.

Note in the following ss output, "rto:6 ... ato:4":

$ ss -temoi dst XXXXXX
State Recv-Q Send-Q  Local Address:Port           Peer Address:Port   Process
ESTAB 0      0       [2002:a05:6608:295::]:52950  [2002:a05:6608:297::]:41597  ino:255134 sk:1001 <->
  skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500 rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121 bytes_received:54823120 segs_out:1370582 segs_in:1370580 data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579 busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920 rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536

While we could argue this patch fixes a bug with RTAX_RTO_MIN, I do not add a Fixes: tag, so that we can soak it a bit before asking backports to stable branches. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
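A sketch of the derivation (the helper name follows the commit, but the body is approximate; the real code reads RTAX_RTO_MIN from the destination metrics):

    static u32 tcp_delack_max_sketch(struct sock *sk)
    {
            /* Keep the max delayed-ACK timer one jiffy below the
             * effective rto_min so a delayed ACK cannot trigger a
             * spurious retransmit. */
            u32 delack_from_rto_min = max_t(u32, tcp_rto_min(sk), 2) - 1;

            return min_t(u32, inet_csk(sk)->icsk_delack_max,
                         delack_from_rto_min);
    }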
2023-09-29  bpf, sockmap: Do not inc copied_seq when PEEK flag set  (John Fastabend; 1 file, -1/+3)
When data is peek'd off the receive queue, we shouldn't consider it copied from the tcp_sock side. When we increment copied_seq, this confuses tcp_data_ready() because copied_seq can be arbitrarily increased. From the application side it results in poll() operations not waking up when expected. Notice the TCP stack without BPF recvmsg programs also does not increment copied_seq on peek. We broke this when we moved copied_seq into recvmsg to only update when an actual copy was happening. But it wasn't working correctly even before that, because tcp_data_ready() tried to use the copied_seq value to see if data was read by the user yet. See fixes tags. Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq") Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230926035300.135096-3-john.fastabend@gmail.com
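The shape of the fix, as a sketch (the wrapper function is invented; the real change sits in the sockmap recvmsg path):

    static void sk_msg_advance_copied_seq(struct tcp_sock *tp, int copied,
                                          int flags)
    {
            /* A MSG_PEEK read leaves the data queued, so it must not
             * advance copied_seq; otherwise tcp_data_ready() and poll()
             * misjudge how much the application has consumed. */
            if (likely(!(flags & MSG_PEEK)))
                    WRITE_ONCE(tp->copied_seq, tp->copied_seq + copied);
    }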
2023-09-29  bpf: tcp_read_skb needs to pop skb regardless of seq  (John Fastabend; 1 file, -8/+2)
Before fix e5c6de5fa0258, tcp_read_skb() would increment the tp->copied_seq value. This (as described in that commit) would cause an error for apps, because once copied_seq is incremented the application might believe there is no data to be read. Then some apps would stall or abort, believing no data is available.

However, the fix is incomplete because it introduces another issue in the skb dequeue. The loop does tcp_recv_skb() in a while loop to consume as many skbs as possible. The problem is the call is:

    tcp_recv_skb(sk, seq, &offset)

where 'seq' is:

    u32 seq = tp->copied_seq;

Now we can hit a case where we have not yet incremented copied_seq from the BPF side, but then tcp_recv_skb() fails this test:

    if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))

so that instead of returning the skb we call tcp_eat_recv_skb(), which frees the skb. This is because the routine believes the SKB has been collapsed, per the comment:

    /* This looks weird, but this can happen if TCP collapsing
     * splitted a fat GRO packet, while we released socket lock
     * in skb_splice_bits()
     */

That can't happen here: we've unlinked the full SKB and orphaned it. Anyway, it would confuse any BPF programs if the data were suddenly moved underneath them.

To fix this situation, do a simpler operation: just skb_peek() the head of the queue, followed by the unlink. There is no need to check that condition, and tcp_read_skb() reads entire skbs, so there is no need to handle the 'offset != 0' case as we would see in tcp_read_sock().

Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq") Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230926035300.135096-2-john.fastabend@gmail.com
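A simplified sketch of the new dequeue (error handling and the BPF actor plumbing are elided; the wrapper name is invented):

    static int tcp_read_skb_sketch(struct sock *sk,
                                   int (*recv_actor)(struct sock *sk,
                                                     struct sk_buff *skb))
    {
            struct sk_buff *skb;
            int copied = 0;

            /* Pop the head of the queue unconditionally: tcp_read_skb()
             * consumes whole skbs, so no seq/offset re-lookup is needed
             * and a lagging copied_seq can no longer get the skb freed
             * as if it had been collapsed. */
            while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
                    __skb_unlink(skb, &sk->sk_receive_queue);
                    copied += recv_actor(sk, skb);
                    consume_skb(skb);
            }
            return copied;
    }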
2023-09-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Paolo Abeni; 1 file, -1/+3)
Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-17  ipv4: fix null-deref in ipv4_link_failure  (Kyle Zeng; 1 file, -1/+3)
Currently, we assume the skb is associated with a device before calling __ip_options_compile(), which is not always the case if it is re-routed by ipvs. When skb->dev is NULL, dev_net(skb->dev) becomes a NULL-pointer dereference. This patch adds a check for this edge case and switches to using the net_device from the rtable when skb->dev is NULL. Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Suggested-by: David Ahern <dsahern@kernel.org> Signed-off-by: Kyle Zeng <zengyhkyle@gmail.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Cc: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
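The fallback, in sketch form (the helper is hypothetical; skb_rtable() and dev_net() are the standard kernel accessors):

    static struct net *ipv4_skb_net(const struct sk_buff *skb)
    {
            /* ipvs re-routing can hand us an skb with no device
             * attached; fall back to the device cached in the route. */
            if (skb->dev)
                    return dev_net(skb->dev);
            return dev_net(skb_rtable(skb)->dst.dev);
    }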
2023-09-16  tcp: new TCP_INFO stats for RTO events  (Aananth V; 4 files, -2/+43)
The 2023 SIGCOMM paper "Improving Network Availability with Protective ReRoute" has indicated Linux TCP's RTO-triggered txhash rehashing can effectively reduce application disruption during outages. To better measure the efficacy of this feature, this patch adds three more detailed stats during RTO recovery and exports them via TCP_INFO. Applications and monitoring systems can leverage this data to measure the network path diversity and end-to-end repair latency during network outages and improve their network infrastructure.

The following counters are added to tcp_sock in order to track RTO events over the lifetime of a TCP socket:

1. u16 total_rto - counts the total number of RTO timeouts.
2. u16 total_rto_recoveries - counts the total number of RTO recoveries.
3. u32 total_rto_time - counts the total time spent (ms) in RTO recoveries (time spent in the CA_Loss and CA_Recovery states).

To compute total_rto_time, we add a new u32 rto_stamp field to tcp_sock. rto_stamp records the start timestamp (ms) of the last RTO recovery (CA_Loss). Corresponding fields are also added to the tcp_info struct. Signed-off-by: Aananth V <aananthv@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
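A userspace sketch of reading the new counters (it assumes a uapi linux/tcp.h recent enough to carry the tcpi_total_rto* fields):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>   /* IPPROTO_TCP */
    #include <linux/tcp.h>    /* struct tcp_info, TCP_INFO */

    static void print_rto_stats(int fd)
    {
            struct tcp_info ti;
            socklen_t len = sizeof(ti);

            memset(&ti, 0, sizeof(ti));
            if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
                    printf("total_rto=%u recoveries=%u time_ms=%u\n",
                           ti.tcpi_total_rto, ti.tcpi_total_rto_recoveries,
                           ti.tcpi_total_rto_time);
    }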
2023-09-16  tcp: call tcp_try_undo_recovery when an RTOd TFO SYNACK is ACKed  (Aananth V; 1 file, -4/+5)
For passive TCP Fast Open sockets that had SYN/ACK timeout and did not send more data in SYN_RECV, upon receiving the final ACK in 3WHS, the congestion state may awkwardly stay in CA_Loss mode unless the CA state was undone due to TCP timestamp checks. However, if tcp_rcv_synrecv_state_fastopen() decides not to undo, then we should enter CA_Open, because at that point we have received an ACK covering the retransmitted SYNACKs. Currently, the icsk_ca_state is only set to CA_Open after we receive an ACK for a data packet. This is because tcp_ack() does not call tcp_fastretrans_alert() (and tcp_process_loss()) if !prior_packets. Note that tcp_process_loss() calls tcp_try_undo_recovery(), so having tcp_rcv_synrecv_state_fastopen() decide that if we're in CA_Loss we should call tcp_try_undo_recovery() is consistent with that, and low risk. Fixes: dad8cea7add9 ("tcp: fix TFO SYNACK undo to avoid double-timestamp-undo") Signed-off-by: Aananth V <aananthv@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-15  tcp: indent an if statement  (Dan Carpenter; 1 file, -1/+1)
Indent this if statement one tab. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-15  ipv6: lockless IPV6_FLOWINFO_SEND implementation  (Eric Dumazet; 1 file, -2/+1)
np->sndflow reads are racy. Use one bit from the atomic inet->inet_flags instead; the IPV6_FLOWINFO_SEND setsockopt() can then be lockless. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-15  ipv6: lockless IPV6_RECVERR implementation  (Eric Dumazet; 1 file, -1/+1)
np->recverr is moved to inet->inet_flags to fix data-races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-14  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Paolo Abeni; 1 file, -16/+20)
Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-14  ipv4: igmp: Remove redundant comparison in igmp_mcf_get_next()  (Gavrilov Ilia; 1 file, -2/+0)
The 'state->im' value will always be non-zero after the 'while' statement, so the check can be removed. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230912084039.1501984-1-Ilia.Gavrilov@infotecs.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-14  udplite: fix various data-races  (Eric Dumazet; 1 file, -10/+11)
udp->pcflag, udp->pcslen and udp->pcrlen reads/writes are racy. Move udp->pcflag to udp->udp_flags for atomicity, and add READ_ONCE()/WRITE_ONCE() annotations for pcslen and pcrlen. Fixes: ba4e58eca8aa ("[NET]: Supporting UDP-Lite (RFC 3828) in Linux") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-14  udplite: remove UDPLITE_BIT  (Eric Dumazet; 1 file, -1/+0)
This flag is set but never read, we can remove it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>