summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)AuthorFilesLines
2024-05-23Merge tag 'net-6.10-rc1' of ↵Linus Torvalds3-3/+16
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Quite smaller than usual. Notably it includes the fix for the unix regression from the past weeks. The TCP window fix will require some follow-up, already queued. Current release - regressions: - af_unix: fix garbage collection of embryos Previous releases - regressions: - af_unix: fix race between GC and receive path - ipv6: sr: fix missing sk_buff release in seg6_input_core - tcp: remove 64 KByte limit for initial tp->rcv_wnd value - eth: r8169: fix rx hangup - eth: lan966x: remove ptp traps in case the ptp is not enabled - eth: ixgbe: fix link breakage vs cisco switches - eth: ice: prevent ethtool from corrupting the channels Previous releases - always broken: - openvswitch: set the skbuff pkt_type for proper pmtud support - tcp: Fix shift-out-of-bounds in dctcp_update_alpha() Misc: - a bunch of selftests stabilization patches" * tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits) r8169: Fix possible ring buffer corruption on fragmented Tx packets. idpf: Interpret .set_channels() input differently ice: Interpret .set_channels() input differently nfc: nci: Fix handling of zero-length payload packets in nci_rx_work() net: relax socket state check at accept time. tcp: remove 64 KByte limit for initial tp->rcv_wnd value net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe() tls: fix missing memory barrier in tls_init net: fec: avoid lock evasion when reading pps_enable Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI" testing: net-drv: use stats64 for testing net: mana: Fix the extra HZ in mana_hwc_send_request net: lan966x: Remove ptp traps in case the ptp is not enabled. openvswitch: Set the skbuff pkt_type for proper pmtud support. selftest: af_unix: Make SCM_RIGHTS into OOB data. af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). selftests/net: use tc rule to filter the na packet ipv6: sr: fix memleak in seg6_hmac_init_algo af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock. ...
2024-05-23net: relax socket state check at accept time.Paolo Abeni1-1/+3
Christoph reported the following splat: WARNING: CPU: 1 PID: 772 at net/ipv4/af_inet.c:761 __inet_accept+0x1f4/0x4a0 Modules linked in: CPU: 1 PID: 772 Comm: syz-executor510 Not tainted 6.9.0-rc7-g7da7119fe22b #56 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:__inet_accept+0x1f4/0x4a0 net/ipv4/af_inet.c:759 Code: 04 38 84 c0 0f 85 87 00 00 00 41 c7 04 24 03 00 00 00 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 ec b7 da fd <0f> 0b e9 7f fe ff ff e8 e0 b7 da fd 0f 0b e9 fe fe ff ff 89 d9 80 RSP: 0018:ffffc90000c2fc58 EFLAGS: 00010293 RAX: ffffffff836bdd14 RBX: 0000000000000000 RCX: ffff888104668000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: dffffc0000000000 R08: ffffffff836bdb89 R09: fffff52000185f64 R10: dffffc0000000000 R11: fffff52000185f64 R12: dffffc0000000000 R13: 1ffff92000185f98 R14: ffff88810754d880 R15: ffff8881007b7800 FS: 000000001c772880(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb9fcf2e178 CR3: 00000001045d2002 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> inet_accept+0x138/0x1d0 net/ipv4/af_inet.c:786 do_accept+0x435/0x620 net/socket.c:1929 __sys_accept4_file net/socket.c:1969 [inline] __sys_accept4+0x9b/0x110 net/socket.c:1999 __do_sys_accept net/socket.c:2016 [inline] __se_sys_accept net/socket.c:2013 [inline] __x64_sys_accept+0x7d/0x90 net/socket.c:2013 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x58/0x100 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x4315f9 Code: fd ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 ab b4 fd ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffdb26d9c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002b RAX: ffffffffffffffda RBX: 0000000000400300 RCX: 00000000004315f9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004 RBP: 00000000006e1018 R08: 0000000000400300 R09: 0000000000400300 R10: 0000000000400300 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000040cdf0 R14: 000000000040ce80 R15: 0000000000000055 </TASK> The reproducer invokes shutdown() before entering the listener status. After commit 94062790aedb ("tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets"), the above causes the child to reach the accept syscall in FIN_WAIT1 status. Eric noted we can relax the existing assertion in __inet_accept() Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/490 Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: 94062790aedb ("tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/23ab880a44d8cfd967e84de8b93dbf48848e3d8c.1716299669.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-23tcp: remove 64 KByte limit for initial tp->rcv_wnd valueJason Xing1-1/+1
Recently, we had some servers upgraded to the latest kernel and noticed the indicator from the user side showed worse results than before. It is caused by the limitation of tp->rcv_wnd. In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most CDN teams would not benefit from this change because they cannot have a large window to receive a big packet, which will be slowed down especially in long RTT. Small rcv_wnd means slow transfer speed, to some extent. It's the side effect for the latency/time-sensitive users. To avoid future confusion, current change doesn't affect the initial receive window on the wire in a SYN or SYN+ACK packet which are set within 65535 bytes according to RFC 7323 also due to the limit in __tcp_transmit_skb(): th->window = htons(min(tp->rcv_wnd, 65535U)); In one word, __tcp_transmit_skb() already ensures that constraint is respected, no matter how large tp->rcv_wnd is. The change doesn't violate RFC. Let me provide one example if with or without the patch: Before: client --- SYN: rwindow=65535 ---> server client <--- SYN+ACK: rwindow=65535 ---- server client --- ACK: rwindow=65536 ---> server Note: for the last ACK, the calculation is 512 << 7. After: client --- SYN: rwindow=65535 ---> server client <--- SYN+ACK: rwindow=65535 ---- server client --- ACK: rwindow=175232 ---> server Note: I use the following command to make it work: ip route change default via [ip] dev eth0 metric 100 initrwnd 120 For the last ACK, the calculation is 1369 << 7. When we apply such a patch, having a large rcv_wnd if the user tweak this knob can help transfer data more rapidly and save some rtts. Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20240521134220.12510-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-21tcp: Fix shift-out-of-bounds in dctcp_update_alpha().Kuniyuki Iwashima1-1/+12
In dctcp_update_alpha(), we use a module parameter dctcp_shift_g as follows: alpha -= min_not_zero(alpha, alpha >> dctcp_shift_g); ... delivered_ce <<= (10 - dctcp_shift_g); It seems syzkaller started fuzzing module parameters and triggered shift-out-of-bounds [0] by setting 100 to dctcp_shift_g: memcpy((void*)0x20000080, "/sys/module/tcp_dctcp/parameters/dctcp_shift_g\000", 47); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x20000080ul, /*flags=*/2ul, /*mode=*/0ul); memcpy((void*)0x20000000, "100\000", 4); syscall(__NR_write, /*fd=*/r[0], /*val=*/0x20000000ul, /*len=*/4ul); Let's limit the max value of dctcp_shift_g by param_set_uint_minmax(). With this patch: # echo 10 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g # cat /sys/module/tcp_dctcp/parameters/dctcp_shift_g 10 # echo 11 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g -bash: echo: write error: Invalid argument [0]: UBSAN: shift-out-of-bounds in net/ipv4/tcp_dctcp.c:143:12 shift exponent 100 is too large for 32-bit type 'u32' (aka 'unsigned int') CPU: 0 PID: 8083 Comm: syz-executor345 Not tainted 6.9.0-05151-g1b294a1f3561 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x201/0x300 lib/dump_stack.c:114 ubsan_epilogue lib/ubsan.c:231 [inline] __ubsan_handle_shift_out_of_bounds+0x346/0x3a0 lib/ubsan.c:468 dctcp_update_alpha+0x540/0x570 net/ipv4/tcp_dctcp.c:143 tcp_in_ack_event net/ipv4/tcp_input.c:3802 [inline] tcp_ack+0x17b1/0x3bc0 net/ipv4/tcp_input.c:3948 tcp_rcv_state_process+0x57a/0x2290 net/ipv4/tcp_input.c:6711 tcp_v4_do_rcv+0x764/0xc40 net/ipv4/tcp_ipv4.c:1937 sk_backlog_rcv include/net/sock.h:1106 [inline] __release_sock+0x20f/0x350 net/core/sock.c:2983 release_sock+0x61/0x1f0 net/core/sock.c:3549 mptcp_subflow_shutdown+0x3d0/0x620 net/mptcp/protocol.c:2907 mptcp_check_send_data_fin+0x225/0x410 net/mptcp/protocol.c:2976 __mptcp_close+0x238/0xad0 net/mptcp/protocol.c:3072 mptcp_close+0x2a/0x1a0 net/mptcp/protocol.c:3127 inet_release+0x190/0x1f0 net/ipv4/af_inet.c:437 __sock_release net/socket.c:659 [inline] sock_close+0xc0/0x240 net/socket.c:1421 __fput+0x41b/0x890 fs/file_table.c:422 task_work_run+0x23b/0x300 kernel/task_work.c:180 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0x9c8/0x2540 kernel/exit.c:878 do_group_exit+0x201/0x2b0 kernel/exit.c:1027 __do_sys_exit_group kernel/exit.c:1038 [inline] __se_sys_exit_group kernel/exit.c:1036 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1036 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xe4/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x67/0x6f RIP: 0033:0x7f6c2b5005b6 Code: Unable to access opcode bytes at 0x7f6c2b50058c. RSP: 002b:00007ffe883eb948 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00007f6c2b5862f0 RCX: 00007f6c2b5005b6 RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001 RBP: 0000000000000001 R08: 00000000000000e7 R09: ffffffffffffffc0 R10: 0000000000000006 R11: 0000000000000246 R12: 00007f6c2b5862f0 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 </TASK> Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: Yue Sun <samsun1006219@gmail.com> Reported-by: xingwei lee <xrivendell7@gmail.com> Closes: https://lore.kernel.org/netdev/CAEkJfYNJM=cw-8x7_Vmj1J6uYVCWMbbvD=EFmDPVBGpTsqOxEA@mail.gmail.com/ Fixes: e3118e8359bb ("net: tcp: add DCTCP congestion control algorithm") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240517091626.32772-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-18Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linuxLinus Torvalds2-8/+9
Pull more io_uring updates from Jens Axboe: "This adds support for IORING_CQE_F_SOCK_NONEMPTY for io_uring accept requests. This is very similar to previous work that enabled the same hint for doing receives on sockets. By far the majority of the work here is refactoring to enable the networking side to pass back whether or not the socket had more pending requests after accepting the current one, the last patch just wires it up for io_uring. Not only does this enable applications to know whether there are more connections to accept right now, it also enables smarter logic for io_uring multishot accept on whether to retry immediately or wait for a poll trigger" * tag 'net-accept-more-20240515' of git://git.kernel.dk/linux: io_uring/net: wire up IORING_CQE_F_SOCK_NONEMPTY for accept net: pass back whether socket was empty post accept net: have do_accept() take a struct proto_accept_arg argument net: change proto and proto_ops accept type
2024-05-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-6/+14
Merge in late fixes to prepare for the 6.10 net-next PR. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-14tcp: rstreason: fully support in tcp_check_req()Jason Xing1-1/+1
We're going to send an RST due to invalid syn packet which is already checked whether 1) it is in sequence, 2) it is a retransmitted skb. As RFC 793 says, if the state of socket is not CLOSED/LISTEN/SYN-SENT, then we should send an RST when receiving bad syn packet: "fourth, check the SYN bit,...If the SYN is in the window it is an error, send a reset" Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://lore.kernel.org/r/20240510122502.27850-6-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-14tcp: rstreason: handle timewait cases in the receive pathJason Xing1-1/+1
There are two possible cases where TCP layer can send an RST. Since they happen in the same place, I think using one independent reason is enough to identify this special situation. Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://lore.kernel.org/r/20240510122502.27850-5-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-14net: pass back whether socket was empty post acceptJens Axboe1-0/+1
This adds an 'is_empty' argument to struct proto_accept_arg, which can be used to pass back information on whether or not the given socket has more connections to accept post the one just accepted. To utilize this information, the caller should initialize the 'is_empty' field to, eg, -1 and then check for 0/1 after the accept. If the field has been set, the caller knows whether there are more pending connections or not. If the field remains -1 after the accept call, the protocol doesn't support passing back this information. This patch wires it up for ipv4/6 TCP. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-14net: change proto and proto_ops accept typeJens Axboe2-8/+8
Rather than pass in flags, error pointer, and whether this is a kernel invocation or not, add a struct proto_accept_arg struct as the argument. This then holds all of these arguments, and prepares accept for being able to pass back more information. No functional changes in this patch. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-14Merge tag 'for-netdev' of ↵Jakub Kicinski3-3/+7
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-05-13 We've added 119 non-merge commits during the last 14 day(s) which contain a total of 134 files changed, 9462 insertions(+), 4742 deletions(-). The main changes are: 1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi. 2) Add BPF range computation improvements to the verifier in particular around XOR and OR operators, refactoring of checks for range computation and relaxing MUL range computation so that src_reg can also be an unknown scalar, from Cupertino Miranda. 3) Add support to attach kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. Session mode is a common use-case for tetragon and bpftrace, from Jiri Olsa. 4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf as well as BPF selftest's struct_ops handling, from Andrii Nakryiko. 5) Improvements to BPF selftests in context of BPF gcc backend, from Jose E. Marchesi & David Faust. 6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test- -style in order to retire the old test, run it in BPF CI and additionally expand test coverage, from Jordan Rife. 7) Big batch for BPF selftest refactoring in order to remove duplicate code around common network helpers, from Geliang Tang. 8) Another batch of improvements to BPF selftests to retire obsolete bpf_tcp_helpers.h as everything is available vmlinux.h, from Martin KaFai Lau. 9) Fix BPF map tear-down to not walk the map twice on free when both timer and wq is used, from Benjamin Tissoires. 10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL, from Alexei Starovoitov. 11) Change BTF build scripts to using --btf_features for pahole v1.26+, from Alan Maguire. 12) Small improvements to BPF reusing struct_size() and krealloc_array(), from Andy Shevchenko. 13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions, from Ilya Leoshkevich. 14) Extend TCP ->cong_control() callback in order to feed in ack and flag parameters and allow write-access to tp->snd_cwnd_stamp from BPF program, from Miao Xu. 15) Add support for internal-only per-CPU instructions to inline bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs, from Puranjay Mohan. 16) Follow-up to remove the redundant ethtool.h from tooling infrastructure, from Tushar Vyavahare. 17) Extend libbpf to support "module:<function>" syntax for tracing programs, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits) bpf: make list_for_each_entry portable bpf: ignore expected GCC warning in test_global_func10.c bpf: disable strict aliasing in test_global_func9.c selftests/bpf: Free strdup memory in xdp_hw_metadata selftests/bpf: Fix a few tests for GCC related warnings. bpf: avoid gcc overflow warning in test_xdp_vlan.c tools: remove redundant ethtool.h from tooling infra selftests/bpf: Expand ATTACH_REJECT tests selftests/bpf: Expand getsockname and getpeername tests sefltests/bpf: Expand sockaddr hook deny tests selftests/bpf: Expand sockaddr program return value tests selftests/bpf: Retire test_sock_addr.(c|sh) selftests/bpf: Remove redundant sendmsg test cases selftests/bpf: Migrate ATTACH_REJECT test cases selftests/bpf: Migrate expected_attach_type tests selftests/bpf: Migrate wildcard destination rewrite test selftests/bpf: Migrate sendmsg6 v4 mapped address tests selftests/bpf: Migrate sendmsg deny test cases selftests/bpf: Migrate WILDCARD_IP test selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases ... ==================== Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-14netlabel: fix RCU annotation for IPv4 options on socket creationDavide Caratti1-3/+4
Xiumei reports the following splat when netlabel and TCP socket are used: ============================= WARNING: suspicious RCU usage 6.9.0-rc2+ #637 Not tainted ----------------------------- net/ipv4/cipso_ipv4.c:1880 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by ncat/23333: #0: ffffffff906030c0 (rcu_read_lock){....}-{1:2}, at: netlbl_sock_setattr+0x25/0x1b0 stack backtrace: CPU: 11 PID: 23333 Comm: ncat Kdump: loaded Not tainted 6.9.0-rc2+ #637 Hardware name: Supermicro SYS-6027R-72RF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0 07/26/2013 Call Trace: <TASK> dump_stack_lvl+0xa9/0xc0 lockdep_rcu_suspicious+0x117/0x190 cipso_v4_sock_setattr+0x1ab/0x1b0 netlbl_sock_setattr+0x13e/0x1b0 selinux_netlbl_socket_post_create+0x3f/0x80 selinux_socket_post_create+0x1a0/0x460 security_socket_post_create+0x42/0x60 __sock_create+0x342/0x3a0 __sys_socket_create.part.22+0x42/0x70 __sys_socket+0x37/0xb0 __x64_sys_socket+0x16/0x20 do_syscall_64+0x96/0x180 ? do_user_addr_fault+0x68d/0xa30 ? exc_page_fault+0x171/0x280 ? asm_exc_page_fault+0x22/0x30 entry_SYSCALL_64_after_hwframe+0x71/0x79 RIP: 0033:0x7fbc0ca3fc1b Code: 73 01 c3 48 8b 0d 05 f2 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 29 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 f1 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007fff18635208 EFLAGS: 00000246 ORIG_RAX: 0000000000000029 RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fbc0ca3fc1b RDX: 0000000000000006 RSI: 0000000000000001 RDI: 0000000000000002 RBP: 000055d24f80f8a0 R08: 0000000000000003 R09: 0000000000000001 R10: 0000000000020000 R11: 0000000000000246 R12: 000055d24f80f8a0 R13: 0000000000000000 R14: 000055d24f80fb88 R15: 0000000000000000 </TASK> The current implementation of cipso_v4_sock_setattr() replaces IP options under the assumption that the caller holds the socket lock; however, such assumption is not true, nor needed, in selinux_socket_post_create() hook. Let all callers of cipso_v4_sock_setattr() specify the "socket lock held" condition, except selinux_socket_post_create() _ where such condition can safely be set as true even without holding the socket lock. Fixes: f6d8bd051c39 ("inet: add RCU protection to inet->opt") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/f4260d000a3a55b9e8b6a3b4e3fffc7da9f82d41.1715359817.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-14tcp: socket option to check for MPTCP fallback to TCPMatthieu Baerts (NGI0)1-0/+3
A way for an application to know if an MPTCP connection fell back to TCP is to use getsockopt(MPTCP_INFO) and look for errors. The issue with this technique is that the same errors -- EOPNOTSUPP (IPv4) and ENOPROTOOPT (IPv6) -- are returned if there was a fallback, *or* if the kernel doesn't support this socket option. The userspace then has to look at the kernel version to understand what the errors mean. It is not clean, and it doesn't take into account older kernels where the socket option has been backported. A cleaner way would be to expose this info to the TCP socket level. In case of MPTCP socket where no fallback happened, the socket options for the TCP level will be handled in MPTCP code, in mptcp_getsockopt_sol_tcp(). If not, that will be in TCP code, in do_tcp_getsockopt(). So MPTCP simply has to set the value 1, while TCP has to set 0. If the socket option is not supported, one of these two errors will be reported: - EOPNOTSUPP (95 - Operation not supported) for MPTCP sockets - ENOPROTOOPT (92 - Protocol not available) for TCP sockets, e.g. on the socket received after an 'accept()', when the client didn't request to use MPTCP: this socket will be a TCP one, even if the listen socket was an MPTCP one. With this new option, the kernel can return a clear answer to both "Is this kernel new enough to tell me the fallback status?" and "If it is new enough, is it currently a TCP or MPTCP socket?" questions, while not breaking the previous method. Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240509-upstream-net-next-20240509-mptcp-tcp_is_mptcp-v1-1-f846df999202@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-14net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segmentRichard Gobert3-62/+5
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used in all merging UDP and TCP flows. These checks need to be done only once and only against the found p skb, since they only affect flush and not same_flow. This patch leverages correct network header offsets from the cb for both outer and inner network headers - allowing these checks to be done only once, in tcp_gro_receive and udp_gro_receive_segment. As a result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are more declarative and contained in inet_gro_flush, thus removing the need for flush_id in napi_gro_cb. This results in less parsing code for non-loop flush tests for TCP and UDP flows. To make sure results are not within noise range - I've made netfilter drop all TCP packets, and measured CPU performance in GRO (in this case GRO is responsible for about 50% of the CPU utilization). perf top while replaying 64 parallel IP/TCP streams merging in GRO: (gro_receive_network_flush is compiled inline to tcp_gro_receive) net-next: 6.94% [kernel] [k] inet_gro_receive 3.02% [kernel] [k] tcp_gro_receive patch applied: 4.27% [kernel] [k] tcp_gro_receive 4.22% [kernel] [k] inet_gro_receive perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same results for any encapsulation, in this case inet_gro_receive is top offender in net-next) net-next: 10.09% [kernel] [k] inet_gro_receive 2.08% [kernel] [k] tcp_gro_receive patch applied: 6.97% [kernel] [k] inet_gro_receive 3.68% [kernel] [k] tcp_gro_receive Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240509190819.2985-3-richardbgobert@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-14net: gro: use cb instead of skb->network_headerRichard Gobert2-5/+2
This patch converts references of skb->network_header to napi_gro_cb's network_offset and inner_network_offset. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240509190819.2985-2-richardbgobert@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13inet: fix inet_fill_ifaddr() flags truncationEric Dumazet1-3/+10
I missed that (struct ifaddrmsg)->ifa_flags was only 8bits, while (struct in_ifaddr)->ifa_flags is 32bits. Use a temporary 32bit variable as I did in set_ifa_lifetime() and check_lifetime(). Fixes: 3ddc2231c810 ("inet: annotate data-races around ifa->ifa_flags") Reported-by: Yu Watanabe <watanabe.yu@gmail.com> Dianosed-by: Yu Watanabe <watanabe.yu@gmail.com> Closes: https://github.com/systemd/systemd/pull/32666#issuecomment-2103977928 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240510072932.2678952-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13Merge tag 'nf-next-24-05-12' of ↵Jakub Kicinski1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: Patch #1 skips transaction if object type provides no .update interface. Patch #2 skips NETDEV_CHANGENAME which is unused. Patch #3 enables conntrack to handle Multicast Router Advertisements and Multicast Router Solicitations from the Multicast Router Discovery protocol (RFC4286) as untracked opposed to invalid packets. From Linus Luessing. Patch #4 updates DCCP conntracker to mark invalid as invalid, instead of dropping them, from Jason Xing. Patch #5 uses NF_DROP instead of -NF_DROP since NF_DROP is 0, also from Jason. Patch #6 removes reference in netfilter's sysctl documentation on pickup entries which were already removed by Florian Westphal. Patch #7 removes check for IPS_OFFLOAD flag to disable early drop which allows to evict entries from the conntrack table, also from Florian. Patches #8 to #16 updates nf_tables pipapo set backend to allocate the datastructure copy on-demand from preparation phase, to better deal with OOM situations where .commit step is too late to fail. Series from Florian Westphal. Patch #17 adds a selftest with packetdrill to cover conntrack TCP state transitions, also from Florian. Patch #18 use GFP_KERNEL to clone elements from control plane to avoid quick atomic reserves exhaustion with large sets, reporter refers to million entries magnitude. * tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: allow clone callbacks to sleep selftests: netfilter: add packetdrill based conntrack tests netfilter: nft_set_pipapo: remove dirty flag netfilter: nft_set_pipapo: move cloning of match info to insert/removal path netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone netfilter: nft_set_pipapo: merge deactivate helper into caller netfilter: nft_set_pipapo: prepare walk function for on-demand clone netfilter: nft_set_pipapo: prepare destroy function for on-demand clone netfilter: nft_set_pipapo: make pipapo_clone helper return NULL netfilter: nft_set_pipapo: move prove_locking helper around netfilter: conntrack: remove flowtable early-drop test netfilter: conntrack: documentation: remove reference to non-existent sysctl netfilter: use NF_DROP instead of -NF_DROP netfilter: conntrack: dccp: try not to drop skb in conntrack netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router Discovery netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handler netfilter: nf_tables: skip transaction if update object is not implemented ==================== Link: https://lore.kernel.org/r/20240512161436.168973-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-10tcp: get rid of twsk_unique()Eric Dumazet2-2/+2
DCCP is going away soon, and had no twsk_unique() method. We can directly call tcp_twsk_unique() for TCP sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240507164140.940547-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski5-5/+19
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 35d92abfbad8 ("net: hns3: fix kernel crash when devlink reload during initialization") 2a1a1a7b5fd7 ("net: hns3: add command queue trace for hns3") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08net/ipv4: add tracepoint for icmp_sendPeilin He1-0/+4
Introduce a tracepoint for icmp_send, which can help users to get more detail information conveniently when icmp abnormal events happen. 1. Giving an usecase example: ============================= When an application experiences packet loss due to an unreachable UDP destination port, the kernel will send an exception message through the icmp_send function. By adding a trace point for icmp_send, developers or system administrators can obtain detailed information about the UDP packet loss, including the type, code, source address, destination address, source port, and destination port. This facilitates the trouble-shooting of UDP packet loss issues especially for those network-service applications. 2. Operation Instructions: ========================== Switch to the tracing directory. cd /sys/kernel/tracing Filter for destination port unreachable. echo "type==3 && code==3" > events/icmp/icmp_send/filter Enable trace event. echo 1 > events/icmp/icmp_send/enable 3. Result View: ================ udp_client_erro-11370 [002] ...s.12 124.728002: icmp_send: icmp_send: type=3, code=3. From 127.0.0.1:41895 to 127.0.0.1:6666 ulen=23 skbaddr=00000000589b167a Signed-off-by: Peilin He <he.peilin@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Liu Chun <liu.chun2@zte.com.cn> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08net: annotate writes on dev->mtu from ndo_change_mtu()Eric Dumazet2-3/+3
Simon reported that ndo_change_mtu() methods were never updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted in commit 501a90c94510 ("inet: protect against too small mtu values.") We read dev->mtu without holding RTNL in many places, with READ_ONCE() annotations. It is time to take care of ndo_change_mtu() methods to use corresponding WRITE_ONCE() Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Simon Horman <horms@kernel.org> Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/ Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07rtnetlink: allow rtnl_fill_link_netnsid() to run under RCU protectionEric Dumazet1-1/+1
We want to be able to run rtnl_fill_ifinfo() under RCU protection instead of RTNL in the future. All rtnl_link_ops->get_link_net() methods already using dev_net() are ready. I added READ_ONCE() annotations on others. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07Merge tag 'ipsec-next-2024-05-03' of ↵Jakub Kicinski3-27/+0
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2024-05-03 1) Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support. This was defined by an early version of an IETF draft that did not make it to a standard. 2) Introduce direction attribute for xfrm states. xfrm states have a direction, a stsate can be used either for input or output packet processing. Add a direction to xfrm states to make it clear for what a xfrm state is used. * tag 'ipsec-next-2024-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: Restrict SA direction attribute to specific netlink message types xfrm: Add dir validation to "in" data path lookup xfrm: Add dir validation to "out" data path lookup xfrm: Add Direction to the SA in or out udpencap: Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support ==================== Link: https://lore.kernel.org/r/20240503082732.2835810-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06netfilter: use NF_DROP instead of -NF_DROPJason Xing1-1/+1
At the beginning in 2009 one patch [1] introduced collecting drop counter in nf_conntrack_in() by returning -NF_DROP. Later, another patch [2] changed the return value of tcp_packet() which now is renamed to nf_conntrack_tcp_packet() from -NF_DROP to NF_DROP. As we can see, that -NF_DROP should be corrected. Similarly, there are other two points where the -NF_DROP is used. Well, as NF_DROP is equal to 0, inverting NF_DROP makes no sense as patch [2] said many years ago. [1] commit 7d1e04598e5e ("netfilter: nf_conntrack: account packets drop by tcp_packet()") [2] commit ec8d540969da ("netfilter: conntrack: fix dropping packet after l4proto->packet()") Signed-off-by: Jason Xing <kernelxing@tencent.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06net: add heuristic for enabling TCP fraglist GROFelix Fietkau1-0/+32
When forwarding TCP after GRO, software segmentation is very expensive, especially when the checksum needs to be recalculated. One case where that's currently unavoidable is when routing packets over PPPoE. Performance improves significantly when using fraglist GRO implemented in the same way as for UDP. When NETIF_F_GRO_FRAGLIST is enabled, perform a lookup for an established socket in the same netns as the receiving device. While this may not cover all relevant use cases in multi-netns configurations, it should be good enough for most configurations that need this. Here's a measurement of running 2 TCP streams through a MediaTek MT7622 device (2-core Cortex-A53), which runs NAT with flow offload enabled from one ethernet port to PPPoE on another ethernet port + cake qdisc set to 1Gbps. rx-gro-list off: 630 Mbit/s, CPU 35% idle rx-gro-list on: 770 Mbit/s, CPU 40% idle Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06net: create tcp_gro_header_pull helper functionFelix Fietkau1-21/+34
Pull the code out of tcp_gro_receive in order to access the tcp header from tcp4/6_gro_receive. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06net: create tcp_gro_lookup helper functionFelix Fietkau1-16/+25
This pulls the flow port matching out of tcp_gro_receive, so that it can be reused for the next change, which adds the TCP fraglist GRO heuristic. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06net: add code for TCP fraglist GROFelix Fietkau1-0/+21
This implements fraglist GRO similar to how it's handled in UDP, however no functional changes are added yet. The next change adds a heuristic for using fraglist GRO instead of regular GRO. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06net: add support for segmenting TCP fraglist GSO packetsFelix Fietkau1-0/+67
Preparation for adding TCP fraglist GRO support. It expects packets to be combined in a similar way as UDP fraglist GSO packets. For IPv4 packets, NAT is handled in the same way as UDP fraglist GSO. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06net: move skb_gro_receive_list from udp to coreFelix Fietkau1-27/+0
This helper function will be used for TCP fraglist GRO support Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-04Merge tag 'ipsec-2024-05-02' of ↵Jakub Kicinski1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2024-05-02 1) Fix an error pointer dereference in xfrm_in_fwd_icmp. From Antony Antony. 2) Preserve vlan tags for ESP transport mode software GRO. From Paul Davey. 3) Fix a spelling mistake in an uapi xfrm.h comment. From Anotny Antony. * tag 'ipsec-2024-05-02' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Correct spelling mistake in xfrm.h comment xfrm: Preserve vlan tags for transport mode software GRO xfrm: fix possible derferencing in error path ==================== Link: https://lore.kernel.org/r/20240502084838.2269355-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03net: ipv{6,4}: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados5-16/+7
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) * Remove sentinel element from ctl_table structs. * Remove the zeroing out of an array element (to make it look like a sentinel) in sysctl_route_net_init And ipv6_route_sysctl_init. This is not longer needed and is safe after commit c899710fe7f9 ("networking: Update to register_net_sysctl_sz") added the array size to the ctl_table registration. * Remove extra sentinel element in the declaration of devinet_vars. * Removed the "-1" in __devinet_sysctl_register, sysctl_route_net_init, ipv6_sysctl_net_init and ipv4_sysctl_init_net that adjusted for having an extra empty element when looping over ctl_table arrays * Replace the for loop stop condition in __addrconf_sysctl_register that tests for procname == NULL with one that depends on array size * Removing the unprivileged user check in ipv6_route_sysctl_init is safe as it is replaced by calling ipv6_route_sysctl_table_size; introduced in commit c899710fe7f9 ("networking: Update to register_net_sysctl_sz") * Use a table_size variable to keep the value of ARRAY_SIZE Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03tcp: Use refcount_inc_not_zero() in tcp_twsk_unique().Kuniyuki Iwashima1-1/+7
Anderson Nascimento reported a use-after-free splat in tcp_twsk_unique() with nice analysis. Since commit ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance"), inet_twsk_hashdance() sets TIME-WAIT socket's sk_refcnt after putting it into ehash and releasing the bucket lock. Thus, there is a small race window where other threads could try to reuse the port during connect() and call sock_hold() in tcp_twsk_unique() for the TIME-WAIT socket with zero refcnt. If that happens, the refcnt taken by tcp_twsk_unique() is overwritten and sock_put() will cause underflow, triggering a real use-after-free somewhere else. To avoid the use-after-free, we need to use refcount_inc_not_zero() in tcp_twsk_unique() and give up on reusing the port if it returns false. [0]: refcount_t: addition on 0; use-after-free. WARNING: CPU: 0 PID: 1039313 at lib/refcount.c:25 refcount_warn_saturate+0xe5/0x110 CPU: 0 PID: 1039313 Comm: trigger Not tainted 6.8.6-200.fc39.x86_64 #1 Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.21805430.B64.2305221830 05/22/2023 RIP: 0010:refcount_warn_saturate+0xe5/0x110 Code: 42 8e ff 0f 0b c3 cc cc cc cc 80 3d aa 13 ea 01 00 0f 85 5e ff ff ff 48 c7 c7 f8 8e b7 82 c6 05 96 13 ea 01 01 e8 7b 42 8e ff <0f> 0b c3 cc cc cc cc 48 c7 c7 50 8f b7 82 c6 05 7a 13 ea 01 01 e8 RSP: 0018:ffffc90006b43b60 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff888009bb3ef0 RCX: 0000000000000027 RDX: ffff88807be218c8 RSI: 0000000000000001 RDI: ffff88807be218c0 RBP: 0000000000069d70 R08: 0000000000000000 R09: ffffc90006b439f0 R10: ffffc90006b439e8 R11: 0000000000000003 R12: ffff8880029ede84 R13: 0000000000004e20 R14: ffffffff84356dc0 R15: ffff888009bb3ef0 FS: 00007f62c10926c0(0000) GS:ffff88807be00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020ccb000 CR3: 000000004628c005 CR4: 0000000000f70ef0 PKRU: 55555554 Call Trace: <TASK> ? refcount_warn_saturate+0xe5/0x110 ? __warn+0x81/0x130 ? refcount_warn_saturate+0xe5/0x110 ? report_bug+0x171/0x1a0 ? refcount_warn_saturate+0xe5/0x110 ? handle_bug+0x3c/0x80 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? refcount_warn_saturate+0xe5/0x110 tcp_twsk_unique+0x186/0x190 __inet_check_established+0x176/0x2d0 __inet_hash_connect+0x74/0x7d0 ? __pfx___inet_check_established+0x10/0x10 tcp_v4_connect+0x278/0x530 __inet_stream_connect+0x10f/0x3d0 inet_stream_connect+0x3a/0x60 __sys_connect+0xa8/0xd0 __x64_sys_connect+0x18/0x20 do_syscall_64+0x83/0x170 entry_SYSCALL_64_after_hwframe+0x78/0x80 RIP: 0033:0x7f62c11a885d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a3 45 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007f62c1091e58 EFLAGS: 00000296 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000020ccb004 RCX: 00007f62c11a885d RDX: 0000000000000010 RSI: 0000000020ccb000 RDI: 0000000000000003 RBP: 00007f62c1091e90 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000296 R12: 00007f62c10926c0 R13: ffffffffffffff88 R14: 0000000000000000 R15: 00007ffe237885b0 </TASK> Fixes: ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance") Reported-by: Anderson Nascimento <anderson@allelesecurity.com> Closes: https://lore.kernel.org/netdev/37a477a6-d39e-486b-9577-3463f655a6b7@allelesecurity.com/ Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240501213145.62261-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV socketsEric Dumazet3-3/+7
TCP_SYN_RECV state is really special, it is only used by cross-syn connections, mostly used by fuzzers. In the following crash [1], syzbot managed to trigger a divide by zero in tcp_rcv_space_adjust() A socket makes the following state transitions, without ever calling tcp_init_transfer(), meaning tcp_init_buffer_space() is also not called. TCP_CLOSE connect() TCP_SYN_SENT TCP_SYN_RECV shutdown() -> tcp_shutdown(sk, SEND_SHUTDOWN) TCP_FIN_WAIT1 To fix this issue, change tcp_shutdown() to not perform a TCP_SYN_RECV -> TCP_FIN_WAIT1 transition, which makes no sense anyway. When tcp_rcv_state_process() later changes socket state from TCP_SYN_RECV to TCP_ESTABLISH, then look at sk->sk_shutdown to finally enter TCP_FIN_WAIT1 state, and send a FIN packet from a sane socket state. This means tcp_send_fin() can now be called from BH context, and must use GFP_ATOMIC allocations. [1] divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 1 PID: 5084 Comm: syz-executor358 Not tainted 6.9.0-rc6-syzkaller-00022-g98369dccd2f8 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:tcp_rcv_space_adjust+0x2df/0x890 net/ipv4/tcp_input.c:767 Code: e3 04 4c 01 eb 48 8b 44 24 38 0f b6 04 10 84 c0 49 89 d5 0f 85 a5 03 00 00 41 8b 8e c8 09 00 00 89 e8 29 c8 48 0f af c3 31 d2 <48> f7 f1 48 8d 1c 43 49 8d 96 76 08 00 00 48 89 d0 48 c1 e8 03 48 RSP: 0018:ffffc900031ef3f0 EFLAGS: 00010246 RAX: 0c677a10441f8f42 RBX: 000000004fb95e7e RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000027d4b11f R08: ffffffff89e535a4 R09: 1ffffffff25e6ab7 R10: dffffc0000000000 R11: ffffffff8135e920 R12: ffff88802a9f8d30 R13: dffffc0000000000 R14: ffff88802a9f8d00 R15: 1ffff1100553f2da FS: 00005555775c0380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1155bf2304 CR3: 000000002b9f2000 CR4: 0000000000350ef0 Call Trace: <TASK> tcp_recvmsg_locked+0x106d/0x25a0 net/ipv4/tcp.c:2513 tcp_recvmsg+0x25d/0x920 net/ipv4/tcp.c:2578 inet6_recvmsg+0x16a/0x730 net/ipv6/af_inet6.c:680 sock_recvmsg_nosec net/socket.c:1046 [inline] sock_recvmsg+0x109/0x280 net/socket.c:1068 ____sys_recvmsg+0x1db/0x470 net/socket.c:2803 ___sys_recvmsg net/socket.c:2845 [inline] do_recvmmsg+0x474/0xae0 net/socket.c:2939 __sys_recvmmsg net/socket.c:3018 [inline] __do_sys_recvmmsg net/socket.c:3041 [inline] __se_sys_recvmmsg net/socket.c:3034 [inline] __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7faeb6363db9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffcc1997168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007faeb6363db9 RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000005 RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000001c R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20240501125448.896529-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03bpf: tcp: Allow to write tp->snd_cwnd_stamp in bpf_tcp_caMiao Xu1-0/+3
This patch allows the write of tp->snd_cwnd_stamp in a bpf tcp ca program. An use case of writing this field is to keep track of the time whenever tp->snd_cwnd is raised or reduced inside the `cong_control` callback. Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Miao Xu <miaxu@meta.com> Link: https://lore.kernel.org/r/20240502042318.801932-3-miaxu@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-03tcp: Add new args for cong_control in tcp_congestion_opsMiao Xu3-3/+4
This patch adds two new arguments for cong_control of struct tcp_congestion_ops: - ack - flag These two arguments are inherited from the caller tcp_cong_control in tcp_intput.c. One use case of them is to update cwnd and pacing rate inside cong_control based on the info they provide. For example, the flag can be used to decide if it is the right time to raise or reduce a sender's cwnd. Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Miao Xu <miaxu@meta.com> Link: https://lore.kernel.org/r/20240502042318.801932-2-miaxu@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski5-4/+20
Cross-merge networking fixes after downstream PR. Conflicts: include/linux/filter.h kernel/bpf/core.c 66e13b615a0c ("bpf: verifier: prevent userspace memory access") d503a04f8bc0 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT") https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02net: gro: add flush check in udp_gro_receive_segmentRichard Gobert1-1/+11
GRO-GSO path is supposed to be transparent and as such L3 flush checks are relevant to all UDP flows merging in GRO. This patch uses the same logic and code from tcp_gro_receive, terminating merge if flush is non zero. Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.") Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02net: gro: fix udp bad offset in socket lookup by adding ↵Richard Gobert3-2/+5
{inner_}network_offset to napi_gro_cb Commits a602456 ("udp: Add GRO functions to UDP socket") and 57c67ff ("udp: additional GRO support") introduce incorrect usage of {ip,ipv6}_hdr in the complete phase of gro. The functions always return skb->network_header, which in the case of encapsulated packets at the gro complete phase, is always set to the innermost L3 of the packet. That means that calling {ip,ipv6}_hdr for skbs which completed the GRO receive phase (both in gro_list and *_gro_complete) when parsing an encapsulated packet's _outer_ L3/L4 may return an unexpected value. This incorrect usage leads to a bug in GRO's UDP socket lookup. udp{4,6}_lib_lookup_skb functions use ip_hdr/ipv6_hdr respectively. These *_hdr functions return network_header which will point to the innermost L3, resulting in the wrong offset being used in __udp{4,6}_lib_lookup with encapsulated packets. This patch adds network_offset and inner_network_offset to napi_gro_cb, and makes sure both are set correctly. To fix the issue, network_offsets union is used inside napi_gro_cb, in which both the outer and the inner network offsets are saved. Reproduction example: Endpoint configuration example (fou + local address bind) # ip fou add port 6666 ipproto 4 # ip link add name tun1 type ipip remote 2.2.2.1 local 2.2.2.2 encap fou encap-dport 5555 encap-sport 6666 mode ipip # ip link set tun1 up # ip a add 1.1.1.2/24 dev tun1 Netperf TCP_STREAM result on net-next before patch is applied: net-next main, GRO enabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 131072 16384 16384 5.28 2.37 net-next main, GRO disabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 131072 16384 16384 5.01 2745.06 patch applied, GRO enabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 131072 16384 16384 5.01 2877.38 Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket") Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02ipv4: Fix uninit-value access in __ip_make_skb()Shigeru Yoshida2-1/+4
KMSAN reported uninit-value access in __ip_make_skb() [1]. __ip_make_skb() tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL while __ip_make_skb() is running, the function will access icmphdr in the skb even if it is not included. This causes the issue reported by KMSAN. Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL on the socket. Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These are union in struct flowi4 and are implicitly initialized by flowi4_init_output(), but we should not rely on specific union layout. Initialize these explicitly in raw_sendmsg(). [1] BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481 __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481 ip_finish_skb include/net/ip.h:243 [inline] ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508 raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x274/0x3c0 net/socket.c:745 __sys_sendto+0x62c/0x7b0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x130/0x200 net/socket.c:2199 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Uninit was created at: slab_post_alloc_hook mm/slub.c:3804 [inline] slab_alloc_node mm/slub.c:3845 [inline] kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888 kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577 __alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668 alloc_skb include/linux/skbuff.h:1318 [inline] __ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128 ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365 raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x274/0x3c0 net/socket.c:745 __sys_sendto+0x62c/0x7b0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x130/0x200 net/socket.c:2199 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6d/0x75 CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb79b1 #25 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014 Fixes: 99e5acae193e ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02arp: Convert ioctl(SIOCGARP) to RCU.Kuniyuki Iwashima1-10/+18
ioctl(SIOCGARP) holds rtnl_lock() to get netdev by __dev_get_by_name() and copy dev->name safely and calls neigh_lookup() later, which looks up a neighbour entry under RCU. Let's replace __dev_get_by_name() with dev_get_by_name_rcu() and strscpy() with netdev_copy_name() to avoid locking rtnl_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02arp: Get dev after calling arp_req_(delete|set|get)().Kuniyuki Iwashima1-36/+50
arp_ioctl() holds rtnl_lock() first regardless of cmd (SIOCDARP, SIOCSARP, and SIOCGARP) to get net_device by __dev_get_by_name() and copy dev->name safely. In the SIOCGARP path, arp_req_get() calls neigh_lookup(), which looks up a neighbour entry under RCU. We will extend the RCU section not to take rtnl_lock() and instead use dev_get_by_name_rcu() for SIOCGARP. As a preparation, let's move __dev_get_by_name() into another function and call it from arp_req_delete(), arp_req_set(), and arp_req_get(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02arp: Remove a nest in arp_req_get().Kuniyuki Iwashima1-13/+18
This is a prep patch to make the following changes tidy. No functional change intended. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02arp: Factorise ip_route_output() call in arp_req_set() and arp_req_delete().Kuniyuki Iwashima1-20/+30
When ioctl(SIOCDARP/SIOCSARP) is issued for non-proxy entry (no ATF_COM) without arpreq.arp_dev[] set, arp_req_set() and arp_req_delete() looks up dev based on IPv4 address by ip_route_output(). Let's factorise the same code as arp_req_dev(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02arp: Validate netmask earlier for SIOCDARP and SIOCSARP in arp_ioctl().Kuniyuki Iwashima1-12/+15
When ioctl(SIOCDARP/SIOCSARP) is issued with ATF_PUBL, r.arp_netmask must be 0.0.0.0 or 255.255.255.255. Currently, the netmask is validated in arp_req_delete_public() or arp_req_set_public() under rtnl_lock(). We have ATF_NETMASK test in arp_ioctl() before holding rtnl_lock(), so let's move the netmask validation there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02arp: Move ATF_COM setting in arp_req_set().Kuniyuki Iwashima1-3/+6
In arp_req_set(), if ATF_PERM is set in arpreq.arp_flags, ATF_COM is set automatically. The flag will be used later for neigh_update() only when a neighbour entry is found. Let's set ATF_COM just before calling neigh_update(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01net: add <net/proto_memory.h>Eric Dumazet4-0/+4
Move some proto memory definitions out of <net/sock.h> Very few files need them, and following patch will include <net/hotdata.h> from <net/proto_memory.h> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01tcp: move tcp_out_of_memory() to net/ipv4/tcp.cEric Dumazet1-1/+9
tcp_out_of_memory() has a single caller: tcp_check_oom(). Following patch will also make sk_memory_allocated() not anymore visible from <net/sock.h> and <net/tcp.h> Add const qualifier to sock argument of tcp_out_of_memory() and tcp_check_oom(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01net: move sysctl_max_skb_frags to net_hotdataEric Dumazet1-1/+2
sysctl_max_skb_frags is used in TCP and MPTCP fast paths, move it to net_hodata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01inet: introduce dst_rtable() helperEric Dumazet7-33/+33
I added dst_rt6_info() in commit e8dfd42c17fa ("ipv6: introduce dst_rt6_info() helper") This patch does a similar change for IPv4. Instead of (struct rtable *)dst casts, we can use : #define dst_rtable(_ptr) \ container_of_const(_ptr, struct rtable, dst) Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>