summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2023-12-14net: prevent mss overflow in skb_segment()Eric Dumazet1-1/+2
Once again syzbot is able to crash the kernel in skb_segment() [1] GSO_BY_FRAGS is a forbidden value, but unfortunately the following computation in skb_segment() can reach it quite easily : mss = mss * partial_segs; 65535 = 3 * 5 * 17 * 257, so many initial values of mss can lead to a bad final result. Make sure to limit segmentation so that the new mss value is smaller than GSO_BY_FRAGS. [1] general protection fault, probably for non-canonical address 0xdffffc000000000e: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077] CPU: 1 PID: 5079 Comm: syz-executor993 Not tainted 6.7.0-rc4-syzkaller-00141-g1ae4cd3cbdd0 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023 RIP: 0010:skb_segment+0x181d/0x3f30 net/core/skbuff.c:4551 Code: 83 e3 02 e9 fb ed ff ff e8 90 68 1c f9 48 8b 84 24 f8 00 00 00 48 8d 78 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 8a 21 00 00 48 8b 84 24 f8 00 RSP: 0018:ffffc900043473d0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000010046 RCX: ffffffff886b1597 RDX: 000000000000000e RSI: ffffffff886b2520 RDI: 0000000000000070 RBP: ffffc90004347578 R08: 0000000000000005 R09: 000000000000ffff R10: 000000000000ffff R11: 0000000000000002 R12: ffff888063202ac0 R13: 0000000000010000 R14: 000000000000ffff R15: 0000000000000046 FS: 0000555556e7e380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020010000 CR3: 0000000027ee2000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> udp6_ufo_fragment+0xa0e/0xd00 net/ipv6/udp_offload.c:109 ipv6_gso_segment+0x534/0x17e0 net/ipv6/ip6_offload.c:120 skb_mac_gso_segment+0x290/0x610 net/core/gso.c:53 __skb_gso_segment+0x339/0x710 net/core/gso.c:124 skb_gso_segment include/net/gso.h:83 [inline] validate_xmit_skb+0x36c/0xeb0 net/core/dev.c:3626 __dev_queue_xmit+0x6f3/0x3d60 net/core/dev.c:4338 dev_queue_xmit include/linux/netdevice.h:3134 [inline] packet_xmit+0x257/0x380 net/packet/af_packet.c:276 packet_snd net/packet/af_packet.c:3087 [inline] packet_sendmsg+0x24c6/0x5220 net/packet/af_packet.c:3119 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0xd5/0x180 net/socket.c:745 __sys_sendto+0x255/0x340 net/socket.c:2190 __do_sys_sendto net/socket.c:2202 [inline] __se_sys_sendto net/socket.c:2198 [inline] __x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x40/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b RIP: 0033:0x7f8692032aa9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 d1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fff8d685418 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8692032aa9 RDX: 0000000000010048 RSI: 00000000200000c0 RDI: 0000000000000003 RBP: 00000000000f4240 R08: 0000000020000540 R09: 0000000000000014 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff8d685480 R13: 0000000000000001 R14: 00007fff8d685480 R15: 0000000000000003 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:skb_segment+0x181d/0x3f30 net/core/skbuff.c:4551 Code: 83 e3 02 e9 fb ed ff ff e8 90 68 1c f9 48 8b 84 24 f8 00 00 00 48 8d 78 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 8a 21 00 00 48 8b 84 24 f8 00 RSP: 0018:ffffc900043473d0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000010046 RCX: ffffffff886b1597 RDX: 000000000000000e RSI: ffffffff886b2520 RDI: 0000000000000070 RBP: ffffc90004347578 R08: 0000000000000005 R09: 000000000000ffff R10: 000000000000ffff R11: 0000000000000002 R12: ffff888063202ac0 R13: 0000000000010000 R14: 000000000000ffff R15: 0000000000000046 FS: 0000555556e7e380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020010000 CR3: 0000000027ee2000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 3953c46c3ac7 ("sk_buff: allow segmenting based on frag sizes") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20231212164621.4131800-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-10io_uring/af_unix: disable sending io_uring over socketsPavel Begunkov1-0/+6
File reference cycles have caused lots of problems for io_uring in the past, and it still doesn't work exactly right and races with unix_stream_read_generic(). The safest fix would be to completely disallow sending io_uring files via sockets via SCM_RIGHT, so there are no possible cycles invloving registered files and thus rendering SCM accounting on the io_uring side unnecessary. Cc: stable@vger.kernel.org Fixes: 0091bfc81741b ("io_uring/af_unix: defer registered files gc to io_uring release") Reported-and-suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-08neighbour: Don't let neigh_forced_gc() disable preemption for longJudy Hsiao1-1/+8
We are seeing cases where neigh_cleanup_and_release() is called by neigh_forced_gc() many times in a row with preemption turned off. When running on a low powered CPU at a low CPU frequency, this has been measured to keep preemption off for ~10 ms. That's not great on a system with HZ=1000 which expects tasks to be able to schedule in with ~1ms latency. Suggested-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Judy Hsiao <judyhsiao@chromium.org> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-07drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" groupIdo Schimmel1-1/+3
The "NET_DM" generic netlink family notifies drop locations over the "events" multicast group. This is problematic since by default generic netlink allows non-root users to listen to these notifications. Fix by adding a new field to the generic netlink multicast group structure that when set prevents non-root users or root without the 'CAP_SYS_ADMIN' capability (in the user namespace owning the network namespace) from joining the group. Set this field for the "events" group. Use 'CAP_SYS_ADMIN' rather than 'CAP_NET_ADMIN' because of the nature of the information that is shared over this group. Note that the capability check in this case will always be performed against the initial user namespace since the family is not netns aware and only operates in the initial network namespace. A new field is added to the structure rather than using the "flags" field because the existing field uses uAPI flags and it is inappropriate to add a new uAPI flag for an internal kernel check. In net-next we can rework the "flags" field to use internal flags and fold the new field into it. But for now, in order to reduce the amount of changes, add a new field. Since the information can only be consumed by root, mark the control plane operations that start and stop the tracing as root-only using the 'GENL_ADMIN_PERM' flag. Tested using [1]. Before: # capsh -- -c ./dm_repo # capsh --drop=cap_sys_admin -- -c ./dm_repo After: # capsh -- -c ./dm_repo # capsh --drop=cap_sys_admin -- -c ./dm_repo Failed to join "events" multicast group [1] $ cat dm.c #include <stdio.h> #include <netlink/genl/ctrl.h> #include <netlink/genl/genl.h> #include <netlink/socket.h> int main(int argc, char **argv) { struct nl_sock *sk; int grp, err; sk = nl_socket_alloc(); if (!sk) { fprintf(stderr, "Failed to allocate socket\n"); return -1; } err = genl_connect(sk); if (err) { fprintf(stderr, "Failed to connect socket\n"); return err; } grp = genl_ctrl_resolve_grp(sk, "NET_DM", "events"); if (grp < 0) { fprintf(stderr, "Failed to resolve \"events\" multicast group\n"); return grp; } err = nl_socket_add_memberships(sk, grp, NFNLGRP_NONE); if (err) { fprintf(stderr, "Failed to join \"events\" multicast group\n"); return err; } return 0; } $ gcc -I/usr/include/libnl3 -lnl-3 -lnl-genl-3 -o dm_repo dm.c Fixes: 9a8afc8d3962 ("Network Drop Monitor: Adding drop monitor implementation & Netlink protocol") Reported-by: "The UK's National Cyber Security Centre (NCSC)" <security@ncsc.gov.uk> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231206213102.1824398-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07bpf: sockmap, updating the sg structure should also update currJohn Fastabend1-0/+19
Curr pointer should be updated when the sg structure is shifted. Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20231206232706.374377-3-john.fastabend@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30bpf, sockmap: af_unix stream sockets need to hold ref for pair sockJohn Fastabend1-0/+2
AF_UNIX stream sockets are a paired socket. So sending on one of the pairs will lookup the paired socket as part of the send operation. It is possible however to put just one of the pairs in a BPF map. This currently increments the refcnt on the sock in the sockmap to ensure it is not free'd by the stack before sockmap cleans up its state and stops any skbs being sent/recv'd to that socket. But we missed a case. If the peer socket is closed it will be free'd by the stack. However, the paired socket can still be referenced from BPF sockmap side because we hold a reference there. Then if we are sending traffic through BPF sockmap to that socket it will try to dereference the free'd pair in its send logic creating a use after free. And following splat: [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0 [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954 [...] [59.905468] Call Trace: [59.905787] <TASK> [59.906066] dump_stack_lvl+0x130/0x1d0 [59.908877] print_report+0x16f/0x740 [59.910629] kasan_report+0x118/0x160 [59.912576] sk_wake_async+0x31/0x1b0 [59.913554] sock_def_readable+0x156/0x2a0 [59.914060] unix_stream_sendmsg+0x3f9/0x12a0 [59.916398] sock_sendmsg+0x20e/0x250 [59.916854] skb_send_sock+0x236/0xac0 [59.920527] sk_psock_backlog+0x287/0xaa0 To fix let BPF sockmap hold a refcnt on both the socket in the sockmap and its paired socket. It wasn't obvious how to contain the fix to bpf_unix logic. The primarily problem with keeping this logic in bpf_unix was: In the sock close() we could handle the deref by having a close handler. But, when we are destroying the psock through a map delete operation we wouldn't have gotten any signal thorugh the proto struct other than it being replaced. If we do the deref from the proto replace its too early because we need to deref the sk_pair after the backlog worker has been stopped. Given all this it seems best to just cache it at the end of the psock and eat 8B for the af_unix and vsock users. Notice dgram sockets are OK because they handle locking already. Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20231129012557.95371-2-john.fastabend@gmail.com
2023-11-20bpf, netkit: Add indirect call wrapper for fetching peer devDaniel Borkmann1-5/+13
ndo_get_peer_dev is used in tcx BPF fast path, therefore make use of indirect call wrapper and therefore optimize the bpf_redirect_peer() internal handling a bit. Add a small skb_get_peer_dev() wrapper which utilizes the INDIRECT_CALL_1() macro instead of open coding. Future work could potentially add a peer pointer directly into struct net_device in future and convert veth and netkit over to use it so that eventually ndo_get_peer_dev can be removed. Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20231114004220.6495-7-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20bpf: Fix dev's rx stats for bpf_redirect_peer trafficPeilin Ye2-0/+9
Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium) is not accounted for in the RX stats of supported devices (that is, veth and netkit), confusing user space metrics collectors such as cAdvisor [0], as reported by Youlun. Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the @tstats per-CPU counters (instead of @lstats, or @dstats). To make this more fool-proof, error out when ndo_get_peer_dev is set but @tstats are not selected. [0] Specifically, the "container_network_receive_{byte,packet}s_total" counters are affected. Fixes: 9aa1206e8f48 ("bpf: Add redirect_peer helper") Reported-by: Youlun Zhang <zhangyoulun@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20231114004220.6495-6-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20net: Move {l,t,d}stats allocation to core and convert veth & vrfDaniel Borkmann1-1/+48
Move {l,t,d}stats allocation to the core and let netdevs pick the stats type they need. That way the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc) - all happening in the core. Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-15net: Fix undefined behavior in netdev name allocationGal Pressman1-1/+3
Cited commit removed the strscpy() call and kept the snprintf() only. It is common to use 'dev->name' as the format string before a netdev is registered, this results in 'res' and 'name' pointers being equal. According to POSIX, if copying takes place between objects that overlap as a result of a call to sprintf() or snprintf(), the results are undefined. Add back the strscpy() and use 'buf' as an intermediate buffer. Fixes: 7ad17b04dc7b ("net: trust the bitmap in __dev_alloc_name()") Cc: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-13net: gso_test: support CONFIG_MAX_SKB_FRAGS up to 45Willem de Bruijn1-9/+5
The test allocs a single page to hold all the frag_list skbs. This is insufficient on kernels with CONFIG_MAX_SKB_FRAGS=45, due to the increased skb_shared_info frags[] array length. gso_test_func: ASSERTION FAILED at net/core/gso_test.c:210 Expected alloc_size <= ((1UL) << 12), but alloc_size == 5075 (0x13d3) ((1UL) << 12) == 4096 (0x1000) Simplify the logic. Just allocate a page for each frag_list skb. Fixes: 4688ecb1385f ("net: expand skb_segment unit test with frag_list coverage") Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-10Merge tag 'net-6.7-rc1' of ↵Linus Torvalds3-14/+11
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter and bpf. Current release - regressions: - sched: fix SKB_NOT_DROPPED_YET splat under debug config Current release - new code bugs: - tcp: - fix usec timestamps with TCP fastopen - fix possible out-of-bounds reads in tcp_hash_fail() - fix SYN option room calculation for TCP-AO - tcp_sigpool: fix some off by one bugs - bpf: fix compilation error without CGROUPS - ptp: - ptp_read() should not release queue - fix tsevqs corruption Previous releases - regressions: - llc: verify mac len before reading mac header Previous releases - always broken: - bpf: - fix check_stack_write_fixed_off() to correctly spill imm - fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END - check map->usercnt after timer->timer is assigned - dsa: lan9303: consequently nested-lock physical MDIO - dccp/tcp: call security_inet_conn_request() after setting IP addr - tg3: fix the TX ring stall due to incorrect full ring handling - phylink: initialize carrier state at creation - ice: fix direction of VF rules in switchdev mode Misc: - fill in a bunch of missing MODULE_DESCRIPTION()s, more to come" * tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits) net: ti: icss-iep: fix setting counter value ptp: fix corrupted list in ptp_open ptp: ptp_read should not release queue net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP net: kcm: fill in MODULE_DESCRIPTION() net/sched: act_ct: Always fill offloading tuple iifidx netfilter: nat: fix ipv6 nat redirect with mapped and scoped addresses netfilter: xt_recent: fix (increase) ipv6 literal buffer length ipvs: add missing module descriptions netfilter: nf_tables: remove catchall element in GC sync path netfilter: add missing module descriptions drivers/net/ppp: use standard array-copy-function net: enetc: shorten enetc_setup_xdp_prog() error message to fit NETLINK_MAX_FMTMSG_LEN virtio/vsock: Fix uninit-value in virtio_transport_recv_pkt() r8169: respect userspace disabling IFF_MULTICAST selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg net: phylink: initialize carrier state at creation test/vsock: add dobule bind connect test test/vsock: refactor vsock_accept ...
2023-11-09Merge tag 'for-netdev' of ↵Jakub Kicinski2-13/+6
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-11-08 We've added 16 non-merge commits during the last 6 day(s) which contain a total of 30 files changed, 341 insertions(+), 130 deletions(-). The main changes are: 1) Fix a BPF verifier issue in precision tracking for BPF_ALU | BPF_TO_BE | BPF_END where the source register was incorrectly marked as precise, from Shung-Hsi Yu. 2) Fix a concurrency issue in bpf_timer where the former could still have been alive after an application releases or unpins the map, from Hou Tao. 3) Fix a BPF verifier issue where immediates are incorrectly cast to u32 before being spilled and therefore losing sign information, from Hao Sun. 4) Fix a misplaced BPF_TRACE_ITER in check_css_task_iter_allowlist which incorrectly compared bpf_prog_type with bpf_attach_type, from Chuyi Zhou. 5) Add __bpf_hook_{start,end} as well as __bpf_kfunc_{start,end}_defs macros, migrate all BPF-related __diag callsites over to it, and add a new __diag_ignore_all for -Wmissing-declarations to the macros to address recent build warnings, from Dave Marchevsky. 6) Fix broken BPF selftest build of xdp_hw_metadata test on architectures where char is not signed, from Björn Töpel. 7) Fix test_maps selftest to properly use LIBBPF_OPTS() macro to initialize the bpf_map_create_opts, from Andrii Nakryiko. 8) Fix bpffs selftest to avoid unmounting /sys/kernel/debug as it may have been mounted and used by other applications already, from Manu Bretelle. 9) Fix a build issue without CONFIG_CGROUPS wrt css_task open-coded iterators, from Matthieu Baerts. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg selftests/bpf: Fix broken build where char is unsigned selftests/bpf: precision tracking test for BPF_NEG and BPF_END bpf: Fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END selftests/bpf: Add test for using css_task iter in sleepable progs selftests/bpf: Add tests for css_task iter combining with cgroup iter bpf: Relax allowlist for css_task iter selftests/bpf: fix test_maps' use of bpf_map_create_opts bpf: Check map->usercnt after timer->timer is assigned bpf: Add __bpf_hook_{start,end} macros bpf: Add __bpf_kfunc_{start,end}_defs macros selftests/bpf: fix test_bpffs selftests/bpf: Add test for immediate spilled to stack bpf: Fix check_stack_write_fixed_off() to correctly spill imm bpf: fix compilation error without CGROUPS ==================== Link: https://lore.kernel.org/r/20231108132448.1970-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-03Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "As usual, lots of singleton and doubleton patches all over the tree and there's little I can say which isn't in the individual changelogs. The lengthier patch series are - 'kdump: use generic functions to simplify crashkernel reservation in arch', from Baoquan He. This is mainly cleanups and consolidation of the 'crashkernel=' kernel parameter handling - After much discussion, David Laight's 'minmax: Relax type checks in min() and max()' is here. Hopefully reduces some typecasting and the use of min_t() and max_t() - A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... and which remove task_struct.thread_group" * tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits) scripts/gdb/vmalloc: disable on no-MMU scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n .mailmap: add address mapping for Tomeu Vizoso mailmap: update email address for Claudiu Beznea tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions .mailmap: map Benjamin Poirier's address scripts/gdb: add lx_current support for riscv ocfs2: fix a spelling typo in comment proc: test ProtectionKey in proc-empty-vm test proc: fix proc-empty-vm test with vsyscall fs/proc/base.c: remove unneeded semicolon do_io_accounting: use sig->stats_lock do_io_accounting: use __for_each_thread() ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error() ocfs2: fix a typo in a comment scripts/show_delta: add __main__ judgement before main code treewide: mark stuff as __ro_after_init fs: ocfs2: check status values proc: test /proc/${pid}/statm compiler.h: move __is_constexpr() to compiler.h ...
2023-11-02net: page_pool: add missing free_percpu when page_pool_init failJian Shen1-1/+5
When ptr_ring_init() returns failure in page_pool_init(), free_percpu() is not called to free pool->recycle_stats, which may cause memory leak. Fixes: ad6fa1e1ab1b ("page_pool: Add recycle stats") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/20231030091256.2915394-1-shaojijie@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-02bpf: Add __bpf_kfunc_{start,end}_defs macrosDave Marchevsky2-13/+6
BPF kfuncs are meant to be called from BPF programs. Accordingly, most kfuncs are not called from anywhere in the kernel, which the -Wmissing-prototypes warning is unhappy about. We've peppered __diag_ignore_all("-Wmissing-prototypes", ... everywhere kfuncs are defined in the codebase to suppress this warning. This patch adds two macros meant to bound one or many kfunc definitions. All existing kfunc definitions which use these __diag calls to suppress -Wmissing-prototypes are migrated to use the newly-introduced macros. A new __diag_ignore_all - for "-Wmissing-declarations" - is added to the __bpf_kfunc_start_defs macro based on feedback from Andrii on an earlier version of this patch [0] and another recent mailing list thread [1]. In the future we might need to ignore different warnings or do other kfunc-specific things. This change will make it easier to make such modifications for all kfunc defs. [0]: https://lore.kernel.org/bpf/CAEf4BzaE5dRWtK6RPLnjTW-MW9sx9K3Fn6uwqCTChK2Dcb1Xig@mail.gmail.com/ [1]: https://lore.kernel.org/bpf/ZT+2qCc%2FaXep0%2FLf@krava/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Cc: Jiri Olsa <olsajiri@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: David Vernet <void@manifault.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20231031215625.2343848-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-02Merge tag 'net-next-6.7-followup' of ↵Linus Torvalds2-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull more networking updates from Jakub Kicinski: - Support GRO decapsulation for IPsec ESP in UDP - Add a handful of MODULE_DESCRIPTION()s - Drop questionable alignment check in TCP AO to avoid build issue after changes in the crypto tree * tag 'net-next-6.7-followup' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: net: tcp: remove call to obsolete crypto_ahash_alignmask() net: fill in MODULE_DESCRIPTION()s under drivers/net/ net: fill in MODULE_DESCRIPTION()s under net/802* net: fill in MODULE_DESCRIPTION()s under net/core net: fill in MODULE_DESCRIPTION()s in kuba@'s modules xfrm: policy: fix layer 4 flowi decoding xfrm Fix use after free in __xfrm6_udp_encap_rcv. xfrm: policy: replace session decode with flow dissector xfrm: move mark and oif flowi decode into common code xfrm: pass struct net to xfrm_decode_session wrappers xfrm: Support GRO for IPv6 ESP in UDP encapsulation xfrm: Support GRO for IPv4 ESP in UDP encapsulation xfrm: Use the XFRM_GRO to indicate a GRO call on input xfrm: Annotate struct xfrm_sec_ctx with __counted_by xfrm: Remove unused function declarations
2023-11-02Merge tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds1-8/+0
Pull io_uring {get,set}sockopt support from Jens Axboe: "This adds support for using getsockopt and setsockopt via io_uring. The main use cases for this is to enable use of direct descriptors, rather than first instantiating a normal file descriptor, doing the option tweaking needed, then turning it into a direct descriptor. With this support, we can avoid needing a regular file descriptor completely. The net and bpf bits have been signed off on their side" * tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux: selftests/bpf/sockopt: Add io_uring support io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT io_uring/cmd: return -EOPNOTSUPP if net is disabled selftests/net: Extract uring helpers to be reusable tools headers: Grab copy of io_uring.h io_uring/cmd: Pass compat mode in issue_flags net/socket: Break down __sys_getsockopt net/socket: Break down __sys_setsockopt bpf: Add sockptr support for setsockopt bpf: Add sockptr support for getsockopt
2023-10-31Merge tag 'net-next-6.7' of ↵Linus Torvalds16-265/+868
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Support usec resolution of TCP timestamps, enabled selectively by a route attribute. - Defer regular TCP ACK while processing socket backlog, try to send a cumulative ACK at the end. Increase single TCP flow performance on a 200Gbit NIC by 20% (100Gbit -> 120Gbit). - The Fair Queuing (FQ) packet scheduler: - add built-in 3 band prio / WRR scheduling - support bypass if the qdisc is mostly idle (5% speed up for TCP RR) - improve inactive flow reporting - optimize the layout of structures for better cache locality - Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern replacement for the old MD5 option. - Add more retransmission timeout (RTO) related statistics to TCP_INFO. - Support sending fragmented skbs over vsock sockets. - Make sure we send SIGPIPE for vsock sockets if socket was shutdown(). - Add sysctl for ignoring lower limit on lifetime in Router Advertisement PIO, based on an in-progress IETF draft. - Add sysctl to control activation of TCP ping-pong mode. - Add sysctl to make connection timeout in MPTCP configurable. - Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps limit the number of wakeups. - Support netlink GET for MDB (multicast forwarding), allowing user space to request a single MDB entry instead of dumping the entire table. - Support selective FDB flushing in the VXLAN tunnel driver. - Allow limiting learned FDB entries in bridges, prevent OOM attacks. - Allow controlling via configfs netconsole targets which were created via the kernel cmdline at boot, rather than via configfs at runtime. - Support multiple PTP timestamp event queue readers with different filters. - MCTP over I3C. BPF: - Add new veth-like netdevice where BPF program defines the logic of the xmit routine. It can operate in L3 and L2 mode. - Support exceptions - allow asserting conditions which should never be true but are hard for the verifier to infer. With some extra flexibility around handling of the exit / failure: https://lwn.net/Articles/938435/ - Add support for local per-cpu kptr, allow allocating and storing per-cpu objects in maps. Access to those objects operates on the value for the current CPU. This allows to deprecate local one-off implementations of per-CPU storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps. - Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is for systemd to re-implement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services. - Enable open-coded task_vma iteration, after maple tree conversion made it hard to directly walk VMAs in tracing programs. - Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF. - Allow source address selection with bpf_*_fib_lookup(). - Add ability to pin BPF timer to the current CPU. - Prevent creation of infinite loops by combining tail calls and fentry/fexit programs. - Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs. - Inherit system settings for CPU security mitigations. - Add BPF v4 CPU instruction support for arm32 and s390x. Changes to common code: - overflow: add DEFINE_FLEX() for on-stack definition of structs with flexible array members. - Process doc update with more guidance for reviewers. Driver API: - Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy mutex in most places and remove a lot of smaller locks. - Create a common DPLL configuration API. Allow configuring and querying state of PLL circuits used for clock syntonization, in network time distribution. - Unify fragmented and full page allocation APIs in page pool code. Let drivers be ignorant of PAGE_SIZE. - Rework PHY state machine to avoid races with calls to phy_stop(). - Notify DSA drivers of MAC address changes on user ports, improve correctness of offloads which depend on matching port MAC addresses. - Allow antenna control on injected WiFi frames. - Reduce the number of variants of napi_schedule(). - Simplify error handling when composing devlink health messages. Misc: - A lot of KCSAN data race "fixes", from Eric. - A lot of __counted_by() annotations, from Kees. - A lot of strncpy -> strscpy and printf format fixes. - Replace master/slave terminology with conduit/user in DSA drivers. - Handful of KUnit tests for netdev and WiFi core. Removed: - AppleTalk COPS. - AppleTalk ipddp. - TI AR7 CPMAC Ethernet driver. Drivers: - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - add a driver for the Intel E2000 IPUs - make CRC/FCS stripping configurable - cross-timestamping for E823 devices - basic support for E830 devices - use aux-bus for managing client drivers - i40e: report firmware versions via devlink - nVidia/Mellanox: - support 4-port NICs - increase max number of channels to 256 - optimize / parallelize SF creation flow - Broadcom (bnxt): - enhance NIC temperature reporting - support PAM4 speeds and lane configuration - Marvell OcteonTX2: - PTP pulse-per-second output support - enable hardware timestamping for VFs - Solarflare/AMD: - conntrack NAT offload and offload for tunnels - Wangxun (ngbe/txgbe): - expose HW statistics - Pensando/AMD: - support PCI level reset - narrow down the condition under which skbs are linearized - Netronome/Corigine (nfp): - support CHACHA20-POLY1305 crypto in IPsec offload - Ethernet NICs embedded, slower, virtual: - Synopsys (stmmac): - add Loongson-1 SoC support - enable use of HW queues with no offload capabilities - enable PPS input support on all 5 channels - increase TX coalesce timer to 5ms - RealTek USB (r8152): improve efficiency of Rx by using GRO frags - xen: support SW packet timestamping - add drivers for implementations based on TI's PRUSS (AM64x EVM) - nVidia/Mellanox Ethernet datacenter switches: - avoid poor HW resource use on Spectrum-4 by better block selection for IPv6 multicast forwarding and ordering of blocks in ACL region - Ethernet embedded switches: - Microchip: - support configuring the drive strength for EMI compliance - ksz9477: partial ACL support - ksz9477: HSR offload - ksz9477: Wake on LAN - Realtek: - rtl8366rb: respect device tree config of the CPU port - Ethernet PHYs: - support Broadcom BCM5221 PHYs - TI dp83867: support hardware LED blinking - CAN: - add support for Linux-PHY based CAN transceivers - at91_can: clean up and use rx-offload helpers - WiFi: - MediaTek (mt76): - new sub-driver for mt7925 USB/PCIe devices - HW wireless <> Ethernet bridging in MT7988 chips - mt7603/mt7628 stability improvements - Qualcomm (ath12k): - WCN7850: - enable 320 MHz channels in 6 GHz band - hardware rfkill support - enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to make scan faster - read board data variant name from SMBIOS - QCN9274: mesh support - RealTek (rtw89): - TDMA-based multi-channel concurrency (MCC) - Silicon Labs (wfx): - Remain-On-Channel (ROC) support - Bluetooth: - ISO: many improvements for broadcast support - mark BCM4378/BCM4387 as BROKEN_LE_CODED - add support for QCA2066 - btmtksdio: enable Bluetooth wakeup from suspend" * tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits) net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos() net: mana: Use xdp_set_features_flag instead of direct assignment vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size() iavf: delete the iavf client interface iavf: add a common function for undoing the interrupt scheme iavf: use unregister_netdev iavf: rely on netdev's own registered state iavf: fix the waiting time for initial reset iavf: in iavf_down, don't queue watchdog_task if comms failed iavf: simplify mutex_trylock+sleep loops iavf: fix comments about old bit locks doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name tools: ynl: introduce option to process unknown attributes or types ipvlan: properly track tx_errors netdevsim: Block until all devices are released nfp: using napi_build_skb() to replace build_skb() net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy" net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN net: dsa: microchip: Refactor switch shutdown routine for WoL preparation ...
2023-10-30Merge tag 'vfs-6.7.iov_iter' of ↵Linus Torvalds2-1/+114
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull iov_iter updates from Christian Brauner: "This contain's David's iov_iter cleanup work to convert the iov_iter iteration macros to inline functions: - Remove last_offset from iov_iter as it was only used by ITER_PIPE - Add a __user tag on copy_mc_to_user()'s dst argument on x86 to match that on powerpc and get rid of a sparse warning - Convert iter->user_backed to user_backed_iter() in the sound PCM driver - Convert iter->user_backed to user_backed_iter() in a couple of infiniband drivers - Renumber the type enum so that the ITER_* constants match the order in iterate_and_advance*() - Since the preceding patch puts UBUF and IOVEC at 0 and 1, change user_backed_iter() to just use the type value and get rid of the extra flag - Convert the iov_iter iteration macros to always-inline functions to make the code easier to follow. It uses function pointers, but they get optimised away - Move the check for ->copy_mc to _copy_from_iter() and copy_page_from_iter_atomic() rather than in memcpy_from_iter_mc() where it gets repeated for every segment. Instead, we check once and invoke a side function that can use iterate_bvec() rather than iterate_and_advance() and supply a different step function - Move the copy-and-csum code to net/ where it can be in proximity with the code that uses it - Fold memcpy_and_csum() in to its two users - Move csum_and_copy_from_iter_full() out of line and merge in csum_and_copy_from_iter() since the former is the only caller of the latter - Move hash_and_copy_to_iter() to net/ where it can be with its only caller" * tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: iov_iter, net: Move hash_and_copy_to_iter() to net/ iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together iov_iter, net: Fold in csum_and_memcpy() iov_iter, net: Move csum_and_copy_to/from_iter() to net/ iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc() iov_iter: Convert iterate*() to inline funcs iov_iter: Derive user-backedness from the iterator type iov_iter: Renumber ITER_* constants infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC sound: Fix snd_pcm_readv()/writev() to use iov access functions iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user() iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE
2023-10-28net: fill in MODULE_DESCRIPTION()s under net/coreJakub Kicinski2-0/+2
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27net: selftests: use ethtool_sprintf()Jakub Kicinski1-6/+3
During a W=1 build GCC 13.2 says: net/core/selftests.c: In function ‘net_selftest_get_strings’: net/core/selftests.c:404:52: error: ‘%s’ directive output may be truncated writing up to 279 bytes into a region of size 28 [-Werror=format-truncation=] 404 | snprintf(p, ETH_GSTRING_LEN, "%2d. %s", i + 1, | ^~ net/core/selftests.c:404:17: note: ‘snprintf’ output between 5 and 284 bytes into a destination of size 32 404 | snprintf(p, ETH_GSTRING_LEN, "%2d. %s", i + 1, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 405 | net_selftests[i].name); | ~~~~~~~~~~~~~~~~~~~~~~ avoid it by using ethtool_sprintf(). Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/r/20231026022916.566661-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27rtnetlink: Add MDB get supportIdo Schimmel1-1/+88
Now that both the bridge and VXLAN drivers implement the MDB get net device operation, expose the functionality to user space by registering a handler for RTM_GETMDB messages. Derive the net device from the ifindex specified in the ancillary header and invoke its MDB get NDO. Note that unlike other get handlers, the allocation of the skb containing the response is not performed in the common rtnetlink code as the size is variable and needs to be determined by the respective driver. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27Merge tag 'for-netdev' of ↵Jakub Kicinski3-0/+24
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-26 We've added 51 non-merge commits during the last 10 day(s) which contain a total of 75 files changed, 5037 insertions(+), 200 deletions(-). The main changes are: 1) Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF, from Chuyi Zhou. 2) Fix BPF verifier's iterator convergence logic to use exact states comparison for convergence checks, from Eduard Zingerman, Andrii Nakryiko and Alexei Starovoitov. 3) Add BPF programmable net device where bpf_mprog defines the logic of its xmit routine. It can operate in L3 and L2 mode, from Daniel Borkmann and Nikolay Aleksandrov. 4) Batch of fixes for BPF per-CPU kptr and re-enable unit_size checking for global per-CPU allocator, from Hou Tao. 5) Fix libbpf which eagerly assumed that SHT_GNU_verdef ELF section was going to be present whenever a binary has SHT_GNU_versym section, from Andrii Nakryiko. 6) Fix BPF ringbuf correctness to fold smp_mb__before_atomic() into atomic_set_release(), from Paul E. McKenney. 7) Add a warning if NAPI callback missed xdp_do_flush() under CONFIG_DEBUG_NET which helps checking if drivers were missing the former, from Sebastian Andrzej Siewior. 8) Fix missed RCU read-lock in bpf_task_under_cgroup() which was throwing a warning under sleepable programs, from Yafang Shao. 9) Avoid unnecessary -EBUSY from htab_lock_bucket by disabling IRQ before checking map_locked, from Song Liu. 10) Make BPF CI linked_list failure test more robust, from Kumar Kartikeya Dwivedi. 11) Enable samples/bpf to be built as PIE in Fedora, from Viktor Malik. 12) Fix xsk starving when multiple xsk sockets were associated with a single xsk_buff_pool, from Albert Huang. 13) Clarify the signed modulo implementation for the BPF ISA standardization document that it uses truncated division, from Dave Thaler. 14) Improve BPF verifier's JEQ/JNE branch taken logic to also consider signed bounds knowledge, from Andrii Nakryiko. 15) Add an option to XDP selftests to use multi-buffer AF_XDP xdp_hw_metadata and mark used XDP programs as capable to use frags, from Larysa Zaremba. 16) Fix bpftool's BTF dumper wrt printing a pointer value and another one to fix struct_ops dump in an array, from Manu Bretelle. * tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (51 commits) netkit: Remove explicit active/peer ptr initialization selftests/bpf: Fix selftests broken by mitigations=off samples/bpf: Allow building with custom bpftool samples/bpf: Fix passing LDFLAGS to libbpf samples/bpf: Allow building with custom CFLAGS/LDFLAGS bpf: Add more WARN_ON_ONCE checks for mismatched alloc and free selftests/bpf: Add selftests for netkit selftests/bpf: Add netlink helper library bpftool: Extend net dump with netkit progs bpftool: Implement link show support for netkit libbpf: Add link-based API for netkit tools: Sync if_link uapi header netkit, bpf: Add bpf programmable net device bpf: Improve JEQ/JNE branch taken logic bpf: Fold smp_mb__before_atomic() into atomic_set_release() bpf: Fix unnecessary -EBUSY from htab_lock_bucket xsk: Avoid starving the xsk further down the list bpf: print full verifier states on infinite loop detection selftests/bpf: test if state loops are detected in a tricky case bpf: correct loop detection for iterators convergence ... ==================== Link: https://lore.kernel.org/r/20231026150509.2824-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-32/+35
Cross-merge networking fixes after downstream PR. Conflicts: net/mac80211/rx.c 91535613b609 ("wifi: mac80211: don't drop all unprotected public action frames") 6c02fab72429 ("wifi: mac80211: split ieee80211_drop_unencrypted_mgmt() return value") Adjacent changes: drivers/net/ethernet/apm/xgene/xgene_enet_main.c 61471264c018 ("net: ethernet: apm: Convert to platform remove callback returning void") d2ca43f30611 ("net: xgene: Fix unused xgene_enet_of_match warning for !CONFIG_OF") net/vmw_vsock/virtio_transport.c 64c99d2d6ada ("vsock/virtio: support to send non-linear skb") 53b08c498515 ("vsock/virtio: initialize the_virtio_vsock before using VQs") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: dsa: Use conduit and user termsFlorian Fainelli1-1/+1
Use more inclusive terms throughout the DSA subsystem by moving away from "master" which is replaced by "conduit" and "slave" which is replaced by "user". No functional changes. Acked-by: Rob Herring <robh@kernel.org> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/20231023181729.1191071-2-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: remove else after return in dev_prep_valid_name()Jakub Kicinski1-5/+4
Remove unnecessary else clauses after return. I copied this if / else construct from somewhere, it makes the code harder to read. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: remove dev_valid_name() check from __dev_alloc_name()Jakub Kicinski1-3/+0
__dev_alloc_name() is only called by dev_prep_valid_name(), which already checks that name is valid. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: trust the bitmap in __dev_alloc_name()Jakub Kicinski1-11/+4
Prior to restructuring __dev_alloc_name() handled both printf and non-printf names. In a clever attempt at code reuse it always prints the name into a buffer and checks if it's a duplicate. Trust the bitmap, and return an error if its full. This shrinks the possible ID space by one from 32K to 32K - 1, as previously the max value would have been tried as a valid ID. It seems very unlikely that anyone would care as we heard no requests to increase the max beyond 32k. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: reduce indentation of __dev_alloc_name()Jakub Kicinski1-30/+26
All callers of __dev_valid_name() go thru dev_prep_valid_name() which handles the non-printf case. Focus __dev_alloc_name() on the sprintf case, remove the indentation level. Minor functional change of returning -EINVAL if % is not found, which should now never happen. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: make dev_alloc_name() call dev_prep_valid_name()Jakub Kicinski1-9/+11
__dev_alloc_name() handles both the sprintf and non-sprintf target names. This complicates the code. dev_prep_valid_name() already handles the non-sprintf case, before calling __dev_alloc_name(), make the only other caller also go thru dev_prep_valid_name(). This way we can drop the non-sprintf handling in __dev_alloc_name() in one of the next changes. commit 55a5ec9b7710 ("Revert "net: core: dev_get_valid_name is now the same as dev_alloc_name_ns"") and commit 029b6d140550 ("Revert "net: core: maybe return -EEXIST in __dev_alloc_name"") tell us that we can't start returning -EEXIST from dev_alloc_name() on name duplicates. Bite the bullet and pass the expected errno to dev_prep_valid_name(). dev_prep_valid_name() must now propagate out the allocated id for printf names. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: don't use input buffer of __dev_alloc_name() as a scratch spaceJakub Kicinski1-25/+8
Callers of __dev_alloc_name() want to pass dev->name as the output buffer. Make __dev_alloc_name() not clobber that buffer on failure, and remove the workarounds in callers. dev_alloc_name_ns() is now completely unnecessary. The extra strscpy() added here will be gone by the end of the patch series. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24sock: Ignore memcg pressure heuristics when raising allocatedAbel Wu1-2/+12
Before sockets became aware of net-memcg's memory pressure since commit e1aab161e013 ("socket: initial cgroup code."), the memory usage would be granted to raise if below average even when under protocol's pressure. This provides fairness among the sockets of same protocol. That commit changes this because the heuristic will also be effective when only memcg is under pressure which makes no sense. So revert that behavior. After reverting, __sk_mem_raise_allocated() no longer considers memcg's pressure. As memcgs are isolated from each other w.r.t. memory accounting, consuming one's budget won't affect others. So except the places where buffer sizes are needed to be tuned, allow workloads to use the memory they are provisioned. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231019120026.42215-3-wuyun.abel@bytedance.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-24sock: Doc behaviors for pressure heurisiticsAbel Wu1-1/+13
There are now two accounting infrastructures for skmem, while the heuristics in __sk_mem_raise_allocated() were actually introduced before memcg was born. Add some comments to clarify whether they can be applied to both infrastructures or not. Suggested-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231019120026.42215-2-wuyun.abel@bytedance.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-24sock: Code cleanup on __sk_mem_raise_allocated()Abel Wu1-10/+12
Code cleanup for both better simplicity and readability. No functional change intended. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231019120026.42215-1-wuyun.abel@bytedance.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-24page_pool: introduce page_pool_alloc() APIYunsheng Lin1-2/+2
Currently page pool supports the below use cases: use case 1: allocate page without page splitting using page_pool_alloc_pages() API if the driver knows that the memory it need is always bigger than half of the page allocated from page pool. use case 2: allocate page frag with page splitting using page_pool_alloc_frag() API if the driver knows that the memory it need is always smaller than or equal to the half of the page allocated from page pool. There is emerging use case [1] & [2] that is a mix of the above two case: the driver doesn't know the size of memory it need beforehand, so the driver may use something like below to allocate memory with least memory utilization and performance penalty: if (size << 1 > max_size) page = page_pool_alloc_pages(); else page = page_pool_alloc_frag(); To avoid the driver doing something like above, add the page_pool_alloc() API to support the above use case, and update the true size of memory that is acctually allocated by updating '*size' back to the driver in order to avoid exacerbating truesize underestimate problem. Rename page_pool_free() which is used in the destroy process to __page_pool_destroy() to avoid confusion with the newly added API. 1. https://lore.kernel.org/all/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626.git.lorenzo@kernel.org/ 2. https://lore.kernel.org/all/20230526054621.18371-3-liangchen.linux@gmail.com/ Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Liang Chen <liangchen.linux@gmail.com> CC: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20231020095952.11055-4-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24page_pool: remove PP_FLAG_PAGE_FRAGYunsheng Lin2-3/+2
PP_FLAG_PAGE_FRAG is not really needed after pp_frag_count handling is unified and page_pool_alloc_frag() is supported in 32-bit arch with 64-bit DMA, so remove it. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Liang Chen <liangchen.linux@gmail.com> CC: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20231020095952.11055-3-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24page_pool: unify frag_count handling in page_pool_is_last_frag()Yunsheng Lin1-1/+9
Currently when page_pool_create() is called with PP_FLAG_PAGE_FRAG flag, page_pool_alloc_pages() is only allowed to be called under the below constraints: 1. page_pool_fragment_page() need to be called to setup page->pp_frag_count immediately. 2. page_pool_defrag_page() often need to be called to drain the page->pp_frag_count when there is no more user will be holding on to that page. Those constraints exist in order to support a page to be split into multi fragments. And those constraints have some overhead because of the cache line dirtying/bouncing and atomic update. Those constraints are unavoidable for case when we need a page to be split into more than one fragment, but there is also case that we want to avoid the above constraints and their overhead when a page can't be split as it can only hold a fragment as requested by user, depending on different use cases: use case 1: allocate page without page splitting. use case 2: allocate page with page splitting. use case 3: allocate page with or without page splitting depending on the fragment size. Currently page pool only provide page_pool_alloc_pages() and page_pool_alloc_frag() API to enable the 1 & 2 separately, so we can not use a combination of 1 & 2 to enable 3, it is not possible yet because of the per page_pool flag PP_FLAG_PAGE_FRAG. So in order to allow allocating unsplit page without the overhead of split page while still allow allocating split page we need to remove the per page_pool flag in page_pool_is_last_frag(), as best as I can think of, it seems there are two methods as below: 1. Add per page flag/bit to indicate a page is split or not, which means we might need to update that flag/bit everytime the page is recycled, dirtying the cache line of 'struct page' for use case 1. 2. Unify the page->pp_frag_count handling for both split and unsplit page by assuming all pages in the page pool is split into a big fragment initially. As page pool already supports use case 1 without dirtying the cache line of 'struct page' whenever a page is recyclable, we need to support the above use case 3 with minimal overhead, especially not adding any noticeable overhead for use case 1, and we are already doing an optimization by not updating pp_frag_count in page_pool_defrag_page() for the last fragment user, this patch chooses to unify the pp_frag_count handling to support the above use case 3. There is no noticeable performance degradation and some justification for unifying the frag_count handling with this patch applied using a micro-benchmark testing in [1]. 1. https://lore.kernel.org/all/bf2591f8-7b3c-4480-bb2c-31dc9da1d6ac@huawei.com/ Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Liang Chen <liangchen.linux@gmail.com> CC: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20231020095952.11055-2-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-21neighbour: fix various data-racesEric Dumazet1-32/+35
1) tbl->gc_thresh1, tbl->gc_thresh2, tbl->gc_thresh3 and tbl->gc_interval can be written from sysfs. 2) tbl->last_flush is read locklessly from neigh_alloc() 3) tbl->proxy_queue.qlen is read locklessly from neightbl_fill_info() 4) neightbl_fill_info() reads cpu stats that can be changed concurrently. Fixes: c7fb64db001f ("[NETLINK]: Neighbour table configuration and statistics via rtnetlink") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231019122104.1448310-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-20net/socket: Break down __sys_getsockoptBreno Leitao1-8/+0
Split __sys_getsockopt() into two functions by removing the core logic into a sub-function (do_sock_getsockopt()). This will avoid code duplication when doing the same operation in other callers, for instance. do_sock_getsockopt() will be called by io_uring getsockopt() command operation in the following patch. The same was done for the setsockopt pair. Suggested-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20231016134750.1381153-5-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski5-30/+68
Cross-merge networking fixes after downstream PR. net/mac80211/key.c 02e0e426a2fb ("wifi: mac80211: fix error path key leak") 2a8b665e6bcc ("wifi: mac80211: remove key_mtx") 7d6904bf26b9 ("Merge wireless into wireless-next") https://lore.kernel.org/all/20231012113648.46eea5ec@canb.auug.org.au/ Adjacent changes: drivers/net/ethernet/ti/Kconfig a602ee3176a8 ("net: ethernet: ti: Fix mixed module-builtin object") 98bdeae9502b ("net: cpmac: remove driver to prepare for platform removal") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19net: move altnames together with the netdeviceJakub Kicinski1-4/+9
The altname nodes are currently not moved to the new netns when netdevice itself moves: [ ~]# ip netns add test [ ~]# ip -netns test link add name eth0 type dummy [ ~]# ip -netns test link property add dev eth0 altname some-name [ ~]# ip -netns test link show dev some-name 2: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 1e:67:ed:19:3d:24 brd ff:ff:ff:ff:ff:ff altname some-name [ ~]# ip -netns test link set dev eth0 netns 1 [ ~]# ip link ... 3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff altname some-name [ ~]# ip li show dev some-name Device "some-name" does not exist. Remove them from the hash table when device is unlisted and add back when listed again. Fixes: 36fbf1e52bd3 ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-19net: avoid UAF on deleted altnameJakub Kicinski1-1/+6
Altnames are accessed under RCU (dev_get_by_name_rcu()) but freed by kfree() with no synchronization point. Each node has one or two allocations (node and a variable-size name, sometimes the name is netdev->name). Adding rcu_heads here is a bit tedious. Besides most code which unlists the names already has rcu barriers - so take the simpler approach of adding synchronize_rcu(). Note that the one on the unregistration path (which matters more) is removed by the next fix. Fixes: ff92741270bf ("net: introduce name_node struct to be used in hashlist") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-19net: check for altname conflicts when changing netdev's netnsJakub Kicinski2-1/+11
It's currently possible to create an altname conflicting with an altname or real name of another device by creating it in another netns and moving it over: [ ~]$ ip link add dev eth0 type dummy [ ~]$ ip netns add test [ ~]$ ip -netns test link add dev ethX netns test type dummy [ ~]$ ip -netns test link property add dev ethX altname eth0 [ ~]$ ip -netns test link set dev ethX netns 1 [ ~]$ ip link ... 3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff ... 5: ethX: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 26:b7:28:78:38:0f brd ff:ff:ff:ff:ff:ff altname eth0 Create a macro for walking the altnames, this hopefully makes it clearer that the list we walk contains only altnames. Which is otherwise not entirely intuitive. Fixes: 36fbf1e52bd3 ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-19net: fix ifname in netlink ntf during netns moveJakub Kicinski1-13/+31
dev_get_valid_name() overwrites the netdev's name on success. This makes it hard to use in prepare-commit-like fashion, where we do validation first, and "commit" to the change later. Factor out a helper which lets us save the new name to a buffer. Use it to fix the problem of notification on netns move having incorrect name: 5: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff 6: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default link/ether 1e:4a:34:36:e3:cd brd ff:ff:ff:ff:ff:ff [ ~]# ip link set dev eth0 netns 1 name eth1 ip monitor inside netns: Deleted inet eth0 Deleted inet6 eth0 Deleted 5: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff new-netnsid 0 new-ifindex 7 Name is reported as eth1 in old netns for ifindex 5, already renamed. Fixes: d90310243fd7 ("net: device name allocation cleanups") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-19net: introduce napi_is_scheduled helperChristian Marangi1-1/+1
We currently have napi_if_scheduled_mark_missed that can be used to check if napi is scheduled but that does more thing than simply checking it and return a bool. Some driver already implement custom function to check if napi is scheduled. Drop these custom function and introduce napi_is_scheduled that simply check if napi is scheduled atomically. Update any driver and code that implement a similar check and instead use this new helper. Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-18net: skb_find_text: Ignore patterns extending past 'to'Phil Sutter1-1/+2
Assume that caller's 'to' offset really represents an upper boundary for the pattern search, so patterns extending past this offset are to be rejected. The old behaviour also was kind of inconsistent when it comes to fragmentation (or otherwise non-linear skbs): If the pattern started in between 'to' and 'from' offsets but extended to the next fragment, it was not found if 'to' offset was still within the current fragment. Test the new behaviour in a kselftest using iptables' string match. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Fixes: f72b948dcbb8 ("[NET]: skb_find_text ignores to argument") Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: Florian Westphal <fw@strlen.de> Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18net: pktgen: Fix interface flags printingGavrilov Ilia1-7/+7
Device flags are displayed incorrectly: 1) The comparison (i == F_FLOW_SEQ) is always false, because F_FLOW_SEQ is equal to (1 << FLOW_SEQ_SHIFT) == 2048, and the maximum value of the 'i' variable is (NR_PKT_FLAG - 1) == 17. It should be compared with FLOW_SEQ_SHIFT. 2) Similarly to the F_IPSEC flag. 3) Also add spaces to the print end of the string literal "spi:%u" to prevent the output from merging with the flag that follows. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 99c6d3d20d62 ("pktgen: Remove brute-force printing of flags") Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-17net, bpf: Add a warning if NAPI cb missed xdp_do_flush().Sebastian Andrzej Siewior3-0/+24
A few drivers were missing a xdp_do_flush() invocation after XDP_REDIRECT. Add three helper functions each for one of the per-CPU lists. Return true if the per-CPU list is non-empty and flush the list. Add xdp_do_check_flushed() which invokes each helper functions and creates a warning if one of the functions had a non-empty list. Hide everything behind CONFIG_DEBUG_NET. Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20231016125738.Yt79p1uF@linutronix.de
2023-10-17net: gso_test: release each segment individuallyFlorian Westphal1-5/+9
consume_skb() doesn't walk the segment list, so segments other than the first are leaked. Move this skb_consume call into the loop. Cc: Willem de Bruijn <willemb@google.com> Fixes: b3098d32ed6e ("net: add skb_segment kunit test") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>