summaryrefslogtreecommitdiff
path: root/include/linux/skbuff.h
AgeCommit message (Collapse)AuthorFilesLines
2024-01-17netfilter: bridge: replace physindev with physinif in nf_bridge_infoPavel Tikhomirov1-1/+1
An skb can be added to a neigh->arp_queue while waiting for an arp reply. Where original skb's skb->dev can be different to neigh's neigh->dev. For instance in case of bridging dnated skb from one veth to another, the skb would be added to a neigh->arp_queue of the bridge. As skb->dev can be reset back to nf_bridge->physindev and used, and as there is no explicit mechanism that prevents this physindev from been freed under us (for instance neigh_flush_dev doesn't cleanup skbs from different device's neigh queue) we can crash on e.g. this stack: arp_process neigh_update skb = __skb_dequeue(&neigh->arp_queue) neigh_resolve_output(..., skb) ... br_nf_dev_xmit br_nf_pre_routing_finish_bridge_slow skb->dev = nf_bridge->physindev br_handle_frame_finish Let's use plain ifindex instead of net_device link. To peek into the original net_device we will use dev_get_by_index_rcu(). Thus either we get device and are safe to use it or we don't get it and drop skb. Fixes: c4e70a87d975 ("netfilter: bridge: rename br_netfilter.c to br_netfilter_hooks.c") Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-12-27net: rename dsa_realloc_skb to skb_ensure_writable_head_tailRadu Pirea (NXP OSS)1-0/+1
Rename dsa_realloc_skb to skb_ensure_writable_head_tail and move it to skbuff.c to use it as helper. Signed-off-by: Radu Pirea (NXP OSS) <radu-nicolae.pirea@oss.nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-23net: skbuff: Remove some excess struct-member documentationJonathan Corbet1-2/+0
Remove documentation for nonexistent structure members, addressing these warnings: ./include/linux/skbuff.h:1063: warning: Excess struct member 'sp' description in 'sk_buff' ./include/linux/skbuff.h:1063: warning: Excess struct member 'nf_bridge' description in 'sk_buff' Signed-off-by: Jonathan Corbet <corbet@lwn.net> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-19Merge tag 'for-netdev' of ↵Jakub Kicinski1-5/+8
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-12-18 This PR is larger than usual and contains changes in various parts of the kernel. The main changes are: 1) Fix kCFI bugs in BPF, from Peter Zijlstra. End result: all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y. 2) Introduce BPF token object, from Andrii Nakryiko. It adds an ability to delegate a subset of BPF features from privileged daemon (e.g., systemd) through special mount options for userns-bound BPF FS to a trusted unprivileged application. The design accommodates suggestions from Christian Brauner and Paul Moore. Example: $ sudo mkdir -p /sys/fs/bpf/token $ sudo mount -t bpf bpffs /sys/fs/bpf/token \ -o delegate_cmds=prog_load:MAP_CREATE \ -o delegate_progs=kprobe \ -o delegate_attachs=xdp 3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei. - Complete precision tracking support for register spills - Fix verification of possibly-zero-sized stack accesses - Fix access to uninit stack slots - Track aligned STACK_ZERO cases as imprecise spilled registers. It improves the verifier "instructions processed" metric from single digit to 50-60% for some programs. - Fix verifier retval logic 4) Support for VLAN tag in XDP hints, from Larysa Zaremba. 5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu. End result: better memory utilization and lower I$ miss for calls to BPF via BPF trampoline. 6) Fix race between BPF prog accessing inner map and parallel delete, from Hou Tao. 7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu. It allows BPF interact with IPSEC infra. The intent is to support software RSS (via XDP) for the upcoming ipsec pcpu work. Experiments on AWS demonstrate single tunnel pcpu ipsec reaching line rate on 100G ENA nics. 8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao. 9) BPF file verification via fsverity, from Song Liu. It allows BPF progs get fsverity digest. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits) bpf: Ensure precise is reset to false in __mark_reg_const_zero() selftests/bpf: Add more uprobe multi fail tests bpf: Fail uprobe multi link with negative offset selftests/bpf: Test the release of map btf s390/bpf: Fix indirect trampoline generation selftests/bpf: Temporarily disable dummy_struct_ops test on s390 x86/cfi,bpf: Fix bpf_exception_cb() signature bpf: Fix dtor CFI cfi: Add CFI_NOSEAL() x86/cfi,bpf: Fix bpf_struct_ops CFI x86/cfi,bpf: Fix bpf_callback_t CFI x86/cfi,bpf: Fix BPF JIT call cfi: Flip headers selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment bpf: Limit the number of kprobes when attaching program to multiple kprobes bpf: Limit the number of uprobes when attaching program to multiple uprobes bpf: xdp: Register generic_kfunc_set with XDP programs selftests/bpf: utilize string values for delegate_xxx mount options ... ==================== Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-15net: skbuff: fix spelling errorsRandy Dunlap1-3/+3
Correct spelling as reported by codespell. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231213043511.10357-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-11net, xdp: Allow metadata > 32Aleksander Lobakin1-5/+8
32 bytes may be not enough for some custom metadata. Relax the restriction, allow metadata larger than 32 bytes and make __skb_metadata_differs() work with bigger lengths. Now size of metadata is only limited by the fact it is stored as u8 in skb_shared_info, so maximum possible value is 255. Size still has to be aligned to 4, so the actual upper limit becomes 252. Most driver implementations will offer less, none can offer more. Other important conditions, such as having enough space for xdp_frame building, are already checked in bpf_xdp_adjust_meta(). Signed-off-by: Aleksander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/eb87653c-8ff8-447d-a7a1-25961f60518a@kernel.org Link: https://lore.kernel.org/bpf/20231206205919.404415-3-larysa.zaremba@intel.com
2023-11-30xsk: Add TX timestamp and TX checksum offload supportStanislav Fomichev1-1/+13
This change actually defines the (initial) metadata layout that should be used by AF_XDP userspace (xsk_tx_metadata). The first field is flags which requests appropriate offloads, followed by the offload-specific fields. The supported per-device offloads are exported via netlink (new xsk-flags). The offloads themselves are still implemented in a bit of a framework-y fashion that's left from my initial kfunc attempt. I'm introducing new xsk_tx_metadata_ops which drivers are supposed to implement. The drivers are also supposed to call xsk_tx_metadata_request/xsk_tx_metadata_complete in the right places. Since xsk_tx_metadata_{request,_complete} are static inline, we don't incur any extra overhead doing indirect calls. The benefit of this scheme is as follows: - keeps all metadata layout parsing away from driver code - makes it easy to grep and see which drivers implement what - don't need any extra flags to maintain to keep track of what offloads are implemented; if the callback is implemented - the offload is supported (used by netlink reporting code) Two offloads are defined right now: 1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset 2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata area upon completion (tx_timestamp field) XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass metadata pointer). The struct is forward-compatible and can be extended in the future by appending more fields. Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20231127190319.1190813-3-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-30Merge tag 'vfs-6.7.iov_iter' of ↵Linus Torvalds1-0/+3
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull iov_iter updates from Christian Brauner: "This contain's David's iov_iter cleanup work to convert the iov_iter iteration macros to inline functions: - Remove last_offset from iov_iter as it was only used by ITER_PIPE - Add a __user tag on copy_mc_to_user()'s dst argument on x86 to match that on powerpc and get rid of a sparse warning - Convert iter->user_backed to user_backed_iter() in the sound PCM driver - Convert iter->user_backed to user_backed_iter() in a couple of infiniband drivers - Renumber the type enum so that the ITER_* constants match the order in iterate_and_advance*() - Since the preceding patch puts UBUF and IOVEC at 0 and 1, change user_backed_iter() to just use the type value and get rid of the extra flag - Convert the iov_iter iteration macros to always-inline functions to make the code easier to follow. It uses function pointers, but they get optimised away - Move the check for ->copy_mc to _copy_from_iter() and copy_page_from_iter_atomic() rather than in memcpy_from_iter_mc() where it gets repeated for every segment. Instead, we check once and invoke a side function that can use iterate_bvec() rather than iterate_and_advance() and supply a different step function - Move the copy-and-csum code to net/ where it can be in proximity with the code that uses it - Fold memcpy_and_csum() in to its two users - Move csum_and_copy_from_iter_full() out of line and merge in csum_and_copy_from_iter() since the former is the only caller of the latter - Move hash_and_copy_to_iter() to net/ where it can be with its only caller" * tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: iov_iter, net: Move hash_and_copy_to_iter() to net/ iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together iov_iter, net: Fold in csum_and_memcpy() iov_iter, net: Move csum_and_copy_to/from_iter() to net/ iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc() iov_iter: Convert iterate*() to inline funcs iov_iter: Derive user-backedness from the iterator type iov_iter: Renumber ITER_* constants infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC sound: Fix snd_pcm_readv()/writev() to use iov access functions iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user() iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE
2023-10-11net: skbuff: fix kernel-doc typosRandy Dunlap1-2/+2
Correct punctuation and drop an extraneous word. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231008214121.25940-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-09iov_iter, net: Merge csum_and_copy_from_iter{,_full}() togetherDavid Howells1-17/+2
Move csum_and_copy_from_iter_full() out of line and then merge csum_and_copy_from_iter() into its only caller. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230925120309.1731676-12-dhowells@redhat.com cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Christian Brauner <christian@brauner.io> cc: Matthew Wilcox <willy@infradead.org> cc: Linus Torvalds <torvalds@linux-foundation.org> cc: David Laight <David.Laight@ACULAB.COM> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org cc: netdev@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-09iov_iter, net: Fold in csum_and_memcpy()David Howells1-7/+0
Fold csum_and_memcpy() in to its callers. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230925120309.1731676-11-dhowells@redhat.com cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Christian Brauner <christian@brauner.io> cc: Matthew Wilcox <willy@infradead.org> cc: Linus Torvalds <torvalds@linux-foundation.org> cc: David Laight <David.Laight@ACULAB.COM> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org cc: netdev@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-09iov_iter, net: Move csum_and_copy_to/from_iter() to net/David Howells1-0/+25
Move csum_and_copy_to/from_iter() to net code now that the iteration framework can be #included. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230925120309.1731676-10-dhowells@redhat.com cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Christian Brauner <christian@brauner.io> cc: Matthew Wilcox <willy@infradead.org> cc: Linus Torvalds <torvalds@linux-foundation.org> cc: David Laight <David.Laight@ACULAB.COM> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org cc: netdev@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-20net: selectively purge error queue in IP_RECVERR / IPV6_RECVERREric Dumazet1-0/+1
Setting IP_RECVERR and IPV6_RECVERR options to zero currently purges the socket error queue, which was probably not expected for zerocopy and tx_timestamp users. I discovered this issue while preparing commit 6b5f43ea0815 ("inet: move inet->recverr to inet->inet_flags"), I presume this change does not need to be backported to stable kernels. Add skb_errqueue_purge() helper to purge error messages only. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-19net: add skb_queue_purge_reason and __skb_queue_purge_reasonEric Dumazet1-4/+19
skb_queue_purge() and __skb_queue_purge() become wrappers around the new generic functions. New SKB_DROP_REASON_QUEUE_PURGE drop reason is added, but users can start adding more specific reasons. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-07net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>Alexander Lobakin1-2/+3
Currently, touching <net/page_pool/types.h> triggers a rebuild of more than half of the kernel. That's because it's included in <linux/skbuff.h>. And each new include to page_pool/types.h adds more [useless] data for the toolchain to process per each source file from that pile. In commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling"), Matteo included it to be able to call a couple of functions defined there. Then, in commit 57f05bc2ab24 ("page_pool: keep pp info as long as page pool owns the page") one of the calls was removed, so only one was left. It's the call to page_pool_return_skb_page() in napi_frag_unref(). The function is external and doesn't have any dependencies. Having very niche page_pool_types.h included only for that looks like an overkill. As %PP_SIGNATURE is not local to page_pool.c (was only in the early submissions), nothing holds this function there. Teleport page_pool_return_skb_page() to skbuff.c, just next to the main consumer, skb_pp_recycle(), and rename it to napi_pp_put_page(), as it doesn't work with skbs at all and the former name tells nothing. The #if guards here are only to not compile and have it in the vmlinux when not needed -- both call sites are already guarded. Now, touching page_pool_types.h only triggers rebuilding of the drivers using it and a couple of core networking files. Suggested-by: Jakub Kicinski <kuba@kernel.org> # make skbuff.h less heavy Suggested-by: Alexander Duyck <alexanderduyck@fb.com> # move to skbuff.c Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230804180529.2483231-3-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-07page_pool: split types and declarations from page_pool.hYunsheng Lin1-1/+1
Split types and pure function declarations from page_pool.h and add them in page_page/types.h, so that C sources can include page_pool.h and headers should generally only include page_pool/types.h as suggested by jakub. Rename page_pool.h to page_pool/helpers.h to have both in one place. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com [Jakub: change microsoft/mana, fix kdoc paths in Documentation] Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-26net: skbuff: remove unused HAVE_HW_TIME_STAMP feature definePeter Seiderer1-2/+0
Remove unused HAVE_HW_TIME_STAMP feature define (introduced by commit ac45f602ee3d ("net: infrastructure for hardware time stamping"). Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-19bpf, net: Introduce skb_pointer_if_linear().Alexei Starovoitov1-1/+9
Network drivers always call skb_header_pointer() with non-null buffer. Remove !buffer check to prevent accidental misuse of skb_header_pointer(). Introduce skb_pointer_if_linear() instead. Reported-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230718234021.43640-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19bpf: Add fd-based tcx multi-prog infra with link supportDaniel Borkmann1-2/+2
This work refactors and adds a lightweight extension ("tcx") to the tc BPF ingress and egress data path side for allowing BPF program management based on fds via bpf() syscall through the newly added generic multi-prog API. The main goal behind this work which we also presented at LPC [0] last year and a recent update at LSF/MM/BPF this year [3] is to support long-awaited BPF link functionality for tc BPF programs, which allows for a model of safe ownership and program detachment. Given the rise in tc BPF users in cloud native environments, this becomes necessary to avoid hard to debug incidents either through stale leftover programs or 3rd party applications accidentally stepping on each others toes. As a recap, a BPF link represents the attachment of a BPF program to a BPF hook point. The BPF link holds a single reference to keep BPF program alive. Moreover, hook points do not reference a BPF link, only the application's fd or pinning does. A BPF link holds meta-data specific to attachment and implements operations for link creation, (atomic) BPF program update, detachment and introspection. The motivation for BPF links for tc BPF programs is multi-fold, for example: - From Meta: "It's especially important for applications that are deployed fleet-wide and that don't "control" hosts they are deployed to. If such application crashes and no one notices and does anything about that, BPF program will keep running draining resources or even just, say, dropping packets. We at FB had outages due to such permanent BPF attachment semantics. With fd-based BPF link we are getting a framework, which allows safe, auto-detachable behavior by default, unless application explicitly opts in by pinning the BPF link." [1] - From Cilium-side the tc BPF programs we attach to host-facing veth devices and phys devices build the core datapath for Kubernetes Pods, and they implement forwarding, load-balancing, policy, EDT-management, etc, within BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently experienced hard-to-debug issues in a user's staging environment where another Kubernetes application using tc BPF attached to the same prio/handle of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath it. The goal is to establish a clear/safe ownership model via links which cannot accidentally be overridden. [0,2] BPF links for tc can co-exist with non-link attachments, and the semantics are in line also with XDP links: BPF links cannot replace other BPF links, BPF links cannot replace non-BPF links, non-BPF links cannot replace BPF links and lastly only non-BPF links can replace non-BPF links. In case of Cilium, this would solve mentioned issue of safe ownership model as 3rd party applications would not be able to accidentally wipe Cilium programs, even if they are not BPF link aware. Earlier attempts [4] have tried to integrate BPF links into core tc machinery to solve cls_bpf, which has been intrusive to the generic tc kernel API with extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could be wiped from the qdisc also. Locking a tc BPF program in place this way, is getting into layering hacks given the two object models are vastly different. We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF attach API, so that the BPF link implementation blends in naturally similar to other link types which are fd-based and without the need for changing core tc internal APIs. BPF programs for tc can then be successively migrated from classic cls_bpf to the new tc BPF link without needing to change the program's source code, just the BPF loader mechanics for attaching is sufficient. For the current tc framework, there is no change in behavior with this change and neither does this change touch on tc core kernel APIs. The gist of this patch is that the ingress and egress hook have a lightweight, qdisc-less extension for BPF to attach its tc BPF programs, in other words, a minimal entry point for tc BPF. The name tcx has been suggested from discussion of earlier revisions of this work as a good fit, and to more easily differ between the classic cls_bpf attachment and the fd-based one. For the ingress and egress tcx points, the device holds a cache-friendly array with program pointers which is separated from control plane (slow-path) data. Earlier versions of this work used priority to determine ordering and expression of dependencies similar as with classic tc, but it was challenged that for something more future-proof a better user experience is required. Hence this resulted in the design and development of the generic attach/detach/query API for multi-progs. See prior patch with its discussion on the API design. tcx is the first user and later we plan to integrate also others, for example, one candidate is multi-prog support for XDP which would benefit and have the same 'look and feel' from API perspective. The goal with tcx is to have maximum compatibility to existing tc BPF programs, so they don't need to be rewritten specifically. Compatibility to call into classic tcf_classify() is also provided in order to allow successive migration or both to cleanly co-exist where needed given its all one logical tc layer and the tcx plus classic tc cls/act build one logical overall processing pipeline. tcx supports the simplified return codes TCX_NEXT which is non-terminating (go to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT. The fd-based API is behind a static key, so that when unused the code is also not entered. The struct tcx_entry's program array is currently static, but could be made dynamic if necessary at a point in future. The a/b pair swap design has been chosen so that for detachment there are no allocations which otherwise could fail. The work has been tested with tc-testing selftest suite which all passes, as well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB. Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews of this work. [0] https://lpc.events/event/16/contributions/1353/ [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-10net: move gso declarations and functions to their own filesEric Dumazet1-71/+0
Move declarations into include/net/gso.h and code into net/core/gso.c Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stanislav Fomichev <sdf@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-31skbuff: bridge: Add layer 2 miss indicationIdo Schimmel1-0/+1
For EVPN non-DF (Designated Forwarder) filtering we need to be able to prevent decapsulated traffic from being flooded to a multi-homed host. Filtering of multicast and broadcast traffic can be achieved using the following flower filter: # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop Unlike broadcast and multicast traffic, it is not currently possible to filter unknown unicast traffic. The classification into unknown unicast is performed by the bridge driver, but is not visible to other layers such as tc. Solve this by adding a new 'l2_miss' bit to the tc skb extension. Clear the bit whenever a packet enters the bridge (received from a bridge port or transmitted via the bridge) and set it if the packet did not match an FDB or MDB entry. If there is no skb extension and the bit needs to be cleared, then do not allocate one as no extension is equivalent to the bit being cleared. The bit is not set for broadcast packets as they never perform a lookup and therefore never incur a miss. A bit that is set for every flooded packet would also work for the current use case, but it does not allow us to differentiate between registered and unregistered multicast traffic, which might be useful in the future. To keep the performance impact to a minimum, the marking of packets is guarded by the 'tc_skb_ext_tc' static key. When 'false', the skb is not touched and an skb extension is not allocated. Instead, only a 5 bytes nop is executed, as demonstrated below for the call site in br_handle_frame(). Before the patch: ``` memset(skb->cb, 0, sizeof(struct br_input_skb_cb)); c37b09: 49 c7 44 24 28 00 00 movq $0x0,0x28(%r12) c37b10: 00 00 p = br_port_get_rcu(skb->dev); c37b12: 49 8b 44 24 10 mov 0x10(%r12),%rax memset(skb->cb, 0, sizeof(struct br_input_skb_cb)); c37b17: 49 c7 44 24 30 00 00 movq $0x0,0x30(%r12) c37b1e: 00 00 c37b20: 49 c7 44 24 38 00 00 movq $0x0,0x38(%r12) c37b27: 00 00 ``` After the patch (when static key is disabled): ``` memset(skb->cb, 0, sizeof(struct br_input_skb_cb)); c37c29: 49 c7 44 24 28 00 00 movq $0x0,0x28(%r12) c37c30: 00 00 c37c32: 49 8d 44 24 28 lea 0x28(%r12),%rax c37c37: 48 c7 40 08 00 00 00 movq $0x0,0x8(%rax) c37c3e: 00 c37c3f: 48 c7 40 10 00 00 00 movq $0x0,0x10(%rax) c37c46: 00 #ifdef CONFIG_HAVE_JUMP_LABEL_HACK static __always_inline bool arch_static_branch(struct static_key *key, bool branch) { asm_volatile_goto("1:" c37c47: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) br_tc_skb_miss_set(skb, false); p = br_port_get_rcu(skb->dev); c37c4c: 49 8b 44 24 10 mov 0x10(%r12),%rax ``` Subsequent patches will extend the flower classifier to be able to match on the new 'l2_miss' bit and enable / disable the static key when filters that match on it are added / deleted. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+10
Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/raw.c 3632679d9e4f ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol") c85be08fc4fa ("raw: Stop using RTO_ONLINK.") https://lore.kernel.org/all/20230525110037.2b532b83@canb.auug.org.au/ Adjacent changes: drivers/net/ethernet/freescale/fec_main.c 9025944fddfe ("net: fec: add dma_wmb to ensure correct descriptor values") 144470c88c5d ("net: fec: using the standard return codes when xdp xmit errors") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGESDavid Howells1-0/+3
Add a function to handle MSG_SPLICE_PAGES being passed internally to sendmsg(). Pages are spliced into the given socket buffer if possible and copied in if not (e.g. they're slab pages or have a zero refcount). Signed-off-by: David Howells <dhowells@redhat.com> cc: David Ahern <dsahern@kernel.org> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24net: Pass max frags into skb_append_pagefrags()David Howells1-1/+1
Pass the maximum number of fragments into skb_append_pagefrags() rather than using MAX_SKB_FRAGS so that it can be used from code that wants to specify sysctl_max_skb_frags. Signed-off-by: David Howells <dhowells@redhat.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-19tls: rx: strp: force mixed decrypted records into copy modeJakub Kicinski1-0/+10
If a record is partially decrypted we'll have to CoW it, anyway, so go into copy mode and allocate a writable skb right away. This will make subsequent fix simpler because we won't have to teach tls_strp_msg_make_copy() how to copy skbs while preserving decrypt status. Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-17Merge tag 'for-netdev' of ↵Jakub Kicinski1-1/+1
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-05-16 We've added 57 non-merge commits during the last 19 day(s) which contain a total of 63 files changed, 3293 insertions(+), 690 deletions(-). The main changes are: 1) Add precision propagation to verifier for subprogs and callbacks, from Andrii Nakryiko. 2) Improve BPF's {g,s}setsockopt() handling with wrong option lengths, from Stanislav Fomichev. 3) Utilize pahole v1.25 for the kernel's BTF generation to filter out inconsistent function prototypes, from Alan Maguire. 4) Various dyn-pointer verifier improvements to relax restrictions, from Daniel Rosenberg. 5) Add a new bpf_task_under_cgroup() kfunc for designated task, from Feng Zhou. 6) Unblock tests for arm64 BPF CI after ftrace supporting direct call, from Florent Revest. 7) Add XDP hint kfunc metadata for RX hash/timestamp for igc, from Jesper Dangaard Brouer. 8) Add several new dyn-pointer kfuncs to ease their usability, from Joanne Koong. 9) Add in-depth LRU internals description and dot function graph, from Joe Stringer. 10) Fix KCSAN report on bpf_lru_list when accessing node->ref, from Martin KaFai Lau. 11) Only dump unprivileged_bpf_disabled log warning upon write, from Kui-Feng Lee. 12) Extend test_progs to directly passing allow/denylist file, from Stephen Veiss. 13) Fix BPF trampoline memleak upon failure attaching to fentry, from Yafang Shao. 14) Fix emitting struct bpf_tcp_sock type in vmlinux BTF, from Yonghong Song. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (57 commits) bpf: Fix memleak due to fentry attach failure bpf: Remove bpf trampoline selector bpf, arm64: Support struct arguments in the BPF trampoline bpftool: JIT limited misreported as negative value on aarch64 bpf: fix calculation of subseq_idx during precision backtracking bpf: Remove anonymous union in bpf_kfunc_call_arg_meta bpf: Document EFAULT changes for sockopt selftests/bpf: Correctly handle optlen > 4096 selftests/bpf: Update EFAULT {g,s}etsockopt selftests bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen libbpf: fix offsetof() and container_of() to work with CO-RE bpf: Address KCSAN report on bpf_lru_list bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to pahole flags for v1.25 selftests/bpf: Accept mem from dynptr in helper funcs bpf: verifier: Accept dynptr mem as mem in helpers selftests/bpf: Check overflow in optional buffer selftests/bpf: Test allowing NULL buffer in dynptr slice bpf: Allow NULL buffers in bpf_dynptr_slice(_rw) selftests/bpf: Add testcase for bpf_task_under_cgroup bpf: Add bpf_task_under_cgroup() kfunc ... ==================== Link: https://lore.kernel.org/r/20230515225603.27027-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-16net: skbuff: update comment about pfmemalloc propagatingYunsheng Lin1-5/+5
__skb_fill_page_desc_noacc() is not doing any pfmemalloc propagating, and yet it has a comment about that, commit 84ce071e38a6 ("net: introduce __skb_fill_page_desc_noacc") may have accidentally moved it to __skb_fill_page_desc_noacc(), so move it back to __skb_fill_page_desc() which is supposed to be doing pfmemalloc propagating. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20230515050107.46397-1-linyunsheng@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-13net: remove __skb_frag_set_page()Yunsheng Lin1-12/+0
The remaining users calling __skb_frag_set_page() with page being NULL seems to be doing defensive programming, as shinfo->nr_frags is already decremented, so remove them. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13net: introduce and use skb_frag_fill_page_desc()Yunsheng Lin1-17/+10
Most users use __skb_frag_set_page()/skb_frag_off_set()/ skb_frag_size_set() to fill the page desc for a skb frag. Introduce skb_frag_fill_page_desc() to do that. net/bpf/test_run.c does not call skb_frag_off_set() to set the offset, "copy_from_user(page_address(page), ...)" and 'shinfo' being part of the 'data' kzalloced in bpf_test_init() suggest that it is assuming offset to be initialized as zero, so call skb_frag_fill_page_desc() with offset being zero for this case. Also, skb_frag_set_page() is not used anymore, so remove it. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-07bpf: Allow NULL buffers in bpf_dynptr_slice(_rw)Daniel Rosenberg1-1/+1
bpf_dynptr_slice(_rw) uses a user provided buffer if it can not provide a pointer to a block of contiguous memory. This buffer is unused in the case of local dynptrs, and may be unused in other cases as well. There is no need to require the buffer, as the kfunc can just return NULL if it was needed and not provided. This adds another kfunc annotation, __opt, which combines with __sz and __szk to allow the buffer associated with the size to be NULL. If the buffer is NULL, the verifier does not check that the buffer is of sufficient size. Signed-off-by: Daniel Rosenberg <drosen@google.com> Link: https://lore.kernel.org/r/20230506013134.2492210-2-drosen@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-22Merge tag 'for-netdev' of ↵Jakub Kicinski1-0/+9
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-04-21 We've added 71 non-merge commits during the last 8 day(s) which contain a total of 116 files changed, 13397 insertions(+), 8896 deletions(-). The main changes are: 1) Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward, from Florian Westphal. 2) Fix race between btf_put and btf_idr walk which caused a deadlock, from Alexei Starovoitov. 3) Second big batch to migrate test_verifier unit tests into test_progs for ease of readability and debugging, from Eduard Zingerman. 4) Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree, from Dave Marchevsky. 5) Migrate bpf_for(), bpf_for_each() and bpf_repeat() macros from BPF selftests into libbpf-provided bpf_helpers.h header and improve kfunc handling, from Andrii Nakryiko. 6) Support 64-bit pointers to kfuncs needed for archs like s390x, from Ilya Leoshkevich. 7) Support BPF progs under getsockopt with a NULL optval, from Stanislav Fomichev. 8) Improve verifier u32 scalar equality checking in order to enable LLVM transformations which earlier had to be disabled specifically for BPF backend, from Yonghong Song. 9) Extend bpftool's struct_ops object loading to support links, from Kui-Feng Lee. 10) Add xsk selftest follow-up fixes for hugepage allocated umem, from Magnus Karlsson. 11) Support BPF redirects from tc BPF to ifb devices, from Daniel Borkmann. 12) Add BPF support for integer type when accessing variable length arrays, from Feng Zhou. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (71 commits) selftests/bpf: verifier/value_ptr_arith converted to inline assembly selftests/bpf: verifier/value_illegal_alu converted to inline assembly selftests/bpf: verifier/unpriv converted to inline assembly selftests/bpf: verifier/subreg converted to inline assembly selftests/bpf: verifier/spin_lock converted to inline assembly selftests/bpf: verifier/sock converted to inline assembly selftests/bpf: verifier/search_pruning converted to inline assembly selftests/bpf: verifier/runtime_jit converted to inline assembly selftests/bpf: verifier/regalloc converted to inline assembly selftests/bpf: verifier/ref_tracking converted to inline assembly selftests/bpf: verifier/map_ptr_mixing converted to inline assembly selftests/bpf: verifier/map_in_map converted to inline assembly selftests/bpf: verifier/lwt converted to inline assembly selftests/bpf: verifier/loops1 converted to inline assembly selftests/bpf: verifier/jeq_infer_not_null converted to inline assembly selftests/bpf: verifier/direct_packet_access converted to inline assembly selftests/bpf: verifier/d_path converted to inline assembly selftests/bpf: verifier/ctx converted to inline assembly selftests/bpf: verifier/btf_ctx_access converted to inline assembly selftests/bpf: verifier/bpf_get_stack converted to inline assembly ... ==================== Link: https://lore.kernel.org/r/20230421211035.9111-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21net: move dropreason.h to dropreason-core.hJohannes Berg1-1/+1
This will, after the next patch, hold only the core drop reasons and minimal infrastructure. Fix a small kernel-doc issue while at it, to avoid the move triggering a checker. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21net: skbuff: update and rename __kfree_skb_defer()Jakub Kicinski1-1/+1
__kfree_skb_defer() uses the old naming where "defer" meant slab bulk free/alloc APIs. In the meantime we also made __kfree_skb_defer() feed the per-NAPI skb cache, which implies bulk APIs. So take away the 'defer' and add 'napi'. While at it add a drop reason. This only matters on the tx_action path, if the skb has a frag_list. But getting rid of a SKB_DROP_REASON_NOT_SPECIFIED seems like a net benefit so why not. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230420020005.815854-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+3
Adjacent changes: net/mptcp/protocol.h 63740448a32e ("mptcp: fix accept vs worker race") 2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close") ddb1a072f858 ("mptcp: move first subflow allocation at mpc access time") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19net: skbuff: hide nf_trace and ipvs_propertyJakub Kicinski1-0/+4
Accesses to nf_trace and ipvs_property are already wrapped by ifdefs where necessary. Don't allocate the bits for those fields at all if possible. Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net: skbuff: push nf_trace down the bitfieldJakub Kicinski1-3/+3
nf_trace is a debug feature, AFAIU, and yet it sits oddly high in the sk_buff bitfield. Move it down, pushing up dst_pending_confirm and inner_protocol_type. Next change will make nf_trace optional (under Kconfig) and all optional fields should be placed after 2b fields to avoid 2b fields straddling bytes. dst_pending_confirm is L3, so it makes sense next to ignore_df. inner_protocol_type goes up just to keep the balance. Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net: skbuff: move alloc_cpu into a potential holeJakub Kicinski1-1/+2
alloc_cpu is currently between 4 byte fields, so it's almost guaranteed to create a 2B hole. It has a knock on effect of creating a 4B hole after @end (and @end and @tail being in different cachelines). None of this matters hugely, but for kernel configs which don't enable all the features there may well be a 2B hole after the bitfield. Move alloc_cpu there. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net: skbuff: hide csum_not_inet when CONFIG_IP_SCTP not setJakub Kicinski1-0/+14
SCTP is not universally deployed, allow hiding its bit from the skb. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net: skbuff: hide wifi_acked when CONFIG_WIRELESS not setJakub Kicinski1-0/+11
Datacenter kernel builds will very likely not include WIRELESS, so let them shave 2 bits off the skb by hiding the wifi fields. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-17bpf: Set skb redirect and from_ingress info in __bpf_tx_skbDaniel Borkmann1-0/+9
There are some use-cases where it is desirable to use bpf_redirect() in combination with ifb device, which currently is not supported, for example, around filtering inbound traffic with BPF to then push it to ifb which holds the qdisc for shaping in contrast to doing that on the egress device. Toke mentions the following case related to OpenWrt: Because there's not always a single egress on the other side. These are mainly home routers, which tend to have one or more WiFi devices bridged to one or more ethernet ports on the LAN side, and a single upstream WAN port. And the objective is to control the total amount of traffic going over the WAN link (in both directions), to deal with bufferbloat in the ISP network (which is sadly still all too prevalent). In this setup, the traffic can be split arbitrarily between the links on the LAN side, and the only "single bottleneck" is the WAN link. So we install both egress and ingress shapers on this, configured to something like 95-98% of the true link bandwidth, thus moving the queues into the qdisc layer in the router. It's usually necessary to set the ingress bandwidth shaper a bit lower than the egress due to being "downstream" of the bottleneck link, but it does work surprisingly well. We usually use something like a matchall filter to put all ingress traffic on the ifb, so doing the redirect from BPF has not been an immediate requirement thus far. However, it does seem a bit odd that this is not possible, and we do have a BPF-based filter that layers on top of this kind of setup, which currently uses u32 as the ingress filter and so it could presumably be improved to use BPF instead if that was available. Reported-by: Toke Høiland-Jørgensen <toke@redhat.com> Reported-by: Yafang Shao <laoar.shao@gmail.com> Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://git.openwrt.org/?p=project/qosify.git;a=blob;f=README Link: https://lore.kernel.org/bpf/875y9yzbuy.fsf@toke.dk Link: https://lore.kernel.org/r/8cebc8b2b6e967e10cbafe2ffd6795050e74accd.1681739137.git.daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-17netfilter: nf_tables: fix ifdef to also consider nf_tables=mFlorian Westphal1-2/+2
nftables can be built as a module, so fix the preprocessor conditional accordingly. Fixes: 478b360a47b7 ("netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=n") Reported-by: Florian Fainelli <f.fainelli@gmail.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-15page_pool: allow caching from safely localized NAPIJakub Kicinski1-7/+13
Recent patches to mlx5 mentioned a regression when moving from driver local page pool to only using the generic page pool code. Page pool has two recycling paths (1) direct one, which runs in safe NAPI context (basically consumer context, so producing can be lockless); and (2) via a ptr_ring, which takes a spin lock because the freeing can happen from any CPU; producer and consumer may run concurrently. Since the page pool code was added, Eric introduced a revised version of deferred skb freeing. TCP skbs are now usually returned to the CPU which allocated them, and freed in softirq context. This places the freeing (producing of pages back to the pool) enticingly close to the allocation (consumer). If we can prove that we're freeing in the same softirq context in which the consumer NAPI will run - lockless use of the cache is perfectly fine, no need for the lock. Let drivers link the page pool to a NAPI instance. If the NAPI instance is scheduled on the same CPU on which we're freeing - place the pages in the direct cache. With that and patched bnxt (XDP enabled to engage the page pool, sigh, bnxt really needs page pool work :() I see a 2.6% perf boost with a TCP stream test (app on a different physical core than softirq). The CPU use of relevant functions decreases as expected: page_pool_refill_alloc_cache 1.17% -> 0% _raw_spin_lock 2.41% -> 0.98% Only consider lockless path to be safe when NAPI is scheduled - in practice this should cover majority if not all of steady state workloads. It's usually the NAPI kicking in that causes the skb flush. The main case we'll miss out on is when application runs on the same CPU as NAPI. In that case we don't use the deferred skb free path. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-14Daniel Borkmann says:Jakub Kicinski1-20/+20
==================== pull-request: bpf-next 2023-04-13 We've added 260 non-merge commits during the last 36 day(s) which contain a total of 356 files changed, 21786 insertions(+), 11275 deletions(-). The main changes are: 1) Rework BPF verifier log behavior and implement it as a rotating log by default with the option to retain old-style fixed log behavior, from Andrii Nakryiko. 2) Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params, from Christian Ehrig. 3) Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton, from Alexei Starovoitov. 4) Optimize hashmap lookups when key size is multiple of 4, from Anton Protopopov. 5) Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps, from David Vernet. 6) Add support for stashing local BPF kptr into a map value via bpf_kptr_xchg(). This is useful e.g. for rbtree node creation for new cgroups, from Dave Marchevsky. 7) Fix BTF handling of is_int_ptr to skip modifiers to work around tracing issues where a program cannot be attached, from Feng Zhou. 8) Migrate a big portion of test_verifier unit tests over to test_progs -a verifier_* via inline asm to ease {read,debug}ability, from Eduard Zingerman. 9) Several updates to the instruction-set.rst documentation which is subject to future IETF standardization (https://lwn.net/Articles/926882/), from Dave Thaler. 10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register known bits information propagation, from Daniel Borkmann. 11) Add skb bitfield compaction work related to BPF with the overall goal to make more of the sk_buff bits optional, from Jakub Kicinski. 12) BPF selftest cleanups for build id extraction which stand on its own from the upcoming integration work of build id into struct file object, from Jiri Olsa. 13) Add fixes and optimizations for xsk descriptor validation and several selftest improvements for xsk sockets, from Kal Conley. 14) Add BPF links for struct_ops and enable switching implementations of BPF TCP cong-ctls under a given name by replacing backing struct_ops map, from Kui-Feng Lee. 15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable offset stack read as earlier Spectre checks cover this, from Luis Gerhorst. 16) Fix issues in copy_from_user_nofault() for BPF and other tracers to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner and Alexei Starovoitov. 17) Add --json-summary option to test_progs in order for CI tooling to ease parsing of test results, from Manu Bretelle. 18) Batch of improvements and refactoring to prep for upcoming bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator, from Martin KaFai Lau. 19) Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations, from Quentin Monnet. 20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting the module name from BTF of the target and searching kallsyms of the correct module, from Viktor Malik. 21) Improve BPF verifier handling of '<const> <cond> <non_const>' to better detect whether in particular jmp32 branches are taken, from Yonghong Song. 22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock. A built-in cc or one from a kernel module is already able to write to app_limited, from Yixin Shen. Conflicts: Documentation/bpf/bpf_devel_QA.rst b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info") 0f10f647f455 ("bpf, docs: Use internal linking for link to netdev subsystem doc") https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/ include/net/ip_tunnels.h bc9d003dc48c3 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts") ac931d4cdec3d ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices") https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/ net/bpf/test_run.c e5995bc7e2ba ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption") 294635a8165a ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES") https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/ ==================== Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-06netfilter: br_netfilter: fix recent physdev match breakageFlorian Westphal1-0/+1
Recent attempt to ensure PREROUTING hook is executed again when a decrypted ipsec packet received on a bridge passes through the network stack a second time broke the physdev match in INPUT hook. We can't discard the nf_bridge info strct from sabotage_in hook, as this is needed by the physdev match. Keep the struct around and handle this with another conditional instead. Fixes: 2b272bb558f1 ("netfilter: br_netfilter: disable sabotage_in hook after first suppression") Reported-and-tested-by: Farid BENAMROUCHE <fariouche@yahoo.fr> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-28net: introduce a config option to tweak MAX_SKB_FRAGSEric Dumazet1-11/+5
Currently, MAX_SKB_FRAGS value is 17. For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg() attempts order-3 allocations, stuffing 32768 bytes per frag. But with zero copy, we use order-0 pages. For BIG TCP to show its full potential, we add a config option to be able to fit up to 45 segments per skb. This is also needed for BIG TCP rx zerocopy, as zerocopy currently does not support skbs with frag list. We have used MAX_SKB_FRAGS=45 value for years at Google before we deployed 4K MTU, with no adverse effect, other than a recent issue in mlx4, fixed in commit 26782aad00cc ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS") Back then, goal was to be able to receive full size (64KB) GRO packets without the frag_list overhead. Note that /proc/sys/net/core/max_skb_frags can also be used to limit the number of fragments TCP can use in tx packets. By default we keep the old/legacy value of 17 until we get more coverage for the updated values. Sizes of struct skb_shared_info on 64bit arches MAX_SKB_FRAGS | sizeof(struct skb_shared_info): ============================================== 17 320 21 320+64 = 384 25 320+128 = 448 29 320+192 = 512 33 320+256 = 576 37 320+320 = 640 41 320+384 = 704 45 320+448 = 768 This inflation might cause problems for drivers assuming they could pack both the incoming packet (for MTU=1500) and skb_shared_info in half a page, using build_skb(). v3: fix build error when CONFIG_NET=n v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long" Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20230323162842.1935061-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-21net: skbuff: move the fields BPF cares about directly next to the offset markerJakub Kicinski1-9/+9
To avoid more possible BPF dependencies with moving bitfields around keep the fields BPF cares about right next to the offset marker. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230321014115.997841-4-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-21net: skbuff: reorder bytes 2 and 3 of the bitfieldJakub Kicinski1-10/+10
BPF needs to know the offsets of fields it tries to access. Zero-length fields are added to make offsetof() work. This unfortunately partitions the bitfield (fields across the zero-length members can't be coalesced). Reorder bytes 2 and 3, BPF needs to know the offset of fields previously in byte 3 and some fields in byte 2 should really be optional. The two bytes are always in the same cacheline so it should not matter. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230321014115.997841-3-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-21net: skbuff: rename __pkt_vlan_present_offset to __mono_tc_offsetJakub Kicinski1-2/+2
vlan_present is gone since commit 354259fa73e2 ("net: remove skb->vlan_present") rename the offset field to what BPF is currently looking for in this byte - mono_delivery_time and tc_at_ingress. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230321014115.997841-2-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-15net: page_pool, skbuff: make skb_mark_for_recycle() always availableAlexander Lobakin1-2/+2
skb_mark_for_recycle() is guarded with CONFIG_PAGE_POOL, this creates unneeded complication when using it in the generic code. For now, it's only used in the drivers always selecting Page Pool, so this works. Move the guards so that preprocessor will cut out only the operation itself and the function will still be a noop on !PAGE_POOL systems, but available there as well. No functional changes. Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202303020342.Wi2PRFFH-lkp@intel.com Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-3-aleksander.lobakin@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08net: reclaim skb->scm_io_uring bitEric Dumazet1-2/+0
Commit 0091bfc81741 ("io_uring/af_unix: defer registered files gc to io_uring release") added one bit to struct sk_buff. This structure is critical for networking, and we try very hard to not add bloat on it, unless absolutely required. For instance, we can use a specific destructor as a wrapper around unix_destruct_scm(), to identify skbs that unix_gc() has to special case. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>