summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2024-04-04af_unix: Remove lock dance in unix_peek_fds().Kuniyuki Iwashima1-1/+0
In the previous GC implementation, the shape of the inflight socket graph was not expected to change while GC was in progress. MSG_PEEK was tricky because it could install inflight fd silently and transform the graph. Let's say we peeked a fd, which was a listening socket, and accept()ed some embryo sockets from it. The garbage collection algorithm would have been confused because the set of sockets visited in scan_inflight() would change within the same GC invocation. That's why we placed spin_lock(&unix_gc_lock) and spin_unlock() in unix_peek_fds() with a fat comment. In the new GC implementation, we no longer garbage-collect the socket if it exists in another queue, that is, if it has a bridge to another SCC. Also, accept() will require the lock if it has edges. Thus, we need not do the complicated lock dance. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240401173125.92184-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04trace: tcp: fully support trace_tcp_send_resetJason Xing1-2/+38
Prior to this patch, what we can see by enabling trace_tcp_send is only happening under two circumstances: 1) active rst mode 2) non-active rst mode and based on the full socket That means the inconsistency occurs if we use tcpdump and trace simultaneously to see how rst happens. It's necessary that we should take into other cases into considerations, say: 1) time-wait socket 2) no socket ... By parsing the incoming skb and reversing its 4-tuple can we know the exact 'flow' which might not exist. Samples after applied this patch: 1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port state=TCP_ESTABLISHED 2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port state=UNKNOWN Note: 1) UNKNOWN means we cannot extract the right information from skb. 2) skbaddr/skaddr could be 0 Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://lore.kernel.org/r/20240401073605.37335-3-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04trace: adjust TP_STORE_ADDR_PORTS_SKB() parametersJason Xing3-11/+13
Introducing entry_saddr and entry_daddr parameters in this macro for later use can help us record the reverse 4-tuple by analyzing the 4-tuple of the incoming skb when receiving. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240401073605.37335-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03netdevice: add DEFINE_FREE() for dev_putJohannes Berg1-0/+2
For short netdev holds within a function there are still a lot of users of dev_put() rather than netdev_put(). Add DEFINE_FREE() to allow making those safer. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-03rtnetlink: add guard for RTNLJohannes Berg1-0/+3
The new guard/scoped_gard can be useful for the RTNL as well, so add a guard definition for it. It gets used like { guard(rtnl)(); // RTNL held until end of block } or scoped_guard(rtnl) { // RTNL held in this block } as with any other guard/scoped_guard. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-03uapi: team: use header file generated from YAML specHangbin Liu1-73/+43
generated with: $ ./tools/net/ynl/ynl-gen-c.py --mode uapi \ > --spec Documentation/netlink/specs/team.yaml \ > --header -o include/uapi/linux/if_team.h Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240401031004.1159713-5-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03tcp/dccp: complete lockless accesses to sk->sk_max_ack_backlogJason Xing1-1/+1
Since commit 099ecf59f05b ("net: annotate lockless accesses to sk->sk_max_ack_backlog") decided to handle the sk_max_ack_backlog locklessly, there is one more function mostly called in TCP/DCCP cases. So this patch completes it:) Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240331090521.71965-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03page_pool: check for PP direct cache locality laterAlexander Lobakin1-6/+6
Since we have pool->p.napi (Jakub) and pool->cpuid (Lorenzo) to check whether it's safe to use direct recycling, we can use both globally for each page instead of relying solely on @allow_direct argument. Let's assume that @allow_direct means "I'm sure it's local, don't waste time rechecking this" and when it's false, try the mentioned params to still recycle the page directly. If neither is true, we'll lose some CPU cycles, but then it surely won't be hotpath. On the other hand, paths where it's possible to use direct cache, but not possible to safely set @allow_direct, will benefit from this move. The whole propagation of @napi_safe through a dozen of skb freeing functions can now go away, which saves us some stack space. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20240329165507.3240110-2-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03rhashtable: Improve grammarJonathan Neuschäfer1-5/+5
Change "a" to "an" according to the usual rules, fix an "if" that was mistyped as "in", improve grammar in "considerable slow" -> "considerably slower". Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240329-misc-rhashtable-v1-1-5862383ff798@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02genetlink: remove linux/genetlink.hJakub Kicinski3-16/+10
genetlink.h is a shell of what used to be a combined uAPI and kernel header over a decade ago. It has fewer than 10 lines of code. Merge it into net/genetlink.h. In some ways it'd be better to keep the combined header under linux/ but it would make looking through git history harder. Acked-by: Sven Eckelmann <sven@narfation.org> Link: https://lore.kernel.org/r/20240329175710.291749-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02netlink: create a new header for internal genetlink symbolsJakub Kicinski1-5/+0
There are things in linux/genetlink.h which are only used under net/netlink/. Move them to a new local header. A new header with just 2 externs isn't great, but alternative would be to include af_netlink.h in genetlink.c which feels even worse. Link: https://lore.kernel.org/r/20240329175710.291749-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02tcp/dccp: do not care about families in inet_twsk_purge()Eric Dumazet2-2/+2
We lost ability to unload ipv6 module a long time ago. Instead of calling expensive inet_twsk_purge() twice, we can handle all families in one round. Also remove an extra line added in my prior patch, per Kuniyuki Iwashima feedback. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/20240327192934.6843-1-kuniyu@amazon.com/ Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240329153203.345203-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02inet: preserve const qualifier in inet_csk()Eric Dumazet4-7/+4
We can change inet_csk() to propagate its argument const qualifier, thanks to container_of_const(). We have to fix few places that had mistakes, like tcp_bound_rto(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240329144931.295800-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-01net: rps: move received_rps field to a better locationEric Dumazet1-1/+1
Commit 14d898f3c1b3 ("dev: Move received_rps counter next to RPS members in softnet data") was unfortunate: received_rps is dirtied by a cpu and never read by other cpus in fast path. Its presence in the hot RPS cache line (shared by many cpus) is hurting RPS/RFS performance. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: rps: add rps_input_queue_head_add() helperEric Dumazet1-2/+7
process_backlog() can batch increments of sd->input_queue_head, saving some memory bandwidth. Also add READ_ONCE()/WRITE_ONCE() annotations around sd->input_queue_head accesses. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: rps: change input_queue_tail_incr_save()Eric Dumazet2-15/+23
input_queue_tail_incr_save() is incrementing the sd queue_tail and save it in the flow last_qtail. Two issues here : - no lock protects the write on last_qtail, we should use appropriate annotations. - We can perform this write after releasing the per-cpu backlog lock, to decrease this lock hold duration (move away the cache line miss) Also move input_queue_head_incr() and rps helpers to include/net/rps.h, while adding rps_ prefix to better reflect their role. v2: Fixed a build issue (Jakub and kernel build bots) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: make softnet_data.dropped an atomic_tEric Dumazet1-1/+2
If under extreme cpu backlog pressure enqueue_to_backlog() has to drop a packet, it could do this without dirtying a cache line and potentially slowing down the target cpu. Move sd->dropped into a separate cache line, and make it atomic. In non pressure mode, this field is not touched, no need to consume valuable space in a hot cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: move dev_xmit_recursion() helpers to net/core/dev.hEric Dumazet1-17/+0
Move dev_xmit_recursion() and friends to net/core/dev.h They are only used from net/core/dev.c and net/core/filter.c. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: move kick_defer_list_purge() to net/core/dev.hEric Dumazet1-1/+0
kick_defer_list_purge() is defined in net/core/dev.c and used from net/core/skubff.c Because we need softnet_data, include <linux/netdevice.h> from net/core/dev.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01pfcp: always set pfcp metadataMichal Swiatkowski4-0/+93
In PFCP receive path set metadata needed by flower code to do correct classification based on this metadata. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01pfcp: add PFCP moduleWojciech Drewek1-0/+17
Packet Forwarding Control Protocol (PFCP) is a 3GPP Protocol used between the control plane and the user plane function. It is specified in TS 29.244[1]. Note that this module is not designed to support this Protocol in the kernel space. There is no support for parsing any PFCP messages. There is no API that could be used by any userspace daemon. Basically it does not support PFCP. This protocol is sophisticated and there is no need for implementing it in the kernel. The purpose of this module is to allow users to setup software and hardware offload of PFCP packets using tc tool. When user requests to create a PFCP device, a new socket is created. The socket is set up with port number 8805 which is specific for PFCP [29.244 4.2.2]. This allow to receive PFCP request messages, response messages use other ports. Note that only one PFCP netdev can be created. Only IPv4 is supported at this time. [1] https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=3111 Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01ip_tunnel: convert __be16 tunnel flags to bitmapsAlexander Lobakin7-65/+177
Historically, tunnel flags like TUNNEL_CSUM or TUNNEL_ERSPAN_OPT have been defined as __be16. Now all of those 16 bits are occupied and there's no more free space for new flags. It can't be simply switched to a bigger container with no adjustments to the values, since it's an explicit Endian storage, and on LE systems (__be16)0x0001 equals to (__be64)0x0001000000000000. We could probably define new 64-bit flags depending on the Endianness, i.e. (__be64)0x0001 on BE and (__be64)0x00010000... on LE, but that would introduce an Endianness dependency and spawn a ton of Sparse warnings. To mitigate them, all of those places which were adjusted with this change would be touched anyway, so why not define stuff properly if there's no choice. Define IP_TUNNEL_*_BIT counterparts as a bit number instead of the value already coded and a fistful of <16 <-> bitmap> converters and helpers. The two flags which have a different bit position are SIT_ISATAP_BIT and VTI_ISVTI_BIT, as they were defined not as __cpu_to_be16(), but as (__force __be16), i.e. had different positions on LE and BE. Now they both have strongly defined places. Change all __be16 fields which were used to store those flags, to IP_TUNNEL_DECLARE_FLAGS() -> DECLARE_BITMAP(__IP_TUNNEL_FLAG_NUM) -> unsigned long[1] for now, and replace all TUNNEL_* occurrences to their bitmap counterparts. Use the converters in the places which talk to the userspace, hardware (NFP) or other hosts (GRE header). The rest must explicitly use the new flags only. This must be done at once, otherwise there will be too many conversions throughout the code in the intermediate commits. Finally, disable the old __be16 flags for use in the kernel code (except for the two 'irregular' flags mentioned above), to prevent any accidental (mis)use of them. For the userspace, nothing is changed, only additions were made. Most noticeable bloat-o-meter difference (.text): vmlinux: 307/-1 (306) gre.ko: 62/0 (62) ip_gre.ko: 941/-217 (724) [*] ip_tunnel.ko: 390/-900 (-510) [**] ip_vti.ko: 138/0 (138) ip6_gre.ko: 534/-18 (516) [*] ip6_tunnel.ko: 118/-10 (108) [*] gre_flags_to_tnl_flags() grew, but still is inlined [**] ip_tunnel_find() got uninlined, hence such decrease The average code size increase in non-extreme case is 100-200 bytes per module, mostly due to sizeof(long) > sizeof(__be16), as %__IP_TUNNEL_FLAG_NUM is less than %BITS_PER_LONG and the compilers are able to expand the majority of bitmap_*() calls here into direct operations on scalars. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01ip_tunnel: use a separate struct to store tunnel params in the kernelAlexander Lobakin2-8/+24
Unlike IPv6 tunnels which use purely-kernel __ip6_tnl_parm structure to store params inside the kernel, IPv4 tunnel code uses the same ip_tunnel_parm which is being used to talk with the userspace. This makes it difficult to alter or add any fields or use a different format for whatever data. Define struct ip_tunnel_parm_kern, a 1:1 copy of ip_tunnel_parm for now, and use it throughout the code. Define the pieces, where the copy user <-> kernel happens, as standalone functions, and copy the data there field-by-field, so that the kernel-side structure could be easily modified later on and the users wouldn't have to care about this. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01bitmap: make bitmap_{get,set}_value8() use bitmap_{read,write}()Alexander Lobakin1-33/+5
Now that we have generic bitmap_read() and bitmap_write(), which are inline and try to take care of non-bound-crossing and aligned cases to keep them optimized, collapse bitmap_{get,set}_value8() into simple wrappers around the former ones. bloat-o-meter shows no difference in vmlinux and -2 bytes for gpio-pca953x.ko, which says the optimization didn't suffer due to that change. The converted helpers have the value width embedded and always compile-time constant and that helps a lot. Suggested-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01bitmap: introduce generic optimized bitmap_size()Alexander Lobakin2-4/+6
The number of times yet another open coded `BITS_TO_LONGS(nbits) * sizeof(long)` can be spotted is huge. Some generic helper is long overdue. Add one, bitmap_size(), but with one detail. BITS_TO_LONGS() uses DIV_ROUND_UP(). The latter works well when both divident and divisor are compile-time constants or when the divisor is not a pow-of-2. When it is however, the compilers sometimes tend to generate suboptimal code (GCC 13): 48 83 c0 3f add $0x3f,%rax 48 c1 e8 06 shr $0x6,%rax 48 8d 14 c5 00 00 00 00 lea 0x0(,%rax,8),%rdx %BITS_PER_LONG is always a pow-2 (either 32 or 64), but GCC still does full division of `nbits + 63` by it and then multiplication by 8. Instead of BITS_TO_LONGS(), use ALIGN() and then divide by 8. GCC: 8d 50 3f lea 0x3f(%rax),%edx c1 ea 03 shr $0x3,%edx 81 e2 f8 ff ff 1f and $0x1ffffff8,%edx Now it shifts `nbits + 63` by 3 positions (IOW performs fast division by 8) and then masks bits[2:0]. bloat-o-meter: add/remove: 0/0 grow/shrink: 20/133 up/down: 156/-773 (-617) Clang does it better and generates the same code before/after starting from -O1, except that with the ALIGN() approach it uses %edx and thus still saves some bytes: add/remove: 0/0 grow/shrink: 9/133 up/down: 18/-538 (-520) Note that we can't expand DIV_ROUND_UP() by adding a check and using this approach there, as it's used in array declarations where expressions are not allowed. Add this helper to tools/ as well. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01linkmode: convert linkmode_{test,set,clear,mod}_bit() to macrosAlexander Lobakin1-23/+4
Since commit b03fc1173c0c ("bitops: let optimize out non-atomic bitops on compile-time constants"), the non-atomic bitops are macros which can be expanded by the compilers into compile-time expressions, which will result in better optimized object code. Unfortunately, turned out that passing `volatile` to those macros discards any possibility of optimization, as the compilers then don't even try to look whether the passed bitmap is known at compilation time. In addition to that, the mentioned linkmode helpers are marked with `inline`, not `__always_inline`, meaning that it's not guaranteed some compiler won't uninline them for no reason, which will also effectively prevent them from being optimized (it's a well-known thing the compilers sometimes uninline `2 + 2`). Convert linkmode_*_bit() from inlines to macros. Their calling convention are 1:1 with the corresponding bitops, so that it's not even needed to enumerate and map the arguments, only the names. No changes in vmlinux' object code (compiled by LLVM for x86_64) whatsoever, but that doesn't necessarily means the change is meaningless. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01bitops: let the compiler optimize {__,}assign_bit()Alexander Lobakin1-16/+4
Since commit b03fc1173c0c ("bitops: let optimize out non-atomic bitops on compile-time constants"), the compilers are able to expand inline bitmap operations to compile-time initializers when possible. However, during the round of replacement if-__set-else-__clear with __assign_bit() as per Andy's advice, bloat-o-meter showed +1024 bytes difference in object code size for one module (even one function), where the pattern: DECLARE_BITMAP(foo) = { }; // on the stack, zeroed if (a) __set_bit(const_bit_num, foo); if (b) __set_bit(another_const_bit_num, foo); ... is heavily used, although there should be no difference: the bitmap is zeroed, so the second half of __assign_bit() should be compiled-out as a no-op. I either missed the fact that __assign_bit() has bitmap pointer marked as `volatile` (as we usually do for bitops) or was hoping that the compilers would at least try to look past the `volatile` for __always_inline functions. Anyhow, due to that attribute, the compilers were always compiling the whole expression and no mentioned compile-time optimizations were working. Convert __assign_bit() to a macro since it's a very simple if-else and all of the checks are performed inside __set_bit() and __clear_bit(), thus that wrapper has to be as transparent as possible. After that change, despite it showing only -20 bytes change for vmlinux (due to that it's still relatively unpopular), no drastic code size changes happen when replacing if-set-else-clear for onstack bitmaps with __assign_bit(), meaning the compiler now expands them to the actual operations will all the expected optimizations. Atomic assign_bit() is less affected due to its nature, but let's convert it to a macro as well to keep the code consistent and not leave a place for possible suboptimal codegen. Moreover, with certain kernel configuration it actually gives some saves (x86): do_ip_setsockopt 4154 4099 -55 Suggested-by: Yury Norov <yury.norov@gmail.com> # assign_bit(), too Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01bitops: make BYTES_TO_BITS() treewide-availableAlexander Lobakin1-0/+2
Avoid open-coding that simple expression each time by moving BYTES_TO_BITS() from the probes code to <linux/bitops.h> to export it to the rest of the kernel. Simplify the macro while at it. `BITS_PER_LONG / sizeof(long)` always equals to %BITS_PER_BYTE, regardless of the target architecture. Do the same for the tools ecosystem as well (incl. its version of bitops.h). The previous implementation had its implicit type of long, while the new one is int, so adjust the format literal accordingly in the perf code. Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01bitops: add missing prototype checkAlexander Lobakin1-0/+1
Commit 8238b4579866 ("wait_on_bit: add an acquire memory barrier") added a new bitop, test_bit_acquire(), with proper wrapping in order to try to optimize it at compile-time, but missed the list of bitops used for checking their prototypes a bit below. The functions added have consistent prototypes, so that no more changes are required and no functional changes take place. Fixes: 8238b4579866 ("wait_on_bit: add an acquire memory barrier") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01lib/bitmap: add bitmap_{read,write}()Syed Nayyar Waris1-0/+77
The two new functions allow reading/writing values of length up to BITS_PER_LONG bits at arbitrary position in the bitmap. The code was taken from "bitops: Introduce the for_each_set_clump macro" by Syed Nayyar Waris with a number of changes and simplifications: - instead of using roundup(), which adds an unnecessary dependency on <linux/math.h>, we calculate space as BITS_PER_LONG-offset; - indentation is reduced by not using else-clauses (suggested by checkpatch for bitmap_get_value()); - bitmap_get_value()/bitmap_set_value() are renamed to bitmap_read() and bitmap_write(); - some redundant computations are omitted. Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com> Signed-off-by: William Breathitt Gray <william.gray@linaro.org> Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/ Suggested-by: Yury Norov <yury.norov@gmail.com> Co-developed-by: Alexander Potapenko <glider@google.com> Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-30netlink: introduce type-checking attribute iterationJohannes Berg1-0/+27
There are, especially with multi-attr arrays, many cases of needing to iterate all attributes of a specific type in a netlink message or a nested attribute. Add specific macros to support that case. Also convert many instances using this spatch: @@ iterator nla_for_each_attr; iterator name nla_for_each_attr_type; identifier nla; expression head, len, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_attr(nla, head, len, rem) +nla_for_each_attr_type(nla, ATTR, head, len, rem) { <... T x; ...> -if (nla_type(nla) == ATTR) { ... -} } @@ identifier nla; iterator nla_for_each_nested; iterator name nla_for_each_nested_type; expression attr, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_nested(nla, attr, rem) +nla_for_each_nested_type(nla, ATTR, attr, rem) { <... T x; ...> -if (nla_type(nla) == ATTR) { ... -} } @@ iterator nla_for_each_attr; iterator name nla_for_each_attr_type; identifier nla; expression head, len, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_attr(nla, head, len, rem) +nla_for_each_attr_type(nla, ATTR, head, len, rem) { <... T x; ...> -if (nla_type(nla) != ATTR) continue; ... } @@ identifier nla; iterator nla_for_each_nested; iterator name nla_for_each_nested_type; expression attr, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_nested(nla, attr, rem) +nla_for_each_nested_type(nla, ATTR, attr, rem) { <... T x; ...> -if (nla_type(nla) != ATTR) continue; ... } Although I had to undo one bad change this made, and I also adjusted some other code for whitespace and to use direct variable initialization now. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20240328203144.b5a6c895fb80.I1869b44767379f204998ff44dd239803f39c23e0@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-30net: add sk_wake_async_rcu() helperEric Dumazet1-0/+6
While looking at UDP receive performance, I saw sk_wake_async() was no longer inlined. This matters at least on AMD Zen1-4 platforms (see SRSO) This might be because rcu_read_lock() and rcu_read_unlock() are no longer nops in recent kernels ? Add sk_wake_async_rcu() variant, which must be called from contexts already holding rcu lock. As SOCK_FASYNC is deprecated in modern days, use unlikely() to give a hint to the compiler. sk_wake_async_rcu() is properly inlined from __udp_enqueue_schedule_skb() and sock_def_readable(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240328144032.1864988-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skbBalazs Scheidler1-5/+24
The udp_fail_queue_rcv_skb() tracepoint lacks any details on the source and destination IP/port whereas this information can be critical in case of UDP/syslog. Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/0c8b3e33dbf679e190be6f4c6736603a76988a20.1711475011.git.balazs.scheidler@axoflow.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29net: port TP_STORE_ADDR_PORTS_SKB macro to be tcp/udp independentBalazs Scheidler2-43/+42
This patch moves TP_STORE_ADDR_PORTS_SKB() to a common header and removes the TCP specific implementation details. Previously the macro assumed the skb passed as an argument is a TCP packet, the implementation now uses an argument to the L4 header and uses that to extract the source/destination ports, which happen to be named the same in "struct tcphdr" and "struct udphdr" Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com> Link: https://lore.kernel.org/r/9e306f78260dfbbdc7353ba5f864cc027a409540.1711475011.git.balazs.scheidler@axoflow.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29af_unix: Replace garbage collection algorithm.Kuniyuki Iwashima1-8/+0
If we find a dead SCC during iteration, we call unix_collect_skb() to splice all skb in the SCC to the global sk_buff_head, hitlist. After iterating all SCC, we unlock unix_gc_lock and purge the queue. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-15-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29af_unix: Assign a unique index to SCC.Kuniyuki Iwashima1-1/+1
The definition of the lowlink in Tarjan's algorithm is the smallest index of a vertex that is reachable with at most one back-edge in SCC. This is not useful for a cross-edge. If we start traversing from A in the following graph, the final lowlink of D is 3. The cross-edge here is one between D and C. A -> B -> D D = (4, 3) (index, lowlink) ^ | | C = (3, 1) | V | B = (2, 1) `--- C <--' A = (1, 1) This is because the lowlink of D is updated with the index of C. In the following patch, we detect a dead SCC by checking two conditions for each vertex. 1) vertex has no edge directed to another SCC (no bridge) 2) vertex's out_degree is the same as the refcount of its file If 1) is false, there is a receiver of all fds of the SCC and its ancestor SCC. To evaluate 1), we need to assign a unique index to each SCC and assign it to all vertices in the SCC. This patch changes the lowlink update logic for cross-edge so that in the example above, the lowlink of D is updated with the lowlink of C. A -> B -> D D = (4, 1) (index, lowlink) ^ | | C = (3, 1) | V | B = (2, 1) `--- C <--' A = (1, 1) Then, all vertices in the same SCC have the same lowlink, and we can quickly find the bridge connecting to different SCC if exists. However, it is no longer called lowlink, so we rename it to scc_index. (It's sometimes called lowpoint.) Also, we add a global variable to hold the last index used in DFS so that we do not reset the initial index in each DFS. This patch can be squashed to the SCC detection patch but is split deliberately for anyone wondering why lowlink is not used as used in the original Tarjan's algorithm and many reference implementations. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-13-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29af_unix: Save O(n) setup of Tarjan's algo.Kuniyuki Iwashima1-1/+0
Before starting Tarjan's algorithm, we need to mark all vertices as unvisited. We can save this O(n) setup by reserving two special indices (0, 1) and using two variables. The first time we link a vertex to unix_unvisited_vertices, we set unix_vertex_unvisited_index to index. During DFS, we can see that the index of unvisited vertices is the same as unix_vertex_unvisited_index. When we finalise SCC later, we set unix_vertex_grouped_index to each vertex's index. Then, we can know (i) that the vertex is on the stack if the index of a visited vertex is >= 2 and (ii) that it is not on the stack and belongs to a different SCC if the index is unix_vertex_grouped_index. After the whole algorithm, all indices of vertices are set as unix_vertex_grouped_index. Next time we start DFS, we know that all unvisited vertices have unix_vertex_grouped_index, and we can use unix_vertex_unvisited_index as the not-on-stack marker. To use the same variable in __unix_walk_scc(), we can swap unix_vertex_(grouped|unvisited)_index at the end of Tarjan's algorithm. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-10-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29af_unix: Fix up unix_edge.successor for embryo socket.Kuniyuki Iwashima1-0/+1
To garbage collect inflight AF_UNIX sockets, we must define the cyclic reference appropriately. This is a bit tricky if the loop consists of embryo sockets. Suppose that the fd of AF_UNIX socket A is passed to D and the fd B to C and that C and D are embryo sockets of A and B, respectively. It may appear that there are two separate graphs, A (-> D) and B (-> C), but this is not correct. A --. .-- B X C <-' `-> D Now, D holds A's refcount, and C has B's refcount, so unix_release() will never be called for A and B when we close() them. However, no one can call close() for D and C to free skbs holding refcounts of A and B because C/D is in A/B's receive queue, which should have been purged by unix_release() for A and B. So, here's another type of cyclic reference. When a fd of an AF_UNIX socket is passed to an embryo socket, the reference is indirectly held by its parent listening socket. .-> A .-> B | `- sk_receive_queue | `- sk_receive_queue | `- skb | `- skb | `- sk == C | `- sk == D | `- sk_receive_queue | `- sk_receive_queue | `- skb +---------' `- skb +-. | | `---------------------------------------------------------' Technically, the graph must be denoted as A <-> B instead of A (-> D) and B (-> C) to find such a cyclic reference without touching each socket's receive queue. .-> A --. .-- B <-. | X | == A <-> B `-- C <-' `-> D --' We apply this fixup during GC by fetching the real successor by unix_edge_successor(). When we call accept(), we clear unix_sock.listener under unix_gc_lock not to confuse GC. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29af_unix: Save listener for embryo socket.Kuniyuki Iwashima1-0/+1
This is a prep patch for the following change, where we need to fetch the listening socket from the successor embryo socket during GC. We add a new field to struct unix_sock to save a pointer to a listening socket. We set it when connect() creates a new socket, and clear it when accept() is called. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29af_unix: Detect Strongly Connected Components.Kuniyuki Iwashima1-0/+3
In the new GC, we use a simple graph algorithm, Tarjan's Strongly Connected Components (SCC) algorithm, to find cyclic references. The algorithm visits every vertex exactly once using depth-first search (DFS). DFS starts by pushing an input vertex to a stack and assigning it a unique number. Two fields, index and lowlink, are initialised with the number, but lowlink could be updated later during DFS. If a vertex has an edge to an unvisited inflight vertex, we visit it and do the same processing. So, we will have vertices in the stack in the order they appear and number them consecutively in the same order. If a vertex has a back-edge to a visited vertex in the stack, we update the predecessor's lowlink with the successor's index. After iterating edges from the vertex, we check if its index equals its lowlink. If the lowlink is different from the index, it shows there was a back-edge. Then, we go backtracking and propagate the lowlink to its predecessor and resume the previous edge iteration from the next edge. If the lowlink is the same as the index, we pop vertices before and including the vertex from the stack. Then, the set of vertices is SCC, possibly forming a cycle. At the same time, we move the vertices to unix_visited_vertices. When we finish the algorithm, all vertices in each SCC will be linked via unix_vertex.scc_entry. Let's take an example. We have a graph including five inflight vertices (F is not inflight): A -> B -> C -> D -> E (-> F) ^ | `---------' Suppose that we start DFS from C. We will visit C, D, and B first and initialise their index and lowlink. Then, the stack looks like this: > B = (3, 3) (index, lowlink) D = (2, 2) C = (1, 1) When checking B's edge to C, we update B's lowlink with C's index and propagate it to D. B = (3, 1) (index, lowlink) > D = (2, 1) C = (1, 1) Next, we visit E, which has no edge to an inflight vertex. > E = (4, 4) (index, lowlink) B = (3, 1) D = (2, 1) C = (1, 1) When we leave from E, its index and lowlink are the same, so we pop E from the stack as single-vertex SCC. Next, we leave from B and D but do nothing because their lowlink are different from their index. B = (3, 1) (index, lowlink) D = (2, 1) > C = (1, 1) Then, we leave from C, whose index and lowlink are the same, so we pop B, D and C as SCC. Last, we do DFS for the rest of vertices, A, which is also a single-vertex SCC. Finally, each unix_vertex.scc_entry is linked as follows: A -. B -> C -> D E -. ^ | ^ | ^ | `--' `---------' `--' We use SCC later to decide whether we can garbage-collect the sockets. Note that we still cannot detect SCC properly if an edge points to an embryo socket. The following two patches will sort it out. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29af_unix: Iterate all vertices by DFS.Kuniyuki Iwashima1-0/+2
The new GC will use a depth first search graph algorithm to find cyclic references. The algorithm visits every vertex exactly once. Here, we implement the DFS part without recursion so that no one can abuse it. unix_walk_scc() marks every vertex unvisited by initialising index as UNIX_VERTEX_INDEX_UNVISITED and iterates inflight vertices in unix_unvisited_vertices and call __unix_walk_scc() to start DFS from an arbitrary vertex. __unix_walk_scc() iterates all edges starting from the vertex and explores the neighbour vertices with DFS using edge_stack. After visiting all neighbours, __unix_walk_scc() moves the visited vertex to unix_visited_vertices so that unix_walk_scc() will not restart DFS from the visited vertex. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29af_unix: Link struct unix_edge when queuing skb.Kuniyuki Iwashima2-0/+3
Just before queuing skb with inflight fds, we call scm_stat_add(), which is a good place to set up the preallocated struct unix_vertex and struct unix_edge in UNIXCB(skb).fp. Then, we call unix_add_edges() and construct the directed graph as follows: 1. Set the inflight socket's unix_sock to unix_edge.predecessor. 2. Set the receiver's unix_sock to unix_edge.successor. 3. Set the preallocated vertex to inflight socket's unix_sock.vertex. 4. Link inflight socket's unix_vertex.entry to unix_unvisited_vertices. 5. Link unix_edge.vertex_entry to the inflight socket's unix_vertex.edges. Let's say we pass the fd of AF_UNIX socket A to B and the fd of B to C. The graph looks like this: +-------------------------+ | unix_unvisited_vertices | <-------------------------. +-------------------------+ | + | | +--------------+ +--------------+ | +--------------+ | | unix_sock A | <---. .---> | unix_sock B | <-|-. .---> | unix_sock C | | +--------------+ | | +--------------+ | | | +--------------+ | .-+ | vertex | | | .-+ | vertex | | | | | vertex | | | +--------------+ | | | +--------------+ | | | +--------------+ | | | | | | | | | | +--------------+ | | | +--------------+ | | | | '-> | unix_vertex | | | '-> | unix_vertex | | | | | +--------------+ | | +--------------+ | | | `---> | entry | +---------> | entry | +-' | | |--------------| | | |--------------| | | | edges | <-. | | | edges | <-. | | +--------------+ | | | +--------------+ | | | | | | | | | .----------------------' | | .----------------------' | | | | | | | | | +--------------+ | | | +--------------+ | | | | unix_edge | | | | | unix_edge | | | | +--------------+ | | | +--------------+ | | `-> | vertex_entry | | | `-> | vertex_entry | | | |--------------| | | |--------------| | | | predecessor | +---' | | predecessor | +---' | |--------------| | |--------------| | | successor | +-----' | successor | +-----' +--------------+ +--------------+ Henceforth, we denote such a graph as A -> B (-> C). Now, we can express all inflight fd graphs that do not contain embryo sockets. We will support the particular case later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29af_unix: Allocate struct unix_edge for each inflight AF_UNIX fd.Kuniyuki Iwashima2-0/+11
As with the previous patch, we preallocate to skb's scm_fp_list an array of struct unix_edge in the number of inflight AF_UNIX fds. There we just preallocate memory and do not use immediately because sendmsg() could fail after this point. The actual use will be in the next patch. When we queue skb with inflight edges, we will set the inflight socket's unix_sock as unix_edge->predecessor and the receiver's unix_sock as successor, and then we will link the edge to the inflight socket's unix_vertex.edges. Note that we set NULL to cloned scm_fp_list.edges in scm_fp_dup() so that MSG_PEEK does not change the shape of the directed graph. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd.Kuniyuki Iwashima2-0/+12
We will replace the garbage collection algorithm for AF_UNIX, where we will consider each inflight AF_UNIX socket as a vertex and its file descriptor as an edge in a directed graph. This patch introduces a new struct unix_vertex representing a vertex in the graph and adds its pointer to struct unix_sock. When we send a fd using the SCM_RIGHTS message, we allocate struct scm_fp_list to struct scm_cookie in scm_fp_copy(). Then, we bump each refcount of the inflight fds' struct file and save them in scm_fp_list.fp. After that, unix_attach_fds() inexplicably clones scm_fp_list of scm_cookie and sets it to skb. (We will remove this part after replacing GC.) Here, we add a new function call in unix_attach_fds() to preallocate struct unix_vertex per inflight AF_UNIX fd and link each vertex to skb's scm_fp_list.vertices. When sendmsg() succeeds later, if the socket of the inflight fd is still not inflight yet, we will set the preallocated vertex to struct unix_sock.vertex and link it to a global list unix_unvisited_vertices under spin_lock(&unix_gc_lock). If the socket is already inflight, we free the preallocated vertex. This is to avoid taking the lock unnecessarily when sendmsg() could fail later. In the following patch, we will similarly allocate another struct per edge, which will finally be linked to the inflight socket's unix_vertex.edges. And then, we will count the number of edges as unix_vertex.out_degree. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29net/smc: make smc_hash_sk/smc_unhash_sk staticZhengchao Shao1-3/+0
smc_hash_sk and smc_unhash_sk are only used in af_smc.c, so make them static and remove the output symbol. They can be called under the path .prot->hash()/unhash(). Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-29net: sched: make skip_sw actually skip softwareAsbjørn Sloth Tønnesen2-0/+10
TC filters come in 3 variants: - no flag (try to process in hardware, but fallback to software)) - skip_hw (do not process filter by hardware) - skip_sw (do not process filter by software) However skip_sw is implemented so that the skip_sw flag can first be checked, after it has been matched. IMHO it's common when using skip_sw, to use it on all rules. So if all filters in a block is skip_sw filters, then we can bail early, we can thus avoid having to match the filters, just to check for the skip_sw flag. This patch adds a bypass, for when only TC skip_sw rules are used. The bypass is guarded by a static key, to avoid harming other workloads. There are 3 ways that a packet from a skip_sw ruleset, can end up in the kernel path. Although the send packets to a non-existent chain way is only improved a few percents, then I believe it's worth optimizing the trap and fall-though use-cases. +----------------------------+--------+--------+--------+ | Test description | Pre- | Post- | Rel. | | | kpps | kpps | chg. | +----------------------------+--------+--------+--------+ | basic forwarding + notrack | 3589.3 | 3587.9 | 1.00x | | switch to eswitch mode | 3081.8 | 3094.7 | 1.00x | | add ingress qdisc | 3042.9 | 3063.6 | 1.01x | | tc forward in hw / skip_sw |37024.7 |37028.4 | 1.00x | | tc forward in sw / skip_hw | 3245.0 | 3245.3 | 1.00x | +----------------------------+--------+--------+--------+ | tests with only skip_sw rules below: | +----------------------------+--------+--------+--------+ | 1 non-matching rule | 2694.7 | 3058.7 | 1.14x | | 1 n-m rule, match trap | 2611.2 | 3323.1 | 1.27x | | 1 n-m rule, goto non-chain | 2886.8 | 2945.9 | 1.02x | | 5 non-matching rules | 1958.2 | 3061.3 | 1.56x | | 5 n-m rules, match trap | 1911.9 | 3327.0 | 1.74x | | 5 n-m rules, goto non-chain| 2883.1 | 2947.5 | 1.02x | | 10 non-matching rules | 1466.3 | 3062.8 | 2.09x | | 10 n-m rules, match trap | 1444.3 | 3317.9 | 2.30x | | 10 n-m rules,goto non-chain| 2883.1 | 2939.5 | 1.02x | | 25 non-matching rules | 838.5 | 3058.9 | 3.65x | | 25 n-m rules, match trap | 824.5 | 3323.0 | 4.03x | | 25 n-m rules,goto non-chain| 2875.8 | 2944.7 | 1.02x | | 50 non-matching rules | 488.1 | 3054.7 | 6.26x | | 50 n-m rules, match trap | 484.9 | 3318.5 | 6.84x | | 50 n-m rules,goto non-chain| 2884.1 | 2939.7 | 1.02x | +----------------------------+--------+--------+--------+ perf top (25 n-m skip_sw rules - pre patch): 20.39% [kernel] [k] __skb_flow_dissect 16.43% [kernel] [k] rhashtable_jhash2 10.58% [kernel] [k] fl_classify 10.23% [kernel] [k] fl_mask_lookup 4.79% [kernel] [k] memset_orig 2.58% [kernel] [k] tcf_classify 1.47% [kernel] [k] __x86_indirect_thunk_rax 1.42% [kernel] [k] __dev_queue_xmit 1.36% [kernel] [k] nft_do_chain 1.21% [kernel] [k] __rcu_read_lock perf top (25 n-m skip_sw rules - post patch): 5.12% [kernel] [k] __dev_queue_xmit 4.77% [kernel] [k] nft_do_chain 3.65% [kernel] [k] dev_gro_receive 3.41% [kernel] [k] check_preemption_disabled 3.14% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear 2.88% [kernel] [k] __netif_receive_skb_core.constprop.0 2.49% [kernel] [k] mlx5e_xmit 2.15% [kernel] [k] ip_forward 1.95% [kernel] [k] mlx5e_tc_restore_tunnel 1.92% [kernel] [k] vlan_gro_receive Test setup: DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G Data rate measured on switch (Extreme X690), and DUT connected as a router on a stick, with pktgen and pktsink as VLANs. Pktgen-dpdk was in range 36.6-37.7 Mpps 64B packets across all tests. Full test data at https://files.fiberby.net/ast/2024/tc_skip_sw/v2_tests/ Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-29net: sched: cls_api: add filter counterAsbjørn Sloth Tønnesen1-0/+2
Maintain a count of filters per block. Counter updates are protected by cb_lock, which is also used to protect the offload counters. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-29net: sched: cls_api: add skip_sw counterAsbjørn Sloth Tønnesen1-0/+1
Maintain a count of skip_sw filters. This counter is protected by the cb_lock, and is updated at the same time as offloadcnt. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-29net: phylink: add rxc_always_on flag to phylink_pcsRomain Gantois1-0/+38
Some MAC drivers (e.g. stmmac) require a continuous receive clock signal to be generated by a PCS that is handled by a standalone PCS driver. Such a PCS driver does not have access to a PHY device, thus cannot check the PHY_F_RXC_ALWAYS_ON flag. They cannot check max_requires_rxc in the phylink config either, since it is a private member. Therefore, a new flag is needed to signal to the PCS that it should keep the RX clock signal up at all times. Co-developed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Romain Gantois <romain.gantois@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-2-24a74e5c761f@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29net: phylink: add PHY_F_RXC_ALWAYS_ON to PHY dev flagsRussell King (Oracle)2-0/+5
Some MAC controllers (e.g. stmmac) require their connected PHY to continuously provide a receive clock signal. This can cause issues in two cases: 1. The clock signal hasn't been started yet by the time the MAC driver initializes its hardware. This can make the initialization fail, as in the case of the rzn1 GMAC1 driver. 2. The clock signal is cut during a power saving event. By the time the MAC is brought back up, the clock signal is still not active since phylink_start hasn't been called yet. This brings us back to case 1. If a PHY driver reads this flag, it should ensure that the receive clock signal is started as soon as possible, and that it isn't brought down when the PHY goes into suspend. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [rgantois: commit log] Signed-off-by: Romain Gantois <romain.gantois@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-1-24a74e5c761f@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>