path: root/net
2023-10-18  net: skb_find_text: Ignore patterns extending past 'to'  [Phil Sutter, 1 file, -1/+2]
Assume that the caller's 'to' offset really represents an upper boundary for the pattern search, so patterns extending past this offset are to be rejected. The old behaviour was also inconsistent with respect to fragmentation (or otherwise non-linear skbs): if the pattern started between the 'from' and 'to' offsets but extended into the next fragment, it was not found if the 'to' offset was still within the current fragment. Test the new behaviour in a kselftest using iptables' string match. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Fixes: f72b948dcbb8 ("[NET]: skb_find_text ignores to argument") Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: Florian Westphal <fw@strlen.de> Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
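For reference, a minimal sketch of how a caller drives this search (the pattern, helper name, and offsets here are illustrative, not from the patch; the textsearch API is from <linux/textsearch.h>):

    /* search for "GET " within bytes [from, to) of an skb; with this fix,
     * a match whose tail would extend past 'to' is rejected */
    static unsigned int find_get_marker(struct sk_buff *skb,
                                        unsigned int from, unsigned int to)
    {
        struct ts_config *ts;
        unsigned int pos;

        ts = textsearch_prepare("bm", "GET ", 4, GFP_KERNEL, TS_AUTOLOAD);
        if (IS_ERR(ts))
            return UINT_MAX;
        pos = skb_find_text(skb, from, to, ts); /* UINT_MAX when not found */
        textsearch_destroy(ts);
        return pos;
    }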
2023-10-18  netfilter: nf_tables: de-constify set commit ops function argument  [Florian Westphal, 1 file, -4/+3]
The set backend using this already has to work around this via an ugly cast; don't spread this pattern. Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18  netfilter: bridge: convert br_netfilter to NF_DROP_REASON  [Florian Westphal, 2 files, -16/+16]
errno is 0 because these hooks are called from prerouting and forward. There is no socket that the errno would ever be propagated to. Other netfilter modules (e.g. nf_nat, conntrack, ...) can be converted in a similar way. Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18  netfilter: make nftables drops visible in net dropmonitor  [Florian Westphal, 2 files, -4/+8]
net_dropmonitor blames core.c:nf_hook_slow. Add the NF_DROP_REASON() helper and use it in nft_do_chain(). The helper releases the skb, so the exact drop location becomes available. Calling code will observe the NF_STOLEN verdict instead. Adjust nf_hook_slow so we can embed an errno value with NF_STOLEN verdicts, just like we do for NF_DROP. After this, drops in nftables can be pinpointed to a drop due to a rule or to the chain policy. Signed-off-by: Florian Westphal <fw@strlen.de>
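A hedged sketch of the helper's shape as described above (the exact definition lives in include/linux/netfilter.h; treat this as an approximation, including the bit layout of the embedded errno):

    static __always_inline int
    NF_DROP_REASON(struct sk_buff *skb, enum skb_drop_reason reason, u32 err)
    {
        BUILD_BUG_ON(err > 0xffff);

        kfree_skb_reason(skb, reason);      /* exact drop location recorded */

        return ((err << 16) | NF_STOLEN);   /* caller sees NF_STOLEN + errno */
    }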
2023-10-18  netfilter: nf_nat: mask out non-verdict bits when checking return value  [Florian Westphal, 1 file, -2/+3]
Same as the previous change: we need to mask out the non-verdict bits, as upcoming patches may embed an errno value in NF_STOLEN verdicts too. NF_DROP could already do this, but not all called functions handle it. Checks that only test ret vs NF_ACCEPT are fine; the 'errno' parts are always 0 for those. Signed-off-by: Florian Westphal <fw@strlen.de>
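The masking idiom being introduced, as a small sketch (NF_VERDICT_MASK and NF_DROP_GETERR() are existing netfilter macros):

    static int check_verdict(unsigned int ret)
    {
        /* compare only the verdict byte; an errno embedded in the upper
         * bits (as later NF_DROP_REASON patches do) must not confuse us */
        if ((ret & NF_VERDICT_MASK) == NF_DROP)
            return NF_DROP_GETERR(ret);     /* recover the embedded errno */
        return 0;
    }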
2023-10-18  netfilter: conntrack: convert nf_conntrack_update to netfilter verdicts  [Florian Westphal, 2 files, -31/+42]
This function calls helpers that can return nf-verdicts, but then those get converted to -1/0 as that's what the caller expects. Theoretically NF_DROP could have an errno number set in the upper 24 bits of the return value. Or any of those helpers could return NF_STOLEN, which would result in use-after-free. This is fine as-is: the called functions don't do this yet. But it's better to avoid possible future problems if the upcoming patchset to add NF_DROP_REASON() support gains further users, so remove the 0/-1 translation from the picture and pass the verdicts down to the caller. Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18  netfilter: nf_tables: mask out non-verdict bits when checking return value  [Florian Westphal, 2 files, -3/+7]
nftables trace infra must mask out the non-verdict bits of the return value, else followup changes that return 'errno << 8 | NF_STOLEN' will cause breakage. Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18  netfilter: xt_mangle: only check verdict part of return value  [Florian Westphal, 2 files, -8/+10]
These checks assume that the caller only returns NF_DROP without any errno embedded in the upper bits. This is fine right now, but followup patches will start to propagate such errors to allow kfree_skb_drop_reason() in the called functions, those would then indicate 'errno << 8 | NF_STOLEN'. To not break things we have to mask those parts out. Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18  devlink: document devlink_rel_nested_in_notify() function  [Jiri Pirko, 1 file, -0/+14]
Add documentation for devlink_rel_nested_in_notify() describing the devlink instance locking consequences. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18  devlink: don't take instance lock for nested handle put  [Jiri Pirko, 1 file, -14/+3]
Lockdep reports the following issue:

  WARNING: possible circular locking dependency detected
  ------------------------------------------------------
  devlink/8191 is trying to acquire lock:
  ffff88813f32c250 (&devlink->lock_key#14){+.+.}-{3:3}, at: devlink_rel_devlink_handle_put+0x11e/0x2d0

  but task is already holding lock:
  ffffffff8511eca8 (rtnl_mutex){+.+.}-{3:3}, at: unregister_netdev+0xe/0x20

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (rtnl_mutex){+.+.}-{3:3}:
         lock_acquire+0x1c3/0x500
         __mutex_lock+0x14c/0x1b20
         register_netdevice_notifier_net+0x13/0x30
         mlx5_lag_add_mdev+0x51c/0xa00 [mlx5_core]
         mlx5_load+0x222/0xc70 [mlx5_core]
         mlx5_init_one_devl_locked+0x4a0/0x1310 [mlx5_core]
         mlx5_init_one+0x3b/0x60 [mlx5_core]
         probe_one+0x786/0xd00 [mlx5_core]
         local_pci_probe+0xd7/0x180
         pci_device_probe+0x231/0x720
         really_probe+0x1e4/0xb60
         __driver_probe_device+0x261/0x470
         driver_probe_device+0x49/0x130
         __driver_attach+0x215/0x4c0
         bus_for_each_dev+0xf0/0x170
         bus_add_driver+0x21d/0x590
         driver_register+0x133/0x460
         vdpa_match_remove+0x89/0xc0 [vdpa]
         do_one_initcall+0xc4/0x360
         do_init_module+0x22d/0x760
         load_module+0x51d7/0x6750
         init_module_from_file+0xd2/0x130
         idempotent_init_module+0x326/0x5a0
         __x64_sys_finit_module+0xc1/0x130
         do_syscall_64+0x3d/0x90
         entry_SYSCALL_64_after_hwframe+0x46/0xb0

  -> #2 (mlx5_intf_mutex){+.+.}-{3:3}:
         lock_acquire+0x1c3/0x500
         __mutex_lock+0x14c/0x1b20
         mlx5_register_device+0x3e/0xd0 [mlx5_core]
         mlx5_init_one_devl_locked+0x8fa/0x1310 [mlx5_core]
         mlx5_devlink_reload_up+0x147/0x170 [mlx5_core]
         devlink_reload+0x203/0x380
         devlink_nl_cmd_reload+0xb84/0x10e0
         genl_family_rcv_msg_doit+0x1cc/0x2a0
         genl_rcv_msg+0x3c9/0x670
         netlink_rcv_skb+0x12c/0x360
         genl_rcv+0x24/0x40
         netlink_unicast+0x435/0x6f0
         netlink_sendmsg+0x7a0/0xc70
         sock_sendmsg+0xc5/0x190
         __sys_sendto+0x1c8/0x290
         __x64_sys_sendto+0xdc/0x1b0
         do_syscall_64+0x3d/0x90
         entry_SYSCALL_64_after_hwframe+0x46/0xb0

  -> #1 (&dev->lock_key#8){+.+.}-{3:3}:
         lock_acquire+0x1c3/0x500
         __mutex_lock+0x14c/0x1b20
         mlx5_init_one_devl_locked+0x45/0x1310 [mlx5_core]
         mlx5_devlink_reload_up+0x147/0x170 [mlx5_core]
         devlink_reload+0x203/0x380
         devlink_nl_cmd_reload+0xb84/0x10e0
         genl_family_rcv_msg_doit+0x1cc/0x2a0
         genl_rcv_msg+0x3c9/0x670
         netlink_rcv_skb+0x12c/0x360
         genl_rcv+0x24/0x40
         netlink_unicast+0x435/0x6f0
         netlink_sendmsg+0x7a0/0xc70
         sock_sendmsg+0xc5/0x190
         __sys_sendto+0x1c8/0x290
         __x64_sys_sendto+0xdc/0x1b0
         do_syscall_64+0x3d/0x90
         entry_SYSCALL_64_after_hwframe+0x46/0xb0

  -> #0 (&devlink->lock_key#14){+.+.}-{3:3}:
         check_prev_add+0x1af/0x2300
         __lock_acquire+0x31d7/0x4eb0
         lock_acquire+0x1c3/0x500
         __mutex_lock+0x14c/0x1b20
         devlink_rel_devlink_handle_put+0x11e/0x2d0
         devlink_nl_port_fill+0xddf/0x1b00
         devlink_port_notify+0xb5/0x220
         __devlink_port_type_set+0x151/0x510
         devlink_port_netdevice_event+0x17c/0x220
         notifier_call_chain+0x97/0x240
         unregister_netdevice_many_notify+0x876/0x1790
         unregister_netdevice_queue+0x274/0x350
         unregister_netdev+0x18/0x20
         mlx5e_vport_rep_unload+0xc5/0x1c0 [mlx5_core]
         __esw_offloads_unload_rep+0xd8/0x130 [mlx5_core]
         mlx5_esw_offloads_rep_unload+0x52/0x70 [mlx5_core]
         mlx5_esw_offloads_unload_rep+0x85/0xc0 [mlx5_core]
         mlx5_eswitch_unload_sf_vport+0x41/0x90 [mlx5_core]
         mlx5_devlink_sf_port_del+0x120/0x280 [mlx5_core]
         genl_family_rcv_msg_doit+0x1cc/0x2a0
         genl_rcv_msg+0x3c9/0x670
         netlink_rcv_skb+0x12c/0x360
         genl_rcv+0x24/0x40
         netlink_unicast+0x435/0x6f0
         netlink_sendmsg+0x7a0/0xc70
         sock_sendmsg+0xc5/0x190
         __sys_sendto+0x1c8/0x290
         __x64_sys_sendto+0xdc/0x1b0
         do_syscall_64+0x3d/0x90
         entry_SYSCALL_64_after_hwframe+0x46/0xb0

  other info that might help us debug this:

  Chain exists of:
    &devlink->lock_key#14 --> mlx5_intf_mutex --> rtnl_mutex

  Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(rtnl_mutex);
                                 lock(mlx5_intf_mutex);
                                 lock(rtnl_mutex);
    lock(&devlink->lock_key#14);

The problem is taking the devlink instance lock of a nested instance while RTNL is already held. To fix this, don't take the devlink instance lock when putting a nested handle. Instead, rely on the preparations done by the previous two patches to be able to access the device pointer and obtain the netns id without the devlink instance lock held. Fixes: c137743bce02 ("devlink: introduce object and nested devlink relationship infra") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18  devlink: take device reference for devlink object  [Jiri Pirko, 1 file, -1/+2]
In preparation for allowing access to the device pointer without the devlink instance lock held, make sure the device pointer is usable until devlink_release() is called. Fixes: c137743bce02 ("devlink: introduce object and nested devlink relationship infra") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18  devlink: call peernet2id_alloc() with net pointer under RCU read lock  [Jiri Pirko, 1 file, -3/+9]
peernet2id_alloc() can be called locklessly with a peer net pointer obtained in an RCU critical section, and makes sure to return an ns ID only if the net namespace is not being removed concurrently. Benefit from the read_pnet_rcu() helper addition: use it to obtain the net pointer under the RCU read lock and pass it to peernet2id_alloc() to get the ns ID. Fixes: c137743bce02 ("devlink: introduce object and nested devlink relationship infra") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
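The resulting pattern, sketched (variable names like msg_net are assumptions for illustration, not taken from the patch):

    int ns_id;
    struct net *net;

    rcu_read_lock();
    /* read_pnet_rcu() yields a net pointer valid within this RCU section;
     * peernet2id_alloc() checks it is not being removed concurrently */
    net = read_pnet_rcu(&devlink->_net);
    ns_id = peernet2id_alloc(msg_net, net, GFP_ATOMIC);
    rcu_read_unlock();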
2023-10-18  net: bridge: Set strict_start_type for br_policy  [Johannes Nixdorf, 1 file, -0/+2]
Set any new attributes added to br_policy to be parsed strictly, to prevent userspace from passing garbage. Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-4-32cddff87758@avm.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
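A hedged sketch of what setting strict_start_type looks like (the chosen start attribute is an assumption based on the following patches; the attribute list is abridged):

    static const struct nla_policy br_policy[IFLA_BR_MAX + 1] = {
        /* attributes from this type onward get strict validation, so
         * unknown or garbage payloads are rejected instead of ignored */
        [IFLA_BR_UNSPEC] = { .strict_start_type = IFLA_BR_FDB_N_LEARNED },
        [IFLA_BR_FORWARD_DELAY] = { .type = NLA_U32 },
        /* ... remaining attributes unchanged ... */
    };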
2023-10-18  net: bridge: Add netlink knobs for number / max learned FDB entries  [Johannes Nixdorf, 1 file, -1/+14]
The previous patch added accounting and a limit for the number of dynamically learned FDB entries per bridge. However it did not provide means to actually configure those bounds or read back the count. This patch does that.

Two new netlink attributes are added for the accounting and limit of dynamically learned FDB entries:
 - IFLA_BR_FDB_N_LEARNED (RO) for the number of entries accounted for a single bridge.
 - IFLA_BR_FDB_MAX_LEARNED (RW) for the configured limit of entries for the bridge.

The new attributes are used like this:

  # ip link add name br up type bridge fdb_max_learned 256
  # ip link add name v1 up master br type veth peer v2
  # ip link set up dev v2
  # mausezahn -a rand -c 1024 v2
  0.01 seconds (90877 packets per second)
  # bridge fdb | grep -v permanent | wc -l
  256
  # ip -d link show dev br
  13: br: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 [...]
      [...] fdb_n_learned 256 fdb_max_learned 256

Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-3-32cddff87758@avm.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18  net: bridge: Track and limit dynamically learned FDB entries  [Johannes Nixdorf, 2 files, -2/+37]
A malicious actor behind one bridge port may spam the kernel with packets with a random source MAC address, each of which will create an FDB entry, each of which is a dynamic allocation in the kernel. There are roughly 2^48 different MAC addresses, further limited by the rhashtable they are stored in to 2^31. Each entry is of the type struct net_bridge_fdb_entry, which is currently 128 bytes big. This means the maximum amount of memory allocated for FDB entries is 2^31 * 128B = 256GiB, which is too much for most computers. Mitigate this by maintaining a per bridge count of those automatically generated entries in fdb_n_learned, and a limit in fdb_max_learned. If the limit is hit new entries are not learned anymore. For backwards compatibility the default setting of 0 disables the limit. User-added entries by netlink or from bridge or bridge port addresses are never blocked and do not count towards that limit. Introduce a new fdb entry flag BR_FDB_DYNAMIC_LEARNED to keep track of whether an FDB entry is included in the count. The flag is enabled for dynamically learned entries, and disabled for all other entries. This should be equivalent to BR_FDB_ADDED_BY_USER and BR_FDB_LOCAL being unset, but contrary to the two flags it can be toggled atomically. Atomicity is required here, as there are multiple callers that modify the flags, but are not under a common lock (br_fdb_update is the exception for br->hash_lock, br_fdb_external_learn_add for RTNL). Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de> Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-2-32cddff87758@avm.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
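An illustrative sketch of the accounting described above (field names follow the commit message; the real code differs in detail):

    /* count an entry only on a 0->1 transition of the flag, so racing
     * updaters cannot double-count; a limit of 0 means "no limit" */
    static bool br_fdb_count_learned(struct net_bridge *br,
                                     struct net_bridge_fdb_entry *fdb)
    {
        u32 max = READ_ONCE(br->fdb_max_learned);

        if (max && atomic_read(&br->fdb_n_learned) >= max)
            return false;               /* limit hit: don't learn */
        if (!test_and_set_bit(BR_FDB_DYNAMIC_LEARNED, &fdb->flags))
            atomic_inc(&br->fdb_n_learned);
        return true;
    }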
2023-10-18  net: bridge: Set BR_FDB_ADDED_BY_USER early in fdb_add_entry  [Johannes Nixdorf, 1 file, -3/+4]
In preparation for the following fdb limit for dynamically learned entries, allow fdb_create to detect that the entry was added by the user. This way it can skip applying the limit in this case. Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de> Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-1-32cddff87758@avm.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17  net: openvswitch: Annotate struct mask_array with __counted_by  [Christophe JAILLET, 1 file, -1/+1]
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/ca5c8049f58bb933f231afd0816e30a5aaa0eddd.1697264974.git.christophe.jaillet@wanadoo.fr Signed-off-by: Paolo Abeni <pabeni@redhat.com>
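The annotation in question, sketched with an abridged struct layout (treat the field list as approximate):

    struct mask_array {
        struct rcu_head rcu;
        int count, max;
        /* ... */
        /* declare the flex array as bounded by 'max', enabling run-time
         * bounds checks under CONFIG_UBSAN_BOUNDS / CONFIG_FORTIFY_SOURCE */
        struct sw_flow_mask __rcu *masks[] __counted_by(max);
    };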
2023-10-17  net: openvswitch: Use struct_size()  [Christophe JAILLET, 1 file, -5/+2]
Use struct_size() instead of hand writing it. This is less verbose and more robust. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/e5122b4ff878cbf3ed72653a395ad5c4da04dc1e.1697264974.git.christophe.jaillet@wanadoo.fr Signed-off-by: Paolo Abeni <pabeni@redhat.com>
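A hedged before/after sketch of the conversion (struct_size() comes from <linux/overflow.h>):

    /* before: hand-written size computation, unchecked for overflow */
    new = kzalloc(sizeof(struct mask_array) +
                  sizeof(struct sw_flow_mask __rcu *) * size, GFP_KERNEL);

    /* after: struct_size() computes header + flex-array size with
     * overflow checking */
    new = kzalloc(struct_size(new, masks, size), GFP_KERNEL);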
2023-10-17  net: gso_test: release each segment individually  [Florian Westphal, 1 file, -5/+9]
consume_skb() doesn't walk the segment list, so segments other than the first are leaked. Move the consume_skb() call into the loop. Cc: Willem de Bruijn <willemb@google.com> Fixes: b3098d32ed6e ("net: add skb_segment kunit test") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
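The shape of the fix, as a sketch:

    /* consume_skb() frees only the skb it is given, so walk the
     * ->next chain and release every segment */
    struct sk_buff *skb, *next;

    for (skb = segs; skb; skb = next) {
        next = skb->next;
        consume_skb(skb);
    }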
2023-10-17  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  [Jakub Kicinski, 10 files, -22/+122]
Daniel Borkmann says:

====================
pull-request: bpf-next 2023-10-16

We've added 90 non-merge commits during the last 25 day(s) which contain a total of 120 files changed, 3519 insertions(+), 895 deletions(-).

The main changes are:

1) Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs, from Jiri Olsa.
2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is for systemd to reimplement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services, from Daan De Meyer.
3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich.
4) Improve BPF verifier log output for scalar registers to better disambiguate their internal state wrt defaults vs min/max values matching, from Andrii Nakryiko.
5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving the source IP address with a new BPF_FIB_LOOKUP_SRC flag, from Martynas Pumputis.
6) Add support for open-coded task_vma iterator to help with symbolization for BPF-collected user stacks, from Dave Marchevsky.
7) Add libbpf getters for accessing individual BPF ring buffers which is useful for polling them individually, for example, from Martin Kelly.
8) Extend AF_XDP selftests to validate the SHARED_UMEM feature, from Tushar Vyavahare.
9) Improve BPF selftests cross-building support for riscv arch, from Björn Töpel.
10) Add the ability to pin a BPF timer to the same calling CPU, from David Vernet.
11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments, from Alexandre Ghiti.
12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen.
13) Fix bpftool's skeleton code generation to guarantee that ELF data is 8 byte aligned, from Ian Rogers.
14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4 security mitigations in BPF verifier, from Yafang Shao.
15) Annotate struct bpf_stack_map with __counted_by attribute to prepare BPF side for upcoming __counted_by compiler support, from Kees Cook.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits)
  bpf: Ensure proper register state printing for cond jumps
  bpf: Disambiguate SCALAR register state output in verifier logs
  selftests/bpf: Make align selftests more robust
  selftests/bpf: Improve missed_kprobe_recursion test robustness
  selftests/bpf: Improve percpu_alloc test robustness
  selftests/bpf: Add tests for open-coded task_vma iter
  bpf: Introduce task_vma open-coded iterator kfuncs
  selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c
  bpf: Don't explicitly emit BTF for struct btf_iter_num
  bpf: Change syscall_nr type to int in struct syscall_tp_t
  net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set
  bpf: Avoid unnecessary audit log for CPU security mitigations
  selftests/bpf: Add tests for cgroup unix socket address hooks
  selftests/bpf: Make sure mount directory exists
  documentation/bpf: Document cgroup unix socket address hooks
  bpftool: Add support for cgroup unix socket address hooks
  libbpf: Add support for cgroup unix socket address hooks
  bpf: Implement cgroup sockaddr hooks for unix sockets
  bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
  bpf: Propagate modified uaddrlen from cgroup sockaddr programs
  ...
====================

Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17  page_pool: fragment API support for 32-bit arch with 64-bit DMA  [Yunsheng Lin, 1 file, -5/+9]
Currently page_pool_alloc_frag() is not supported on 32-bit arches with 64-bit DMA because of the overlap between pp_frag_count and dma_addr_upper in 'struct page' for those arches, which seem to be quite common, see [1]; this means a driver may need to handle it when using the fragment API. It is assumed that the combination of such an arch with an address space >16TB does not exist: as all those arches have a 64b equivalent, it seems logical to use the 64b version on a system with a large address space. It is also assumed that the dma address is page aligned when we are dma-mapping a page-aligned buffer, see [2]. That means we're storing 12 bits of 0 at the lower end of a dma address, so we can reuse those bits on the above arches to support 32b+12b, which is 16TB of memory. If we made a wrong assumption, a warning is emitted so that users can report it to us. 1. https://lore.kernel.org/all/20211117075652.58299-1-linyunsheng@huawei.com/ 2. https://lore.kernel.org/all/20230818145145.4b357c89@kernel.org/ Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Liang Chen <liangchen.linux@gmail.com> CC: Guillaume Tucker <guillaume.tucker@collabora.com> CC: Matthew Wilcox <willy@infradead.org> CC: Linux-MM <linux-mm@kvack.org> Link: https://lore.kernel.org/r/20231013064827.61135-2-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
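A conceptual sketch of the trick, with a hypothetical helper name for illustration:

    /* a page-aligned DMA address has PAGE_SHIFT zero bits at the bottom;
     * storing it right-shifted lets a 32-bit field cover 32b+12b = 16TB.
     * Return true on overflow so the caller can WARN and fall back. */
    static bool pp_store_dma_addr(struct page *page, dma_addr_t addr)
    {
        unsigned long shifted = (unsigned long)(addr >> PAGE_SHIFT);

        page->dma_addr = shifted;
        return ((dma_addr_t)shifted << PAGE_SHIFT) != addr;
    }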
2023-10-17  cgroup, netclassid: on modifying netclassid in cgroup, only consider the main process  [Liansen Zhai, 1 file, -0/+6]
When modifying the netclassid, the command ("echo 0x100001 > net_cls.classid") can take a long time on a process with many threads, because the process creates many fds. For example, for one process with 28000 fds and 60000 threads, the echo command takes about 45 seconds. Now, only consider the main process when executing "iterate_fd", and the time is about 52 milliseconds. Signed-off-by: Liansen Zhai <zhailiansen@kuaishou.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231012090330.29636-1-zhailiansen@kuaishou.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
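An illustrative sketch of the iteration change (not the exact patch; update_classid_sock and cs are names from net/core/netclassid_cgroup.c):

    css_task_iter_start(css, 0, &it);
    while ((p = css_task_iter_next(&it))) {
        /* only the main process: its threads typically share the
         * files_struct, so iterating fds once per thread is overhead */
        if (!thread_group_leader(p))
            continue;
        task_lock(p);
        iterate_fd(p->files, 0, update_classid_sock,
                   (void *)(unsigned long)cs->classid);
        task_unlock(p);
    }
    css_task_iter_end(&it);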
2023-10-17  tcp: Set pingpong threshold via sysctl  [Haiyang Zhang, 3 files, -2/+12]
TCP pingpong threshold is 1 by default. But some applications, like SQL DBs, may prefer a higher pingpong threshold to activate delayed acks in quick ack mode for better performance. The pingpong threshold and related code were changed to 3 in the year 2019 in: commit 4a41f453bedf ("tcp: change pingpong threshold to 3") And reverted to 1 in the year 2022 in: commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"") There is no single value that fits all applications. Add the net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for optimal performance based on the application's needs. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/1697056244-21888-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
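Usage then looks like this (the value 3 mirrors the threshold from the reverted 2019 change; pick what fits the workload):

  # per-netns runtime tuning
  sysctl -w net.ipv4.tcp_pingpong_thresh=3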
2023-10-16  net, sched: Add tcf_set_drop_reason for {__,}tcf_classify  [Daniel Borkmann, 1 file, -6/+20]
Add an initial user for the newly added tcf_set_drop_reason() helper to set the drop reason for internal errors leading to TC_ACT_SHOT inside {__,}tcf_classify(). Right now this only adds a very basic SKB_DROP_REASON_TC_ERROR as a generic fallback indicator to mark drop locations. Where needed, such locations can be converted to more specific codes, for example, when hitting the reclassification limit, etc. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Victor Nogueira <victor@mojatatu.com> Link: https://lore.kernel.org/r/20231009092655.22025-2-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16  net, sched: Make tc-related drop reason more flexible  [Daniel Borkmann, 1 file, -5/+10]
Currently, the kfree_skb_reason() in sch_handle_{ingress,egress}() can only express a basic SKB_DROP_REASON_TC_INGRESS or SKB_DROP_REASON_TC_EGRESS reason. Victor kicked off an initial proposal to make this more flexible by disambiguating verdict from return code by moving the verdict into struct tcf_result and letting tcf_classify() return a negative error. If hit, two new drop reasons were added in the proposal, namely SKB_DROP_REASON_TC_INGRESS_ERROR and SKB_DROP_REASON_TC_EGRESS_ERROR. Further analysis of the actual error codes would have required attaching to tcf_classify via kprobe/kretprobe to more deeply debug the skb and the returned error.

In order to make the kfree_skb_reason() in sch_handle_{ingress,egress}() more extensible, it can be addressed in a more straightforward way: instead of placing the verdict into struct tcf_result, we can just put the drop reason in there, which does not require changes throughout the various classful schedulers given the existing verdict logic can stay as is. Then, SKB_DROP_REASON_TC_ERROR{,_*} can be added to the enum skb_drop_reason to disambiguate between an error and an intentional drop. New drop reason error codes can be added successively to the tc code base. For internal error locations which have not yet been annotated with a SKB_DROP_REASON_TC_ERROR{,_*}, the fallback is SKB_DROP_REASON_TC_INGRESS and SKB_DROP_REASON_TC_EGRESS, respectively. Generic errors could be marked with a SKB_DROP_REASON_TC_ERROR code until they are converted to more specific ones if it is found that they would be useful for troubleshooting.

While drop reasons have infrastructure for subsystem-specific error codes which are currently used by mac80211 and ovs, Jakub mentioned that it is preferred for tc to use the enum skb_drop_reason core codes given it is a better fit and currently the tooling support is better, too. With regards to the latter:

  [...] I think Alastair (bpftrace) is working on auto-prettifying enums when bpftrace outputs maps. So we can do something like:

  $ bpftrace -e 'tracepoint:skb:kfree_skb { @[args->reason] = count(); }'
  Attaching 1 probe...
  ^C
  @[SKB_DROP_REASON_TC_INGRESS]: 2
  @[SKB_CONSUMED]: 34

  ^^^^^^^^^^^^ names!! Auto-magically. [...]

Add a small helper tcf_set_drop_reason() which can be used to set the drop reason into the tcf_result. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Victor Nogueira <victor@mojatatu.com> Link: https://lore.kernel.org/netdev/20231006063233.74345d36@kernel.org Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20231009092655.22025-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
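A hedged sketch of the helper's shape as described (the field name in struct tcf_result is an assumption):

    static inline void tcf_set_drop_reason(struct tcf_result *res,
                                           enum skb_drop_reason reason)
    {
        /* stash the reason alongside the classification result so
         * sch_handle_{ingress,egress}() can pass it to kfree_skb_reason()
         * without changing the verdict logic */
        res->drop_reason = reason;
    }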
2023-10-16  ipv4: use tunnel flow flags for tunnel route lookups  [Beniamino Galvani, 1 file, -0/+1]
Commit 451ef36bd229 ("ip_tunnels: Add new flow flags field to ip_tunnel_key") added a new field to struct ip_tunnel_key to control route lookups. Currently the flag is used by vxlan and geneve tunnels; use it also in udp_tunnel_dst_lookup() so that it affects all tunnel types relying on this function. Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16  ipv4: add new arguments to udp_tunnel_dst_lookup()  [Beniamino Galvani, 1 file, -13/+13]
We want to make the function more generic so that it can be used by other UDP tunnel implementations such as geneve and vxlan. To do that, add the following arguments:

 - source and destination UDP port;
 - ifindex of the output interface, needed by vxlan;
 - the tos, because in some cases it is not taken from struct ip_tunnel_info (for example, when it's inherited from the inner packet);
 - the dst cache, because not all tunnel types (e.g. vxlan) want to use the one from struct ip_tunnel_info.

With these parameters, the function no longer needs the full struct ip_tunnel_info as argument and we can pass only the relevant part of it (struct ip_tunnel_key). A rough sketch of the resulting signature follows below. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
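The resulting prototype, roughly (argument order and names here are an assumption pieced together from the description, not copied from the patch):

    struct rtable *udp_tunnel_dst_lookup(struct sk_buff *skb,
                                         struct net_device *dev,
                                         struct net *net, int oif,
                                         __be32 *saddr,
                                         const struct ip_tunnel_key *key,
                                         __be16 sport, __be16 dport, u8 tos,
                                         struct dst_cache *dst_cache);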
2023-10-16  ipv4: remove "proto" argument from udp_tunnel_dst_lookup()  [Beniamino Galvani, 1 file, -2/+2]
The function is now UDP-specific, the protocol is always IPPROTO_UDP. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16  ipv4: rename and move ip_route_output_tunnel()  [Beniamino Galvani, 2 files, -48/+48]
At the moment ip_route_output_tunnel() is used only by bareudp. Ideally, other UDP tunnel implementations should use it, but to do so the function needs to accept new parameters that are specific for UDP tunnels, such as the ports. Prepare for these changes by renaming the function to udp_tunnel_dst_lookup() and move it to file net/ipv4/udp_tunnel_core.c. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15  vsock: enable setting SO_ZEROCOPY  [Arseniy Krasnov, 1 file, -2/+43]
For AF_VSOCK, zerocopy tx mode depends on the transport, so this option must be set in the AF_VSOCK implementation where the transport is accessible. If the transport is not yet set when SO_ZEROCOPY is configured (for example, the socket is not connected), SO_ZEROCOPY is enabled anyway, and support for this type of transmission is checked once a transport is assigned. To handle SO_ZEROCOPY, the AF_VSOCK implementation uses the SOCK_CUSTOM_SOCKOPT bit, thus handling SOL_SOCKET option operations, but all of them except SO_ZEROCOPY are forwarded to the generic handler by calling 'sock_setsockopt()'. Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
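From userspace the flow then looks like this hedged sketch:

    #include <sys/socket.h>

    static int vsock_enable_zerocopy(int fd)
    {
        int one = 1;

        /* may be accepted before connect(); actual transport support
         * is (re)checked once a transport is assigned */
        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
            return -1;
        /* later: send(fd, buf, len, MSG_ZEROCOPY); */
        return 0;
    }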
2023-10-15  vsock/loopback: support MSG_ZEROCOPY for transport  [Arseniy Krasnov, 1 file, -0/+6]
Add 'msgzerocopy_allow()' callback for loopback transport. Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15  vsock/virtio: support MSG_ZEROCOPY for transport  [Arseniy Krasnov, 1 file, -0/+7]
Add 'msgzerocopy_allow()' callback for virtio transport. Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15  vsock: enable SOCK_SUPPORT_ZC bit  [Arseniy Krasnov, 1 file, -0/+6]
This bit is used by io_uring in case of zerocopy tx mode. io_uring code checks that the socket has this feature. This patch sets it in two places: 1) For the socket in the 'connect()' call. 2) For the new socket which is returned by the 'accept()' call. Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15  vsock: check for MSG_ZEROCOPY support on send  [Arseniy Krasnov, 1 file, -0/+6]
This feature entirely depends on the transport, so if the transport doesn't support it, return an error. Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15  vsock: read from socket's error queue  [Arseniy Krasnov, 1 file, -0/+6]
This adds handling of the MSG_ERRQUEUE input flag in the receive call. This flag is used to read the socket's error queue instead of the data queue. A possible scenario for error queue usage is receiving completions for transmissions with the MSG_ZEROCOPY flag. This patch also adds new defines: 'SOL_VSOCK' and 'VSOCK_RECVERR'. Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
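A hedged userspace sketch of draining one completion; SOL_VSOCK and VSOCK_RECVERR are the new defines mentioned above, and the overall flow follows the generic MSG_ZEROCOPY completion scheme:

    #include <string.h>
    #include <sys/socket.h>
    #include <linux/errqueue.h>

    static int vsock_read_completion(int fd)
    {
        char control[CMSG_SPACE(sizeof(struct sock_extended_err))] = {};
        struct msghdr msg = {
            .msg_control    = control,
            .msg_controllen = sizeof(control),
        };
        struct sock_extended_err serr;
        struct cmsghdr *cm;

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
            return -1;              /* nothing queued (EAGAIN) or error */
        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
            memcpy(&serr, CMSG_DATA(cm), sizeof(serr));
            /* serr.ee_info..ee_data: range of completed zerocopy sends */
        }
        return 0;
    }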
2023-10-15  vsock: set EPOLLERR on non-empty error queue  [Arseniy Krasnov, 1 file, -1/+1]
If the socket's error queue is not empty, EPOLLERR must be set. Otherwise, a reader of the error queue won't detect data in it using the EPOLLERR bit. Currently for AF_VSOCK this is relevant only with MSG_ZEROCOPY, as this feature is the only user of the socket's error queue. Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-14  appletalk: remove special handling code for ipddp  [Lukas Bulwahn, 1 file, -36/+0]
After commit 1dab47139e61 ("appletalk: remove ipddp driver") removed the config IPDDP, some minor code clean-up is possible in the appletalk network layer. Remove the code in the appletalk layer that is no longer needed now that the ipddp driver is gone. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231012063443.22368-1-lukas.bulwahn@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13  net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set  [Martin KaFai Lau, 1 file, -1/+1]
It was reported that there is a compiler warning on the unused variable "sin_addr_len" in af_inet.c when CONFIG_CGROUP_BPF is not set. This patch addresses it similarly to the ipv6 counterpart in inet6_getname(): "return sin_addr_len;" instead of "return sizeof(*sin);". Fixes: fefba7d1ae19 ("bpf: Propagate modified uaddrlen from cgroup sockaddr programs") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/bpf/20231013185702.3993710-1-martin.lau@linux.dev Closes: https://lore.kernel.org/bpf/20231013114007.2fb09691@canb.auug.org.au/
2023-10-13  net: fix IPSTATS_MIB_OUTFORWDATAGRAMS increment after fragment check  [Heng Guo, 2 files, -6/+4]
Reproduction environment: a network of three Linux VMs connected as below:

  VM1 <----> VM2 (latest kernel 6.5.0-rc7) <----> VM3
  VM1: eth0 ip 192.168.122.207, MTU 1800
  VM2: eth0 ip 192.168.122.208, eth1 ip 192.168.123.224, MTU 1500
  VM3: eth0 ip 192.168.123.240, MTU 1800

Reproduce: VM1 sends 1600 bytes of UDP data to VM3 using scapy with flags='DF':

  send(IP(dst="192.168.123.240",flags='DF')/UDP()/str('0'*1600),count=1,inter=1.000000)

Result: before the IP data is sent:

  root@qemux86-64:~# cat /proc/net/snmp
  Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
  Ip: 1 64 6 0 2 2 0 0 2 4 0 0 0 0 0 0 0 0 0

After the IP data is sent:

  Ip: 1 64 7 0 2 2 0 0 2 5 0 0 0 0 0 0 0 1 0

ForwDatagrams stays at 2 without incrementing.

Issue description and patch: ip_exceeds_mtu() in ip_forward() drops this IP datagram because the skb len (1600, sent by scapy) exceeds the MTU (1500 on VM2) while "DF" is set. According to RFC 4293 "3.2.3. IP Statistics Tables" (whose flow diagram counts OutForwDatagrams ahead of OutFragReqds/OutFragFails), IPSTATS_MIB_OUTFORWDATAGRAMS should be counted before the fragment check. The existing implementation instead increases the counter after the fragment check: ip_exceeds_mtu() in ipv4 and ip6_pkt_too_big() in ipv6. So move the IPSTATS_MIB_OUTFORWDATAGRAMS counter to ip_forward() for ipv4 and ip6_forward() for ipv6.

Test result with patch, after the IP data is sent:

  Ip: 1 64 7 0 2 3 0 0 2 5 0 0 0 0 0 0 0 1 0

ForwDatagrams is updated from 2 to 3.

Reviewed-by: Filip Pudak <filip.pudak@windriver.com> Signed-off-by: Heng Guo <heng.guo@windriver.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231011015137.27262-1-heng.guo@windriver.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
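A hedged sketch of the resulting shape in ip_forward() (simplified from the real function):

    /* count the forwarded datagram first ... */
    IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);

    /* ... so a DF packet exceeding the MTU still shows up in
     * ForwDatagrams even though it is dropped right here */
    if (ip_exceeds_mtu(skb, mtu)) {
        IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
        icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
        goto drop;
    }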
2023-10-13  tls: use fixed size for tls_offload_context_{tx,rx}.driver_state  [Sabrina Dubroca, 1 file, -2/+2]
driver_state is a flex array, but is always allocated by the tls core to a fixed size (TLS_DRIVER_STATE_SIZE_{TX,RX}). Simplify the code by making that size explicit so that sizeof(struct tls_offload_context_{tx,rx}) works. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
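The change boils down to this hedged before/after (field attributes approximate):

    /* before: flex array; sizeof(struct tls_offload_context_tx) undefined */
    u8 driver_state[] __aligned(8);

    /* after: fixed size, since the core always allocated this much anyway */
    u8 driver_state[TLS_DRIVER_STATE_SIZE_TX] __aligned(8);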
2023-10-13  tls: validate crypto_info in a separate helper  [Sabrina Dubroca, 1 file, -24/+27]
Simplify do_tls_setsockopt_conf a bit. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-13  tls: remove tls_context argument from tls_set_device_offload  [Sabrina Dubroca, 3 files, -10/+10]
It's not really needed since we end up refetching it as tls_ctx. We can also remove the NULL check, since we have already dereferenced ctx in do_tls_setsockopt_conf. While at it, fix up the reverse xmas tree ordering. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-13  tls: remove tls_context argument from tls_set_sw_offload  [Sabrina Dubroca, 4 files, -14/+12]
It's not really needed since we end up refetching it as tls_ctx. We can also remove the NULL check, since we have already dereferenced ctx in do_tls_setsockopt_conf. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-13  tls: add a helper to allocate/initialize offload_ctx_tx  [Sabrina Dubroca, 1 file, -14/+25]
Simplify tls_set_device_offload a bit. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-13  tls: also use init_prot_info in tls_set_device_offload  [Sabrina Dubroca, 3 files, -14/+18]
Most values are shared. Nonce size turns out to be equal to IV size for all offloadable ciphers. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-13  tls: move tls_prot_info initialization out of tls_set_sw_offload  [Sabrina Dubroca, 1 file, -28/+34]
Simplify tls_set_sw_offload, and allow reuse for the tls_device code. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-13  tls: extract context alloc/initialization out of tls_set_sw_offload  [Sabrina Dubroca, 1 file, -35/+51]
Simplify tls_set_sw_offload a bit. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-13  tls: store iv directly within cipher_context  [Sabrina Dubroca, 3 files, -23/+5]
TLS_MAX_IV_SIZE + TLS_MAX_SALT_SIZE is 20B; we don't gain much in cipher_context's size and can simplify the init code a bit. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-13  tls: rename MAX_IV_SIZE to TLS_MAX_IV_SIZE  [Sabrina Dubroca, 4 files, -6/+6]
It's defined in include/net/tls.h, avoid using an overly generic name. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-13  tls: store rec_seq directly within cipher_context  [Sabrina Dubroca, 3 files, -21/+4]
TLS_MAX_REC_SEQ_SIZE is 8B; we don't get anything by using kmalloc. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>