summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2023-01-27net: add missing includes of linux/sched/clock.hJakub Kicinski2-0/+2
Number of files depend on linux/sched/clock.h getting included by linux/skbuff.h which soon will no longer be the case. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-27net: skbuff: drop the linux/textsearch.h includeJakub Kicinski1-0/+1
This include was added for skb_find_text() but all we need there is a forward declaration of struct ts_config. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-27xfrm: extend add state callback to set failure reasonLeon Romanovsky2-5/+3
Almost all validation logic is in the drivers, but they are missing reliable way to convey failure reason to userspace applications. Let's use extack to return this information to users. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-27xfrm: extend add policy callback to set failure reasonLeon Romanovsky1-2/+1
Almost all validation logic is in the drivers, but they are missing reliable way to convey failure reason to userspace applications. Let's use extack to return this information to users. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26mptcp: userspace pm: use a single point of exitMatthieu Baerts1-2/+3
Like in all other functions in this file, a single point of exit is used when extra operations are needed: unlock, decrement refcount, etc. There is no functional change for the moment but it is better to do the same here to make sure all cleanups are done in case of intermediate errors. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26mptcp: propagate sk_ipv6only to subflowsMatthieu Baerts1-0/+1
Usually, attributes are propagated to subflows as well. Here, if subflows are created by other ways than the MPTCP path-manager, it is important to make sure they are in v6 if it is asked by the userspace. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addressesPaolo Abeni1-27/+31
Currently the in-kernel PM arbitrary enforces that created subflow's family must match the main MPTCP socket while the RFC allows mixing IPv4 and IPv6 subflows. This patch changes the in-kernel PM logic to create subflows matching the currently selected source (or destination) address. IPv4 sockets can pick only IPv4 addresses (and v4 mapped in v6), while IPv6 sockets not restricted to V6ONLY can pick either IPv4 and IPv6 addresses as long as the source and destination matches. A helper, previously introduced is used to ease family matching checks, taking care of IPv4 vs IPv4-mapped-IPv6 vs IPv6 only addresses. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/269 Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26icmp: Add counters for rate limitsJamie Bainbridge4-3/+13
There are multiple ICMP rate limiting mechanisms: * Global limits: net.ipv4.icmp_msgs_burst/icmp_msgs_per_sec * v4 per-host limits: net.ipv4.icmp_ratelimit/ratemask * v6 per-host limits: net.ipv6.icmp_ratelimit/ratemask However, when ICMP output is limited, there is no way to tell which limit has been hit or even if the limits are responsible for the lack of ICMP output. Add counters for each of the cases above. As we are within local_bh_disable(), use the __INC stats variant. Example output: # nstat -sz "*RateLimit*" IcmpOutRateLimitGlobal 134 0.0 IcmpOutRateLimitHost 770 0.0 Icmp6OutRateLimitHost 84 0.0 Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com> Suggested-by: Abhishek Rawal <rawal.abhishek92@gmail.com> Link: https://lore.kernel.org/r/273b32241e6b7fdc5c609e6f5ebc68caf3994342.1674605770.git.jamie.bainbridge@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26inet: Add IP_LOCAL_PORT_RANGE socket optionJakub Sitnicki5-5/+44
Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 | PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26net: Kconfig: fix spellosRandy Dunlap2-2/+2
Fix spelling in net/ Kconfig files. (reported by codespell) Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Cc: coreteam@netfilter.org Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25net: ethtool: fix NULL pointer dereference in pause_prepare_data()Vladimir Oltean1-1/+1
In the following call path: ethnl_default_dumpit -> ethnl_default_dump_one -> ctx->ops->prepare_data -> pause_prepare_data struct genl_info *info will be passed as NULL, and pause_prepare_data() dereferences it while getting the extended ack pointer. To avoid that, just set the extack to NULL if "info" is NULL, since the netlink extack handling messages know how to deal with that. The pattern "info ? info->extack : NULL" is present in quite a few other "prepare_data" implementations, so it's clear that it's a more general problem to be dealt with at a higher level, but the code should have at least adhered to the current conventions to avoid the NULL dereference. Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reported-by: syzbot+9d44aae2720fc40b8474@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25net: ethtool: fix NULL pointer dereference in stats_prepare_data()Vladimir Oltean1-1/+1
In the following call path: ethnl_default_dumpit -> ethnl_default_dump_one -> ctx->ops->prepare_data -> stats_prepare_data struct genl_info *info will be passed as NULL, and stats_prepare_data() dereferences it while getting the extended ack pointer. To avoid that, just set the extack to NULL if "info" is NULL, since the netlink extack handling messages know how to deal with that. The pattern "info ? info->extack : NULL" is present in quite a few other "prepare_data" implementations, so it's clear that it's a more general problem to be dealt with at a higher level, but the code should have at least adhered to the current conventions to avoid the NULL dereference. Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25net/smc: De-tangle ism and smc device initializationStefan Raspl4-63/+39
The struct device for ISM devices was part of struct smcd_dev. Move to struct ism_dev, provide a new API call in struct smcd_ops, and convert existing SMCD code accordingly. Furthermore, remove struct smcd_dev from struct ism_dev. This is the final part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25s390/ism: Consolidate SMC-D-related codeStefan Raspl1-24/+39
The ism module had SMC-D-specific code sprinkled across the entire module. We are now consolidating the SMC-D-specific parts into the latter parts of the module, so it becomes more clear what code is intended for use with ISM, and which parts are glue code for usage in the context of SMC-D. This is the fourth part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25net/smc: Separate SMC-D and ISM APIsStefan Raspl3-6/+14
We separate the code implementing the struct smcd_ops API in the ISM device driver from the functions that may be used by other exploiters of ISM devices. Note: We start out small, and don't offer the whole breadth of the ISM device for public use, as many functions are specific to or likely only ever used in the context of SMC-D. This is the third part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25net/smc: Register SMC-D as ISM clientStefan Raspl4-26/+68
Register the smc module with the new ism device driver API. This is the second part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25net/ism: Add new API for client registrationStefan Raspl1-2/+2
Add a new API that allows other drivers to concurrently access ISM devices. To do so, we introduce a new API that allows other modules to register for ISM device usage. Furthermore, we move the GID to struct ism, where it belongs conceptually, and rename and relocate struct smcd_event to struct ism_event. This is the first part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25net/smc: Terminate connections prior to device removalStefan Raspl1-2/+2
Removing an ISM device prior to terminating its associated connections doesn't end well. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25devlink: remove a dubious assumption in fmsg dumpingJakub Kicinski1-1/+1
Build bot detects that err may be returned uninitialized in devlink_fmsg_prepare_skb(). This is not really true because all fmsgs users should create at least one outer nest, and therefore fmsg can't be completely empty. That said the assumption is not trivial to confirm, so let's follow the bots advice, anyway. This code does not seem to have changed since its inception in commit 1db64e8733f6 ("devlink: Add devlink formatted message (fmsg) API") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230124035231.787381-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25ipv6: Make ip6_route_output_flags_noref() static.Guillaume Nault1-4/+4
This function is only used in net/ipv6/route.c and has no reason to be visible outside of it. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/50706db7f675e40b3594d62011d9363dce32b92e.1674495822.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25netlink: fix spelling mistake in dump size assertJakub Kicinski2-2/+2
Commit 2c7bc10d0f7b ("netlink: add macro for checking dump ctx size") misspelled the name of the assert as asset, missing an R. Reported-by: Ido Schimmel <idosch@idosch.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20230123222224.732338-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24net: fou: use policy and operation tables generated from the specJakub Kicinski4-41/+81
Generate and plug in the spec-based tables. A little bit of renaming is needed in the FOU code. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24net: fou: rename the source for linkingJakub Kicinski2-0/+1
We'll need to link two objects together to form the fou module. This means the source can't be called fou, the build system expects fou.o to be the combined object. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24act_mirred: use the backlog for nested calls to mirred ingressDavide Caratti1-0/+7
William reports kernel soft-lockups on some OVS topologies when TC mirred egress->ingress action is hit by local TCP traffic [1]. The same can also be reproduced with SCTP (thanks Xin for verifying), when client and server reach themselves through mirred egress to ingress, and one of the two peers sends a "heartbeat" packet (from within a timer). Enqueueing to backlog proved to fix this soft lockup; however, as Cong noticed [2], we should preserve - when possible - the current mirred behavior that counts as "overlimits" any eventual packet drop subsequent to the mirred forwarding action [3]. A compromise solution might use the backlog only when tcf_mirred_act() has a nest level greater than one: change tcf_mirred_forward() accordingly. Also, add a kselftest that can reproduce the lockup and verifies TC mirred ability to account for further packet drops after TC mirred egress->ingress (when the nest level is 1). [1] https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.1663945716.git.dcaratti@redhat.com/ [2] https://lore.kernel.org/netdev/Y0w%2FWWY60gqrtGLp@pop-os.localdomain/ [3] such behavior is not guaranteed: for example, if RPS or skb RX timestamping is enabled on the mirred target device, the kernel can defer receiving the skb and return NET_RX_SUCCESS inside tcf_mirred_forward(). Reported-by: William Zhao <wizhao@redhat.com> CC: Xin Long <lucien.xin@gmail.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24net/sched: act_mirred: better wording on protection against excessive stack ↵Davide Caratti1-8/+8
growth with commit e2ca070f89ec ("net: sched: protect against stack overflow in TC act_mirred"), act_mirred protected itself against excessive stack growth using per_cpu counter of nested calls to tcf_mirred_act(), and capping it to MIRRED_RECURSION_LIMIT. However, such protection does not detect recursion/loops in case the packet is enqueued to the backlog (for example, when the mirred target device has RPS or skb timestamping enabled). Change the wording from "recursion" to "nesting" to make it more clear to readers. CC: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24net: dsa: microchip: enable port queues for tc mqprioArun Ramadoss1-0/+15
LAN937x family of switches has 8 queues per port where the KSZ switches has 4 queues per port. By default, only one queue per port is enabled. The queues are configurable in 2, 4 or 8. This patch add 8 number of queues for LAN937x and 4 for other switches. In the tag_ksz.c file, prioirty of the packet is queried using the skb buffer and the corresponding value is updated in the tag. Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24net: avoid irqsave in skb_defer_free_flushJesper Dangaard Brouer1-3/+2
The spin_lock irqsave/restore API variant in skb_defer_free_flush can be replaced with the faster spin_lock irq variant, which doesn't need to read and restore the CPU flags. Using the unconditional irq "disable/enable" API variant is safe, because the skb_defer_free_flush() function is only called during NAPI-RX processing in net_rx_action(), where it is known the IRQs are enabled. Expected gain is 14 cycles from avoiding reading and restoring CPU flags in a spin_lock_irqsave/restore operation, measured via a microbencmark kernel module[1] on CPU E5-1650 v4 @ 3.60GHz. Microbenchmark overhead of spin_lock+unlock: - spin_lock_unlock_irq cost: 34 cycles(tsc) 9.486 ns - spin_lock_unlock_irqsave cost: 48 cycles(tsc) 13.567 ns We don't expect to see a measurable packet performance gain, as skb_defer_free_flush() is called infrequently once per NIC device NAPI bulk cycle and conditionally only if SKBs have been deferred by other CPUs via skb_attempt_defer_free(). [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/r/167421646327.1321776.7390743166998776914.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24net: fix kfree_skb_list use of skb_mark_not_on_listJesper Dangaard Brouer1-2/+0
A bug was introduced by commit eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk"). It unconditionally unlinked the SKB list via invoking skb_mark_not_on_list(). In this patch we choose to remove the skb_mark_not_on_list() call as it isn't necessary. It would be possible and correct to call skb_mark_not_on_list() only when __kfree_skb_reason() returns true, meaning the SKB is ready to be free'ed, as it calls/check skb_unref(). This fix is needed as kfree_skb_list() is also invoked on skb_shared_info frag_list (skb_drop_fraglist() calling kfree_skb_list()). A frag_list can have SKBs with elevated refcnt due to cloning via skb_clone_fraglist(), which takes a reference on all SKBs in the list. This implies the invariant that all SKBs in the list must have the same refcnt, when using kfree_skb_list(). Reported-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com Fixes: eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/167421088417.1125894.9761158218878962159.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24Merge tag 'wireless-next-2023-01-23' of ↵Jakub Kicinski15-91/+140
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.3 First set of patches for v6.3. The most important change here is that the old Wireless Extension user space interface is not supported on Wi-Fi 7 devices at all. We also added a warning if anyone with modern drivers (ie. cfg80211 and mac80211 drivers) tries to use Wireless Extensions, everyone should switch to using nl80211 interface instead. Static WEP support is removed, there wasn't any driver using that anyway so there's no user impact. Otherwise it's smaller features and fixes as usual. Note: As mt76 had tricky conflicts due to the fixes in wireless tree, we decided to merge wireless into wireless-next to solve them easily. There should not be any merge problems anymore. Major changes: cfg80211 - remove never used static WEP support - warn if Wireless Extention interface is used with cfg80211/mac80211 drivers - stop supporting Wireless Extensions with Wi-Fi 7 devices - support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting rfkill - add GPIO DT support bitfield - add FIELD_PREP_CONST() mt76 - per-PHY LED support rtw89 - support new Bluetooth co-existance version rtl8xxxu - support RTL8188EU * tag 'wireless-next-2023-01-23' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (123 commits) wifi: wireless: deny wireless extensions on MLO-capable devices wifi: wireless: warn on most wireless extension usage wifi: mac80211: drop extra 'e' from ieeee80211... name wifi: cfg80211: Deduplicate certificate loading bitfield: add FIELD_PREP_CONST() wifi: mac80211: add kernel-doc for EHT structure mac80211: support minimal EHT rate reporting on RX wifi: mac80211: Add HE MU-MIMO related flags in ieee80211_bss_conf wifi: mac80211: Add VHT MU-MIMO related flags in ieee80211_bss_conf wifi: cfg80211: Use MLD address to indicate MLD STA disconnection wifi: cfg80211: Support 32 bytes KCK key in GTK rekey offload wifi: cfg80211: Fix extended KCK key length check in nl80211_set_rekey_data() wifi: cfg80211: remove support for static WEP wifi: rtl8xxxu: Dump the efuse only for untested devices wifi: rtl8xxxu: Print the ROM version too wifi: rtw88: Use non-atomic sta iterator in rtw_ra_mask_info_update() wifi: rtw88: Use rtw_iterate_vifs() for rtw_vif_watch_dog_iter() wifi: rtw88: Move register access from rtw_bf_assoc() outside the RCU wifi: rtl8xxxu: Use a longer retry limit of 48 wifi: rtl8xxxu: Report the RSSI to the firmware ... ==================== Link: https://lore.kernel.org/r/20230123103338.330CBC433EF@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-23ethtool: Add and use ethnl_update_bool.David S. Miller2-1/+27
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23net: dsa: add plumbing for changing and getting MAC merge layer stateVladimir Oltean1-0/+37
The DSA core is in charge of the ethtool_ops of the net devices associated with switch ports, so in case a hardware driver supports the MAC merge layer, DSA must pass the callbacks through to the driver. Add support for precisely that. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23net: ethtool: add helpers for aggregate statisticsVladimir Oltean1-0/+127
When a pMAC exists but the driver is unable to atomically query the aggregate eMAC+pMAC statistics, the user should be given back at least the sum of eMAC and pMAC counters queried separately. This is a generic problem, so add helpers in ethtool to do this operation, if the driver doesn't have a better way to report aggregate stats. Do this in a way that does not require changes to these functions when new stats are added (basically treat the structures as an array of u64 values, except for the first element which is the stats source). In include/linux/ethtool.h, there is already a section where helper function prototypes should be placed. The trouble is, this section is too early, before the definitions of struct ethtool_eth_mac_stats et.al. Move that section at the end and append these new helpers to it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)Vladimir Oltean5-3/+99
IEEE 802.3-2018 clause 99 defines a MAC Merge sublayer which contains an Express MAC and a Preemptible MAC. Both MACs are hidden to higher and lower layers and visible as a single MAC (packet classification to eMAC or pMAC on TX is done based on priority; classification on RX is done based on SFD). For devices which support a MAC Merge sublayer, it is desirable to retrieve individual packet counters from the eMAC and the pMAC, as well as aggregate statistics (their sum). Introduce a new ETHTOOL_A_STATS_SRC attribute which is part of the policy of ETHTOOL_MSG_STATS_GET and, and an ETHTOOL_A_PAUSE_STATS_SRC which is part of the policy of ETHTOOL_MSG_PAUSE_GET (accepted when ETHTOOL_FLAG_STATS is set in the common ethtool header). Both of these take values from enum ethtool_mac_stats_src, defaulting to "aggregate" in the absence of the attribute. Existing drivers do not need to pay attention to this enum which was added to all driver-facing structures, just the ones which report the MAC merge layer as supported. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23net: ethtool: add support for MAC Merge layerVladimir Oltean4-2/+280
The MAC merge sublayer (IEEE 802.3-2018 clause 99) is one of 2 specifications (the other being Frame Preemption; IEEE 802.1Q-2018 clause 6.7.2), which work together to minimize latency caused by frame interference at TX. The overall goal of TSN is for normal traffic and traffic with a bounded deadline to be able to cohabitate on the same L2 network and not bother each other too much. The standards achieve this (partly) by introducing the concept of preemptible traffic, i.e. Ethernet frames that have a custom value for the Start-of-Frame-Delimiter (SFD), and these frames can be fragmented and reassembled at L2 on a link-local basis. The non-preemptible frames are called express traffic, they are transmitted using a normal SFD, and they can preempt preemptible frames, therefore having lower latency, which can matter at lower (100 Mbps) link speeds, or at high MTUs (jumbo frames around 9K). Preemption is not recursive, i.e. a P frame cannot preempt another P frame. Preemption also does not depend upon priority, or otherwise said, an E frame with prio 0 will still preempt a P frame with prio 7. In terms of implementation, the standards talk about the presence of an express MAC (eMAC) which handles express traffic, and a preemptible MAC (pMAC) which handles preemptible traffic, and these MACs are multiplexed on the same MII by a MAC merge layer. To support frame preemption, the definition of the SFD was generalized to SMD (Start-of-mPacket-Delimiter), where an mPacket is essentially an Ethernet frame fragment, or a complete frame. Stations unaware of an SMD value different from the standard SFD will treat P frames as error frames. To prevent that from happening, a negotiation process is defined. On RX, packets are dispatched to the eMAC or pMAC after being filtered by their SMD. On TX, the eMAC/pMAC classification decision is taken by the 802.1Q spec, based on packet priority (each of the 8 user priority values may have an admin-status of preemptible or express). The MAC Merge layer and the Frame Preemption parameters have some degree of independence in terms of how software stacks are supposed to deal with them. The activation of the MM layer is supposed to be controlled by an LLDP daemon (after it has been communicated that the link partner also supports it), after which a (hardware-based or not) verification handshake takes place, before actually enabling the feature. So the process is intended to be relatively plug-and-play. Whereas FP settings are supposed to be coordinated across a network using something approximating NETCONF. The support contained here is exclusively for the 802.3 (MAC Merge) portions and not for the 802.1Q (Frame Preemption) parts. This API is sufficient for an LLDP daemon to do its job. The FP adminStatus variable from 802.1Q is outside the scope of an LLDP daemon. I have taken a few creative licenses and augmented the Linux kernel UAPI compared to the standard managed objects recommended by IEEE 802.3. These are: - ETHTOOL_A_MM_PMAC_ENABLED: According to Figure 99-6: Receive Processing state diagram, a MAC Merge layer is always supposed to be able to receive P frames. However, this implies keeping the pMAC powered on, which will consume needless power in applications where FP will never be used. If LLDP is used, the reception of an Additional Ethernet Capabilities TLV from the link partner is sufficient indication that the pMAC should be enabled. So my proposal is that in Linux, we keep the pMAC turned off by default and that user space turns it on when needed. - ETHTOOL_A_MM_VERIFY_ENABLED: The IEEE managed object is called aMACMergeVerifyDisableTx. I opted for consistency (positive logic) in the boolean netlink attributes offered, so this is also positive here. Other than the meaning being reversed, they correspond to the same thing. - ETHTOOL_A_MM_MAX_VERIFY_TIME: I found it most reasonable for a LLDP daemon to maximize the verifyTime variable (delay between SMD-V transmissions), to maximize its chances that the LP replies. IEEE says that the verifyTime can range between 1 and 128 ms, but the NXP ENETC stupidly keeps this variable in a 7 bit register, so the maximum supported value is 127 ms. I could have chosen to hardcode this in the LLDP daemon to a lower value, but why not let the kernel expose its supported range directly. - ETHTOOL_A_MM_TX_MIN_FRAG_SIZE: the standard managed object is called aMACMergeAddFragSize, and expresses the "additional" fragment size (on top of ETH_ZLEN), whereas this expresses the absolute value of the fragment size. - ETHTOOL_A_MM_RX_MIN_FRAG_SIZE: there doesn't appear to exist a managed object mandated by the standard, but user space clearly needs to know what is the minimum supported fragment size of our local receiver, since LLDP must advertise a value no lower than that. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23net/sock: Introduce trace_sk_data_ready()Peilin Ye19-0/+62
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready() callback implementations. For example: <...> iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable <...> Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski36-330/+396
drivers/net/ipa/ipa_interrupt.c drivers/net/ipa/ipa_interrupt.h 9ec9b2a30853 ("net: ipa: disable ipa interrupt during suspend") 8e461e1f092b ("net: ipa: introduce ipa_interrupt_enable()") d50ed3558719 ("net: ipa: enable IPA interrupt handlers separate from registration") https://lore.kernel.org/all/20230119114125.5182c7ab@canb.auug.org.au/ https://lore.kernel.org/all/79e46152-8043-a512-79d9-c3b905462774@tessares.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20tcp: fix rate_app_limited to default to 1David Morley1-0/+2
The initial default value of 0 for tp->rate_app_limited was incorrect, since a flow is indeed application-limited until it first sends data. Fixing the default to be 1 is generally correct but also specifically will help user-space applications avoid using the initial tcpi_delivery_rate value of 0 that persists until the connection has some non-zero bandwidth sample. Fixes: eb8329e0a04d ("tcp: export data delivery rate") Suggested-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David Morley <morleyd@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Tested-by: David Morley <morleyd@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-20net: dcb: add helper functions to retrieve PCP and DSCP rewrite mapsDaniel Machon1-0/+52
Add two new helper functions to retrieve a mapping of priority to PCP and DSCP bitmasks, where each bitmap contains ones in positions that match a rewrite entry. dcb_ieee_getrewr_prio_dscp_mask_map() reuses the dcb_ieee_app_prio_map, as this struct is already used for a similar mapping in the app table. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-20net: dcb: add new rewrite tableDaniel Machon1-1/+112
Add new rewrite table and all the required functions, offload hooks and bookkeeping for maintaining it. The rewrite table reuses the app struct, and the entire set of app selectors. As such, some bookeeping code can be shared between the rewrite- and the APP table. New functions for getting, setting and deleting entries has been added. Apart from operating on the rewrite list, these functions do not emit a DCB_APP_EVENT when the list os modified. The new dcb_getrewr does a lookup based on selector and priority and returns the protocol, so that mappings from priority to protocol, for a given selector and ifindex is obtained. Also, a new nested attribute has been added, that encapsulates one or more app structs. This attribute is used to distinguish the two tables. The dcb_lock used for the APP table is reused for the rewrite table. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-20net: dcb: add new common function for set/del of app/rewr entriesDaniel Machon1-55/+45
In preparation for DCB rewrite. Add a new function for setting and deleting both app and rewrite entries. Moving this into a separate function reduces duplicate code, as both type of entries requires the same set of checks. The function will now iterate through a configurable nested attribute (app or rewrite attr), validate each attribute and call the appropriate set- or delete function. Note that this function always checks for nla_len(attr_itr) < sizeof(struct dcb_app), which was only done in dcbnl_ieee_set and not in dcbnl_ieee_del prior to this patch. This means, that any userspace tool that used to shove in data < sizeof(struct dcb_app) would now receive -ERANGE. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-20net: dcb: modify dcb_app_add to take list_head ptr as parameterDaniel Machon1-4/+5
In preparation to DCB rewrite. Modify dcb_app_add to take new struct list_head * as parameter, to make the used list configurable. This is done to allow reusing the function for adding rewrite entries to the rewrite table, which is introduced in a later patch. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-20devlink: add instance lock assertion in devl_is_registered()Jiri Pirko1-3/+1
After region and linecard lock removals, this helper is always supposed to be called with instance lock held. So put the assertion here and remove the comment which is no longer accurate. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20devlink: remove devlink_dump_for_each_instance_get() helperJiri Pirko2-12/+4
devlink_dump_for_each_instance_get() is currently called from a single place in netlink.c. As there is no need to use this helper anywhere else in the future, remove it and call devlinks_xa_find_get() directly from while loop in devlink_nl_instance_iter_dump(). Also remove redundant idx clear on loop end as it is already done in devlink_nl_instance_iter_dump(). Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20devlink: convert reporters dump to devlink_nl_instance_iter_dump()Jiri Pirko3-49/+40
Benefit from recently introduced instance iteration and convert reporters .dumpit generic netlink callback to use it. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20devlink: convert linecards dump to devlink_nl_instance_iter_dump()Jiri Pirko3-36/+30
Benefit from recently introduced instance iteration and convert linecards .dumpit generic netlink callback to use it. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20devlink: remove reporter reference countingJiri Pirko1-83/+30
As long as the reporter life time is protected by devlink instance lock, the reference counting is no longer needed. Remove it. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20devlink: remove devl*_port_health_reporter_destroy()Jiri Pirko1-33/+2
Remove port-specific health reporter destroy function as it is currently the same as the instance one so no longer needed. Inline __devlink_health_reporter_destroy() as it is no longer called from multiple places. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20devlink: remove reporters_lockJiri Pirko3-47/+9
Similar to other devlink objects, rely on devlink instance lock and remove object specific reporters_lock. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20devlink: protect health reporter operation with instance lockJiri Pirko1-22/+77
Similar to other devlink objects, protect the reporters list by devlink instance lock. Alongside add unlocked versions of health reporter create/destroy functions and use them in drivers on call paths where the instance lock is held. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20devlink: remove linecard reference countingJiri Pirko3-18/+2
As long as the linecard life time is protected by devlink instance lock, the reference counting is no longer needed. Remove it. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>