summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-05-04tools: ynl: add --list-ops and --list-msgs to CLIJakub Kicinski2-0/+11
I often forget the exact naming of ops and have to look at the spec to find it. Add support for listing the operations: $ ./cli.py --spec .../netdev.yaml --list-ops dev-get [ do, dump ] page-pool-get [ do, dump ] page-pool-stats-get [ do, dump ] queue-get [ do, dump ] napi-get [ do, dump ] qstats-get [ dump ] For completeness also support listing all ops (including notifications: # ./cli.py --spec .../netdev.yaml --list-msgs dev-get [ dump, do ] dev-add-ntf [ notify ] dev-del-ntf [ notify ] dev-change-ntf [ notify ] page-pool-get [ dump, do ] page-pool-add-ntf [ notify ] page-pool-del-ntf [ notify ] page-pool-change-ntf [ notify ] page-pool-stats-get [ dump, do ] queue-get [ dump, do ] napi-get [ dump, do ] qstats-get [ dump ] Use double space after the name for slightly easier to read output. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20240502164043.2130184-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-04Merge branch 'rtnetlink-rtnl_stats_dump-changes'Jakub Kicinski1-38/+23
Eric Dumazet says: ==================== rtnetlink: rtnl_stats_dump() changes Getting rid of RTNL in rtnl_stats_dump() looks challenging. In the meantime, we can: 1) Avoid RTNL acquisition for the final NLMSG_DONE marker. 2) Use for_each_netdev_dump() instead of the net->dev_index_head[] hash table. ==================== Link: https://lore.kernel.org/r/20240502113748.1622637-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-04rtnetlink: use for_each_netdev_dump() in rtnl_stats_dump()Eric Dumazet1-37/+22
Switch rtnl_stats_dump() to use for_each_netdev_dump() instead of net->dev_index_head[] hash table. This makes the code much easier to read, and fixes scalability issues. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240502113748.1622637-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-04rtnetlink: change rtnl_stats_dump() return valueEric Dumazet1-1/+1
By returning 0 (or an error) instead of skb->len, we allow NLMSG_DONE to be appended to the current skb at the end of a dump, saving a couple of recvmsg() system calls. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240502113748.1622637-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03Merge branch 'net-sysctl-sentinel'David S. Miller47-119/+48
Joel Granados says: ==================== sysctl: Remove sentinel elements from networking What? These commits remove the sentinel element (last empty element) from the sysctl arrays of all the files under the "net/" directory that register a sysctl array. The merging of the preparation patches [4] to mainline allows us to just remove sentinel elements without changing behavior. This is safe because the sysctl registration code (register_sysctl() and friends) use the array size in addition to checking for a sentinel [1]. Why? By removing the sysctl sentinel elements we avoid kernel bloat as ctl_table arrays get moved out of kernel/sysctl.c into their own respective subsystems. This move was started long ago to avoid merge conflicts; the sentinel removal bit came after Mathew Wilcox suggested it to avoid bloating the kernel by one element as arrays moved out. This patchset will reduce the overall build time size of the kernel and run time memory bloat by about ~64 bytes per declared ctl_table array (more info here [5]). When are we done? There are 4 patchest (25 commits [2]) that are still outstanding to completely remove the sentinels: files under "net/" (this patchset), files under "kernel/" dir, misc dirs (files under mm/ security/ and others) and the final set that removes the unneeded check for ->procname == NULL. Testing: * Ran sysctl selftests (./tools/testing/selftests/sysctl/sysctl.sh) * Ran this through 0-day with no errors or warnings Savings in vmlinux: A total of 64 bytes per sentinel is saved after removal; I measured in x86_64 to give an idea of the aggregated savings. The actual savings will depend on individual kernel configuration. * bloat-o-meter - The "yesall" config saves 3976 bytes (bloat-o-meter output [6]) - A reduced config [3] saves 1263 bytes (bloat-o-meter output [7]) Savings in allocated memory: None in this set but will occur when the superfluous allocations are removed from proc_sysctl.c. I include it here for context. The estimated savings during boot for config [3] are 6272 bytes. See [8] for how to measure it. Comments/feedback greatly appreciated Changes in v6: - Rebased onto net-next/main. - Besides re-running my cocci scripts, I ran a new find script [9]. Found 0 hits in net/ - Moved "i" variable declaraction out of for() in sysctl_core_net_init - Removed forgotten sentinel in mpls_table - Removed CONFIG_AX25_DAMA_SLAVE guard from net/ax25/ax25_ds_timer.c. It is not needed because that file is compiled only when CONFIG_AX25_DAMA_SLAVE is set. - When traversing smc_table, stop on ARRAY_SIZE instead of ARRAY_SIZE-1. - Link to v5: https://lore.kernel.org/r/20240426-jag-sysctl_remset_net-v5-0-e3b12f6111a6@samsung.com Changes in v5: - Added net files with additional variable to my test .config so the typo can be caught next time. - Fixed typo tabel_size -> table_size - Link to v4: https://lore.kernel.org/r/20240425-jag-sysctl_remset_net-v4-0-9e82f985777d@samsung.com Changes in v4: - Keep reverse xmas tree order when introducing new variables - Use a table_size variable to keep the value of ARRAY_SIZE - Separated the original "networking: Remove the now superfluous sentinel elements from ctl_table arra" into smaller commits to ease review - Merged x.25 and ax.25 commits together. - Removed any SOB from the commits that were changed - Link to v3: https://lore.kernel.org/r/20240412-jag-sysctl_remset_net-v3-0-11187d13c211@samsung.com Changes in v3: - Reworkded ax.25 - Added a BUILD_BUG_ON for the ax.25 commit - Added a CONFIG_AX25_DAMA_SLAVE guard where needed - Link to v2: https://lore.kernel.org/r/20240328-jag-sysctl_remset_net-v2-0-52c9fad9a1af@samsung.com Changes in v2: - Rebased to v6.9-rc1 - Removed unneeded comment from sysctl_net_ax25.c - Link to v1: https://lore.kernel.org/r/20240314-jag-sysctl_remset_net-v1-0-aa26b44d29d9@samsung.com ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03ax.25: x.25: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados5-3/+7
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Avoid a buffer overflow when traversing the ctl_table by ensuring that AX25_MAX_VALUES is the same as the size of ax25_param_table. This is done with a BUILD_BUG_ON where ax25_param_table is defined and a CONFIG_AX25_DAMA_SLAVE guard in the unnamed enum definition as well as in the ax25_dev_device_up and ax25_ds_set_timer functions. The overflow happened when the sentinel was removed from ax25_param_table. The sentinel's data element was changed when CONFIG_AX25_DAMA_SLAVE was undefined. This had no adverse effects as it still stopped on the sentinel's null procname but needed to be addressed once the sentinel was removed. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03appletalk: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados1-1/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel from atalk_table ctl_table array. Acked-by: Kees Cook <keescook@chromium.org> # loadpin & yama Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03netfilter: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados7-21/+5
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) * Remove sentinel elements from ctl_table structs * Remove instances where an array element is zeroed out to make it look like a sentinel. This is not longer needed and is safe after commit c899710fe7f9 ("networking: Update to register_net_sysctl_sz") added the array size to the ctl_table registration * Remove the need for having __NF_SYSCTL_CT_LAST_SYSCTL as the sysctl array size is now in NF_SYSCTL_CT_LAST_SYSCTL * Remove extra element in ctl_table arrays declarations Acked-by: Kees Cook <keescook@chromium.org> # loadpin & yama Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03net: Remove ctl_table sentinel elements from several networking subsystemsJoel Granados10-26/+9
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) To avoid lots of small commits, this commit brings together network changes from (as they appear in MAINTAINERS) LLC, MPTCP, NETROM NETWORK LAYER, PHONET PROTOCOL, ROSE NETWORK LAYER, RXRPC SOCKETS, SCTP PROTOCOL, SHARED MEMORY COMMUNICATIONS (SMC), TIPC NETWORK LAYER and NETWORKING [IPSEC] * Remove sentinel element from ctl_table structs. * Replace empty array registration with the register_net_sysctl_sz call in llc_sysctl_init * Replace the for loop stop condition that tests for procname == NULL with one that depends on array size in sctp_sysctl_net_register * Remove instances where an array element is zeroed out to make it look like a sentinel in xfrm_sysctl_init. This is not longer needed and is safe after commit c899710fe7f9 ("networking: Update to register_net_sysctl_sz") added the array size to the ctl_table registration * Use a table_size variable to keep the value of ARRAY_SIZE Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03net: sunrpc: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados4-4/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) * Remove sentinel element from ctl_table structs. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03net: rds: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados3-3/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) * Remove sentinel element from ctl_table structs. Signed-off-by: Joel Granados <j.granados@samsung.com> Acked-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03net: ipv{6,4}: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados11-35/+13
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) * Remove sentinel element from ctl_table structs. * Remove the zeroing out of an array element (to make it look like a sentinel) in sysctl_route_net_init And ipv6_route_sysctl_init. This is not longer needed and is safe after commit c899710fe7f9 ("networking: Update to register_net_sysctl_sz") added the array size to the ctl_table registration. * Remove extra sentinel element in the declaration of devinet_vars. * Removed the "-1" in __devinet_sysctl_register, sysctl_route_net_init, ipv6_sysctl_net_init and ipv4_sysctl_init_net that adjusted for having an extra empty element when looping over ctl_table arrays * Replace the for loop stop condition in __addrconf_sysctl_register that tests for procname == NULL with one that depends on array size * Removing the unprivileged user check in ipv6_route_sysctl_init is safe as it is replaced by calling ipv6_route_sysctl_table_size; introduced in commit c899710fe7f9 ("networking: Update to register_net_sysctl_sz") * Use a table_size variable to keep the value of ARRAY_SIZE Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03net: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados6-26/+14
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) * Remove sentinel element from ctl_table structs. * Remove the zeroing out of an array element (to make it look like a sentinel) in neigh_sysctl_register and lowpan_frags_ns_sysctl_register This is not longer needed and is safe after commit c899710fe7f9 ("networking: Update to register_net_sysctl_sz") added the array size to the ctl_table registration. * Replace the for loop stop condition in sysctl_core_net_init that tests for procname == NULL with one that depends on array size * Removed the "-1" in mpls_net_init that adjusted for having an extra empty element when looping over ctl_table arrays * Use a table_size variable to keep the value of ARRAY_SIZE Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03net_sched: sch_sfq: annotate data-races around q->perturb_periodEric Dumazet1-4/+9
sfq_perturbation() reads q->perturb_period locklessly. Add annotations to fix potential issues. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240430180015.3111398-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03net: dsa: mv88e6xxx: Correct check for empty listSimon Horman1-2/+2
Since commit a3c53be55c95 ("net: dsa: mv88e6xxx: Support multiple MDIO busses") mv88e6xxx_default_mdio_bus() has checked that the return value of list_first_entry() is non-NULL. This appears to be intended to guard against the list chip->mdios being empty. However, it is not the correct check as the implementation of list_first_entry is not designed to return NULL for empty lists. Instead, use list_first_entry_or_null() which does return NULL if the list is empty. Flagged by Smatch. Compile tested only. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240430-mv88e6xx-list_empty-v3-1-c35c69d88d2e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03selftests/net: skip partial checksum packets in csum testWillem de Bruijn1-3/+15
Detect packets with ip_summed CHECKSUM_PARTIAL and skip these. These should not exist, as the test sends individual packets between two hosts. But if (HW) GRO is on, with randomized content sometimes subsequent packets can be coalesced. In this case the GSO packet checksum is converted to a pseudo checksum in anticipation of sending out as TSO/USO. So the field will not match the expected value. Do not count these as test errors. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240501193156.3627344-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03selftests: net: py: check process exit code in bkg() and background cmd()Jakub Kicinski1-2/+6
We're a bit too loose with error checking for background processes. cmd() completely ignores the fail argument passed to the constructor if background is True. Default to checking for errors if process is not terminated explicitly. Caller can override with True / False. For bkg() the processing step is called magically by __exit__ so record the value passed in the constructor. Reported-by: Willem de Bruijn <willemb@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240502025325.1924923-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03IB/hfi1: allocate dummy net_device dynamicallyBreno Leitao2-3/+8
Embedding net_device into structures prohibits the usage of flexible arrays in the net_device structure. For more details, see the discussion at [1]. Un-embed the net_device from struct hfi1_netdev_rx by converting it into a pointer. Then use the leverage alloc_netdev() to allocate the net_device object at hfi1_alloc_rx(). [1] https://lore.kernel.org/all/20240229225910.79e224cf@kernel.org/ Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20240430162213.746492-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03net/mlx5e: flower: check for unsupported control flagsAsbjørn Sloth Tønnesen1-6/+4
Use flow_rule_is_supp_control_flags() to reject filters with unsupported control flags. In case any unsupported control flags are masked, flow_rule_is_supp_control_flags() sets a NL extended error message, and we return -EOPNOTSUPP. Remove FLOW_DIS_FIRST_FRAG specific error message, and treat it as any other unsupported control flag. Only compile-tested. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240422152728.175677-1-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski306-1532/+2454
Cross-merge networking fixes after downstream PR. Conflicts: include/linux/filter.h kernel/bpf/core.c 66e13b615a0c ("bpf: verifier: prevent userspace memory access") d503a04f8bc0 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT") https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02Merge tag 'net-6.9-rc7' of ↵Linus Torvalds49-193/+378
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf. Relatively calm week, likely due to public holiday in most places. No known outstanding regressions. Current release - regressions: - rxrpc: fix wrong alignmask in __page_frag_alloc_align() - eth: e1000e: change usleep_range to udelay in PHY mdic access Previous releases - regressions: - gro: fix udp bad offset in socket lookup - bpf: fix incorrect runtime stat for arm64 - tipc: fix UAF in error path - netfs: fix a potential infinite loop in extract_user_to_sg() - eth: ice: ensure the copied buf is NUL terminated - eth: qeth: fix kernel panic after setting hsuid Previous releases - always broken: - bpf: - verifier: prevent userspace memory access - xdp: use flags field to disambiguate broadcast redirect - bridge: fix multicast-to-unicast with fraglist GSO - mptcp: ensure snd_nxt is properly initialized on connect - nsh: fix outer header access in nsh_gso_segment(). - eth: bcmgenet: fix racing registers access - eth: vxlan: fix stats counters. Misc: - a bunch of MAINTAINERS file updates" * tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits) MAINTAINERS: mark MYRICOM MYRI-10G as Orphan MAINTAINERS: remove Ariel Elior net: gro: add flush check in udp_gro_receive_segment net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb ipv4: Fix uninit-value access in __ip_make_skb() s390/qeth: Fix kernel panic after setting hsuid vxlan: Pull inner IP header in vxlan_rcv(). tipc: fix a possible memleak in tipc_buf_append tipc: fix UAF in error path rxrpc: Clients must accept conn from any address net: core: reject skb_copy(_expand) for fraglist GSO skbs net: bridge: fix multicast-to-unicast with fraglist GSO mptcp: ensure snd_nxt is properly initialized on connect e1000e: change usleep_range to udelay in PHY mdic access net: dsa: mv88e6xxx: Fix number of databases for 88E6141 / 88E6341 cxgb4: Properly lock TX queue for the selftest. rxrpc: Fix using alignmask being zero for __page_frag_alloc_align() vxlan: Add missing VNI filter counter update in arp_reduce(). vxlan: Fix racy device stats updates. net: qede: use return from qede_parse_actions() ...
2024-05-02Merge branch 'bnxt_en-updates-for-net-next'Jakub Kicinski6-61/+107
Michael Chan says: ==================== bnxt_en: Updates for net-next The first patch converts the sw_stats field in the completion ring structure to a pointer. This allows the group of completion rings using the same MSIX to share the same sw_stats structure. Prior to this, the correct completion ring must be used to count packets. The next four patches remove the RTNL lock when calling the RoCE driver for asynchronous stop and start during error recovery and firmware reset. The RTNL ilock is replaced with a private mutex used to synchronize RoCE register, unregister, stop, and start. The last patch adds VF PCI IDs for the 5760X chips. v2: Dropped patch #1 from v1. Will work with David to get that patch in separately. ==================== Link: https://lore.kernel.org/r/20240501003056.100607-1-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02bnxt_en: Add VF PCI ID for 5760X (P7) chipsAjit Khaparde2-1/+4
No driver logic changes are required to support the VFs, so just add the VF PCI ID. Reviewed-by: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com> Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240501003056.100607-7-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02bnxt_en: Optimize recovery path ULP locking in the driverKalesh AP3-26/+44
In the error recovery path (AER, firmware recovery, etc), the driver notifies the RoCE driver via ULP_STOP before the reset and via ULP_START after the reset, all under RTNL_LOCK. The RoCE driver can take a long time if there are a lot of QPs to destroy, so it is not ideal to hold the global RTNL lock. Rely on the new en_dev_lock mutex instead for ULP_STOP and ULP_START. For the most part, we move the ULP_STOP call before we take the RTNL lock and move the ULP_START after RTNL unlock. Note that SRIOV re-enablement must be done after ULP_START or RoCE on the VFs will not resume properly after reset. The one scenario in bnxt_hwrm_if_change() where the RTNL lock is already taken in the .ndo_open() context requires the ULP restart to be deferred to the bnxt_sp_task() workqueue. Reviewed-by: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com> Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240501003056.100607-6-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02bnxt_en: Add a mutex to synchronize ULP operationsKalesh AP2-1/+22
The current scheme relies heavily on the RTNL lock for all ULP operations between the L2 and the RoCE driver. Add a new en_dev_lock mutex so that the asynchronous ULP_STOP and ULP_START operations can be serialized with bnxt_register_dev() and bnxt_unregister_dev() calls without relying on the RTNL lock. The next patch will remove the RTNL lock from the ULP_STOP and ULP_START calls. Reviewed-by: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com> Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240501003056.100607-5-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02bnxt_en: Don't call ULP_STOP/ULP_START during L2 resetMichael Chan1-11/+2
There is no need to call ULP_STOP and ULP_START before and after the L2 reset in bnxt_reset_task(). This L2 reset is done after detecting TX timeout, RX ring errors, or VF config changes. The L2 reset does not affect RoCE since the firmware is not reset and the backing store is left alone. Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240501003056.100607-4-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02bnxt_en: Don't support offline self test when RoCE driver is loadedKalesh AP1-3/+8
Offline self test is a very disruptive operation for RoCE and requires all active QPs to be destroyed. With a large number of QPs, it can take a long time to destroy all the QPs and can timeout. Do not allow ethtool offline self test if the RoCE driver is registered on the device. Reviewed-by: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com> Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240501003056.100607-3-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02bnxt_en: share NQ ring sw_stats memory with subringsEdwin Peer3-19/+27
On P5_PLUS chips and later, the NQ rings have subrings for RX and TX completions respectively. These subrings are passed to the poll function instead of the base NQ, but each ring carries its own copy of the software ring statistics. For stats to be conveniently accessible in __bnxt_poll_work(), the statistics memory should either be shared between the NQ and its subrings or the subrings need to be included in the ethtool stats aggregation logic. This patch opts for the former, because it's more efficient and less confusing having the software statistics for a ring exist in a single place. Before this patch, the counter will not be displayed if the "wrong" cpr->sw_stats was used to increment a counter. Link: https://lore.kernel.org/netdev/CACKFLikEhVAJA+osD7UjQNotdGte+fth7zOy7yDdLkTyFk9Pyw@mail.gmail.com/ Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240501003056.100607-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02Merge branch '40GbE' of ↵Jakub Kicinski10-152/+211
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== i40e: cleanups & refactors Ivan Vecera says: This series do following: Patch 1 - Removes write-only flags field from i40e_veb structure and from i40e_veb_setup() parameters Patch 2 - Refactors parameter of i40e_notify_client_of_l2_param_changes() and i40e_notify_client_of_netdev_close() Patch 3 - Refactors parameter of i40e_detect_recover_hung() Patch 4 - Adds helper i40e_pf_get_main_vsi() to get main VSI and uses it in existing code Patch 5 - Consolidates checks whether given VSI is the main one Patch 6 - Adds helper i40e_pf_get_main_veb() to get main VEB and uses it in existing code Patch 7 - Adds helper i40e_vsi_reconfig_tc() to reconfigure TC for particular and uses it to replace existing open-coded pieces * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: i40e: Add and use helper to reconfigure TC for given VSI i40e: Add helper to access main VEB i40e: Consolidate checks whether given VSI is main i40e: Add helper to access main VSI i40e: Refactor argument of i40e_detect_recover_hung() i40e: Refactor argument of several client notification functions i40e: Remove flags field from i40e_veb ==================== Link: https://lore.kernel.org/r/20240430180639.1938515-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02net/sched: unregister lockdep keys in qdisc_create/qdisc_alloc error pathDavide Caratti2-0/+2
Naresh and Eric report several errors (corrupted elements in the dynamic key hash list), when running tdc.py or syzbot. The error path of qdisc_alloc() and qdisc_create() frees the qdisc memory, but it forgets to unregister the lockdep key, thus causing use-after-free like the following one: ================================================================== BUG: KASAN: slab-use-after-free in lockdep_register_key+0x5f2/0x700 Read of size 8 at addr ffff88811236f2a8 by task ip/7925 CPU: 26 PID: 7925 Comm: ip Kdump: loaded Not tainted 6.9.0-rc2+ #648 Hardware name: Supermicro SYS-6027R-72RF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0 07/26/2013 Call Trace: <TASK> dump_stack_lvl+0x7c/0xc0 print_report+0xc9/0x610 kasan_report+0x89/0xc0 lockdep_register_key+0x5f2/0x700 qdisc_alloc+0x21d/0xb60 qdisc_create_dflt+0x63/0x3c0 attach_one_default_qdisc.constprop.37+0x8e/0x170 dev_activate+0x4bd/0xc30 __dev_open+0x275/0x380 __dev_change_flags+0x3f1/0x570 dev_change_flags+0x7c/0x160 do_setlink+0x1ea1/0x34b0 __rtnl_newlink+0x8c9/0x1510 rtnl_newlink+0x61/0x90 rtnetlink_rcv_msg+0x2f0/0xbc0 netlink_rcv_skb+0x120/0x380 netlink_unicast+0x420/0x630 netlink_sendmsg+0x732/0xbc0 __sock_sendmsg+0x1ea/0x280 ____sys_sendmsg+0x5a9/0x990 ___sys_sendmsg+0xf1/0x180 __sys_sendmsg+0xd3/0x180 do_syscall_64+0x96/0x180 entry_SYSCALL_64_after_hwframe+0x71/0x79 RIP: 0033:0x7f9503f4fa07 Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10 RSP: 002b:00007fff6c729068 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 000000006630c681 RCX: 00007f9503f4fa07 RDX: 0000000000000000 RSI: 00007fff6c7290d0 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000078 R10: 000000000000009b R11: 0000000000000246 R12: 0000000000000001 R13: 00007fff6c729180 R14: 0000000000000000 R15: 000055bf67dd9040 </TASK> Allocated by task 7745: kasan_save_stack+0x1c/0x40 kasan_save_track+0x10/0x30 __kasan_kmalloc+0x7b/0x90 __kmalloc_node+0x1ff/0x460 qdisc_alloc+0xae/0xb60 qdisc_create+0xdd/0xfb0 tc_modify_qdisc+0x37e/0x1960 rtnetlink_rcv_msg+0x2f0/0xbc0 netlink_rcv_skb+0x120/0x380 netlink_unicast+0x420/0x630 netlink_sendmsg+0x732/0xbc0 __sock_sendmsg+0x1ea/0x280 ____sys_sendmsg+0x5a9/0x990 ___sys_sendmsg+0xf1/0x180 __sys_sendmsg+0xd3/0x180 do_syscall_64+0x96/0x180 entry_SYSCALL_64_after_hwframe+0x71/0x79 Freed by task 7745: kasan_save_stack+0x1c/0x40 kasan_save_track+0x10/0x30 kasan_save_free_info+0x36/0x60 __kasan_slab_free+0xfe/0x180 kfree+0x113/0x380 qdisc_create+0xafb/0xfb0 tc_modify_qdisc+0x37e/0x1960 rtnetlink_rcv_msg+0x2f0/0xbc0 netlink_rcv_skb+0x120/0x380 netlink_unicast+0x420/0x630 netlink_sendmsg+0x732/0xbc0 __sock_sendmsg+0x1ea/0x280 ____sys_sendmsg+0x5a9/0x990 ___sys_sendmsg+0xf1/0x180 __sys_sendmsg+0xd3/0x180 do_syscall_64+0x96/0x180 entry_SYSCALL_64_after_hwframe+0x71/0x79 Fix this ensuring that lockdep_unregister_key() is called before the qdisc struct is freed, also in the error path of qdisc_create() and qdisc_alloc(). Fixes: af0cb3fa3f9e ("net/sched: fix false lockdep warning on qdisc root lock") Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Closes: https://lore.kernel.org/netdev/20240429221706.1492418-1-naresh.kamboju@linaro.org/ Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/2aa1ca0c0a3aa0acc15925c666c777a4b5de553c.1714496886.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02MAINTAINERS: mark MYRICOM MYRI-10G as OrphanJakub Kicinski1-2/+1
Chris's email address bounces and lore hasn't seen an email from anyone with his name for almost a decade. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240430233532.1356982-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02MAINTAINERS: remove Ariel EliorJakub Kicinski1-3/+0
aelior@marvell.com bounces, we haven't seen Ariel on lore since March 2022. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20240430233305.1356105-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02Merge branch 'net-gro-add-flush-flush_id-checks-and-fix-wrong-offset-in-udp'Paolo Abeni9-5/+33
Richard Gobert says: ==================== net: gro: add flush/flush_id checks and fix wrong offset in udp This series fixes a bug in the complete phase of UDP in GRO, in which socket lookup fails due to using network_header when parsing encapsulated packets. The fix is to add network_offset and inner_network_offset to napi_gro_cb and use these offsets for socket lookup. In addition p->flush/flush_id should be checked in all UDP flows. The same logic from tcp_gro_receive is applied for all flows in udp_gro_receive_segment. This prevents packets with mismatching network headers (flush/flush_id turned on) from merging in UDP GRO. The original series includes a change to vxlan test which adds the local parameter to prevent similar future bugs. I plan to submit it separately to net-next. This series is part of a previously submitted series to net-next: https://lore.kernel.org/all/20240408141720.98832-1-richardbgobert@gmail.com/ v3 -> v4: - Store network offsets, and use them only in udp_gro_complete flows - Correct commit hash used in Fixes tag - v3: https://lore.kernel.org/netdev/20240424163045.123528-1-richardbgobert@gmail.com/ v2 -> v3: - Add network_offsets and fix udp bug in a single commit to make backporting easier - Write to inner_network_offset in {inet,ipv6}_gro_receive - Use network_offsets union in tcp[46]_gro_complete as well - v2: https://lore.kernel.org/netdev/20240419153542.121087-1-richardbgobert@gmail.com/ v1 -> v2: - Use network_offsets instead of p_poff param as suggested by Willem - Check flush before postpull, and for all UDP GRO flows - v1: https://lore.kernel.org/netdev/20240412152120.115067-1-richardbgobert@gmail.com/ ==================== Link: https://lore.kernel.org/r/20240430143555.126083-1-richardbgobert@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02net: gro: add flush check in udp_gro_receive_segmentRichard Gobert1-1/+11
GRO-GSO path is supposed to be transparent and as such L3 flush checks are relevant to all UDP flows merging in GRO. This patch uses the same logic and code from tcp_gro_receive, terminating merge if flush is non zero. Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.") Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02net: gro: fix udp bad offset in socket lookup by adding ↵Richard Gobert9-4/+22
{inner_}network_offset to napi_gro_cb Commits a602456 ("udp: Add GRO functions to UDP socket") and 57c67ff ("udp: additional GRO support") introduce incorrect usage of {ip,ipv6}_hdr in the complete phase of gro. The functions always return skb->network_header, which in the case of encapsulated packets at the gro complete phase, is always set to the innermost L3 of the packet. That means that calling {ip,ipv6}_hdr for skbs which completed the GRO receive phase (both in gro_list and *_gro_complete) when parsing an encapsulated packet's _outer_ L3/L4 may return an unexpected value. This incorrect usage leads to a bug in GRO's UDP socket lookup. udp{4,6}_lib_lookup_skb functions use ip_hdr/ipv6_hdr respectively. These *_hdr functions return network_header which will point to the innermost L3, resulting in the wrong offset being used in __udp{4,6}_lib_lookup with encapsulated packets. This patch adds network_offset and inner_network_offset to napi_gro_cb, and makes sure both are set correctly. To fix the issue, network_offsets union is used inside napi_gro_cb, in which both the outer and the inner network offsets are saved. Reproduction example: Endpoint configuration example (fou + local address bind) # ip fou add port 6666 ipproto 4 # ip link add name tun1 type ipip remote 2.2.2.1 local 2.2.2.2 encap fou encap-dport 5555 encap-sport 6666 mode ipip # ip link set tun1 up # ip a add 1.1.1.2/24 dev tun1 Netperf TCP_STREAM result on net-next before patch is applied: net-next main, GRO enabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 131072 16384 16384 5.28 2.37 net-next main, GRO disabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 131072 16384 16384 5.01 2745.06 patch applied, GRO enabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 131072 16384 16384 5.01 2877.38 Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket") Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02ipv4: Fix uninit-value access in __ip_make_skb()Shigeru Yoshida2-1/+4
KMSAN reported uninit-value access in __ip_make_skb() [1]. __ip_make_skb() tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL while __ip_make_skb() is running, the function will access icmphdr in the skb even if it is not included. This causes the issue reported by KMSAN. Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL on the socket. Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These are union in struct flowi4 and are implicitly initialized by flowi4_init_output(), but we should not rely on specific union layout. Initialize these explicitly in raw_sendmsg(). [1] BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481 __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481 ip_finish_skb include/net/ip.h:243 [inline] ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508 raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x274/0x3c0 net/socket.c:745 __sys_sendto+0x62c/0x7b0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x130/0x200 net/socket.c:2199 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Uninit was created at: slab_post_alloc_hook mm/slub.c:3804 [inline] slab_alloc_node mm/slub.c:3845 [inline] kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888 kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577 __alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668 alloc_skb include/linux/skbuff.h:1318 [inline] __ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128 ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365 raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x274/0x3c0 net/socket.c:745 __sys_sendto+0x62c/0x7b0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x130/0x200 net/socket.c:2199 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6d/0x75 CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb79b1 #25 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014 Fixes: 99e5acae193e ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02Merge remote branch 'xfrm: Introduce direction attribute for SA'Steffen Klassert13-9/+218
Antony Antony says: ==================== Inspired by the upcoming IP-TFS patch set, and confusions experienced in the past due to lack of direction attribute on SAs, add a new direction "dir" attribute. It aims to streamline the SA configuration process and enhance the clarity of existing SA attributes. This patch set introduces the 'dir' attribute to SA, aka xfrm_state, ('in' for input or 'out' for output). Alsp add validations of existing direction-specific SA attributes during configuration and in the data path lookup. This change would not affect any existing use case or way of configuring SA. You will notice improvements when the new 'dir' attribute is set. ==================== Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-05-02s390/qeth: Fix kernel panic after setting hsuidAlexandra Winter1-34/+27
Symptom: When the hsuid attribute is set for the first time on an IQD Layer3 device while the corresponding network interface is already UP, the kernel will try to execute a napi function pointer that is NULL. Example: --------------------------------------------------------------------------- [ 2057.572696] illegal operation: 0001 ilc:1 [#1] SMP [ 2057.572702] Modules linked in: af_iucv qeth_l3 zfcp scsi_transport_fc sunrpc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink ghash_s390 prng xts aes_s390 des_s390 de s_generic sha3_512_s390 sha3_256_s390 sha512_s390 vfio_ccw vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio ext4 mbcache jbd2 qeth_l2 bridge stp llc dasd_eckd_mod qeth dasd_mod qdio ccwgroup pkey zcrypt [ 2057.572739] CPU: 6 PID: 60182 Comm: stress_client Kdump: loaded Not tainted 4.18.0-541.el8.s390x #1 [ 2057.572742] Hardware name: IBM 3931 A01 704 (LPAR) [ 2057.572744] Krnl PSW : 0704f00180000000 0000000000000002 (0x2) [ 2057.572748] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3 [ 2057.572751] Krnl GPRS: 0000000000000004 0000000000000000 00000000a3b008d8 0000000000000000 [ 2057.572754] 00000000a3b008d8 cb923a29c779abc5 0000000000000000 00000000814cfd80 [ 2057.572756] 000000000000012c 0000000000000000 00000000a3b008d8 00000000a3b008d8 [ 2057.572758] 00000000bab6d500 00000000814cfd80 0000000091317e46 00000000814cfc68 [ 2057.572762] Krnl Code:#0000000000000000: 0000 illegal >0000000000000002: 0000 illegal 0000000000000004: 0000 illegal 0000000000000006: 0000 illegal 0000000000000008: 0000 illegal 000000000000000a: 0000 illegal 000000000000000c: 0000 illegal 000000000000000e: 0000 illegal [ 2057.572800] Call Trace: [ 2057.572801] ([<00000000ec639700>] 0xec639700) [ 2057.572803] [<00000000913183e2>] net_rx_action+0x2ba/0x398 [ 2057.572809] [<0000000091515f76>] __do_softirq+0x11e/0x3a0 [ 2057.572813] [<0000000090ce160c>] do_softirq_own_stack+0x3c/0x58 [ 2057.572817] ([<0000000090d2cbd6>] do_softirq.part.1+0x56/0x60) [ 2057.572822] [<0000000090d2cc60>] __local_bh_enable_ip+0x80/0x98 [ 2057.572825] [<0000000091314706>] __dev_queue_xmit+0x2be/0xd70 [ 2057.572827] [<000003ff803dd6d6>] afiucv_hs_send+0x24e/0x300 [af_iucv] [ 2057.572830] [<000003ff803dd88a>] iucv_send_ctrl+0x102/0x138 [af_iucv] [ 2057.572833] [<000003ff803de72a>] iucv_sock_connect+0x37a/0x468 [af_iucv] [ 2057.572835] [<00000000912e7e90>] __sys_connect+0xa0/0xd8 [ 2057.572839] [<00000000912e9580>] sys_socketcall+0x228/0x348 [ 2057.572841] [<0000000091514e1a>] system_call+0x2a6/0x2c8 [ 2057.572843] Last Breaking-Event-Address: [ 2057.572844] [<0000000091317e44>] __napi_poll+0x4c/0x1d8 [ 2057.572846] [ 2057.572847] Kernel panic - not syncing: Fatal exception in interrupt ------------------------------------------------------------------------------------------- Analysis: There is one napi structure per out_q: card->qdio.out_qs[i].napi The napi.poll functions are set during qeth_open(). Since commit 1cfef80d4c2b ("s390/qeth: Don't call dev_close/dev_open (DOWN/UP)") qeth_set_offline()/qeth_set_online() no longer call dev_close()/ dev_open(). So if qeth_free_qdio_queues() cleared card->qdio.out_qs[i].napi.poll while the network interface was UP and the card was offline, they are not set again. Reproduction: chzdev -e $devno layer2=0 ip link set dev $network_interface up echo 0 > /sys/bus/ccwgroup/devices/0.0.$devno/online echo foo > /sys/bus/ccwgroup/devices/0.0.$devno/hsuid echo 1 > /sys/bus/ccwgroup/devices/0.0.$devno/online -> Crash (can be enforced e.g. by af_iucv connect(), ip link down/up, ...) Note that a Completion Queue (CQ) is only enabled or disabled, when hsuid is set for the first time or when it is removed. Workarounds: - Set hsuid before setting the device online for the first time or - Use chzdev -d $devno; chzdev $devno hsuid=xxx; chzdev -e $devno; to set hsuid on an existing device. (this will remove and recreate the network interface) Fix: There is no need to free the output queues when a completion queue is added or removed. card->qdio.state now indicates whether the inbound buffer pool and the outbound queues are allocated. card->qdio.c_q indicates whether a CQ is allocated. Fixes: 1cfef80d4c2b ("s390/qeth: Don't call dev_close/dev_open (DOWN/UP)") Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240430091004.2265683-1-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02vxlan: Pull inner IP header in vxlan_rcv().Guillaume Nault1-1/+18
Ensure the inner IP header is part of skb's linear data before reading its ECN bits. Otherwise we might read garbage. One symptom is the system erroneously logging errors like "vxlan: non-ECT from xxx.xxx.xxx.xxx with TOS=xxxx". Similar bugs have been fixed in geneve, ip_tunnel and ip6_tunnel (see commit 1ca1ba465e55 ("geneve: make sure to pull inner header in geneve_rx()") for example). So let's reuse the same code structure for consistency. Maybe we'll can add a common helper in the future. Fixes: d342894c5d2f ("vxlan: virtual extensible lan") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/1239c8db54efec341dd6455c77e0380f58923a3c.1714495737.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02Merge branch 'net-dsa-adjust_link-removal'Jakub Kicinski3-141/+12
Florian Fainelli says: ==================== net: dsa: adjust_link removal Now that the last in-tree driver (b53) has been converted to PHYLINK, we can get rid of all of code that catered to working with drivers implementing only PHYLIB's adjust_link callback. ==================== Link: https://lore.kernel.org/r/20240430164816.2400606-1-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02net: dsa: Remove adjust_link pathsFlorian Fainelli3-137/+12
Now that we no longer any drivers using PHYLIB's adjust_link callback, remove all paths that made use of adjust_link as well as the associated functions. Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20240430164816.2400606-3-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02net: dsa: Remove fixed_link_update memberFlorian Fainelli1-4/+0
We have not had a switch driver use a fixed_link_update callback since 58d56fcc3964f9be0a9ca42fd126bcd9dc7afc90 ("net: dsa: bcm_sf2: Get rid of PHYLIB functions") remove this callback. Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20240430164816.2400606-2-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02tipc: fix a possible memleak in tipc_buf_appendXin Long1-1/+1
__skb_linearize() doesn't free the skb when it fails, so move '*buf = NULL' after __skb_linearize(), so that the skb can be freed on the err path. Fixes: b7df21cf1b79 ("tipc: skb_linearize the head skb when reassembling msgs") Reported-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Link: https://lore.kernel.org/r/90710748c29a1521efac4f75ea01b3b7e61414cf.1714485818.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02tipc: fix UAF in error pathPaolo Abeni1-1/+5
Sam Page (sam4k) working with Trend Micro Zero Day Initiative reported a UAF in the tipc_buf_append() error path: BUG: KASAN: slab-use-after-free in kfree_skb_list_reason+0x47e/0x4c0 linux/net/core/skbuff.c:1183 Read of size 8 at addr ffff88804d2a7c80 by task poc/8034 CPU: 1 PID: 8034 Comm: poc Not tainted 6.8.2 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014 Call Trace: <IRQ> __dump_stack linux/lib/dump_stack.c:88 dump_stack_lvl+0xd9/0x1b0 linux/lib/dump_stack.c:106 print_address_description linux/mm/kasan/report.c:377 print_report+0xc4/0x620 linux/mm/kasan/report.c:488 kasan_report+0xda/0x110 linux/mm/kasan/report.c:601 kfree_skb_list_reason+0x47e/0x4c0 linux/net/core/skbuff.c:1183 skb_release_data+0x5af/0x880 linux/net/core/skbuff.c:1026 skb_release_all linux/net/core/skbuff.c:1094 __kfree_skb linux/net/core/skbuff.c:1108 kfree_skb_reason+0x12d/0x210 linux/net/core/skbuff.c:1144 kfree_skb linux/./include/linux/skbuff.h:1244 tipc_buf_append+0x425/0xb50 linux/net/tipc/msg.c:186 tipc_link_input+0x224/0x7c0 linux/net/tipc/link.c:1324 tipc_link_rcv+0x76e/0x2d70 linux/net/tipc/link.c:1824 tipc_rcv+0x45f/0x10f0 linux/net/tipc/node.c:2159 tipc_udp_recv+0x73b/0x8f0 linux/net/tipc/udp_media.c:390 udp_queue_rcv_one_skb+0xad2/0x1850 linux/net/ipv4/udp.c:2108 udp_queue_rcv_skb+0x131/0xb00 linux/net/ipv4/udp.c:2186 udp_unicast_rcv_skb+0x165/0x3b0 linux/net/ipv4/udp.c:2346 __udp4_lib_rcv+0x2594/0x3400 linux/net/ipv4/udp.c:2422 ip_protocol_deliver_rcu+0x30c/0x4e0 linux/net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x2e4/0x520 linux/net/ipv4/ip_input.c:233 NF_HOOK linux/./include/linux/netfilter.h:314 NF_HOOK linux/./include/linux/netfilter.h:308 ip_local_deliver+0x18e/0x1f0 linux/net/ipv4/ip_input.c:254 dst_input linux/./include/net/dst.h:461 ip_rcv_finish linux/net/ipv4/ip_input.c:449 NF_HOOK linux/./include/linux/netfilter.h:314 NF_HOOK linux/./include/linux/netfilter.h:308 ip_rcv+0x2c5/0x5d0 linux/net/ipv4/ip_input.c:569 __netif_receive_skb_one_core+0x199/0x1e0 linux/net/core/dev.c:5534 __netif_receive_skb+0x1f/0x1c0 linux/net/core/dev.c:5648 process_backlog+0x101/0x6b0 linux/net/core/dev.c:5976 __napi_poll.constprop.0+0xba/0x550 linux/net/core/dev.c:6576 napi_poll linux/net/core/dev.c:6645 net_rx_action+0x95a/0xe90 linux/net/core/dev.c:6781 __do_softirq+0x21f/0x8e7 linux/kernel/softirq.c:553 do_softirq linux/kernel/softirq.c:454 do_softirq+0xb2/0xf0 linux/kernel/softirq.c:441 </IRQ> <TASK> __local_bh_enable_ip+0x100/0x120 linux/kernel/softirq.c:381 local_bh_enable linux/./include/linux/bottom_half.h:33 rcu_read_unlock_bh linux/./include/linux/rcupdate.h:851 __dev_queue_xmit+0x871/0x3ee0 linux/net/core/dev.c:4378 dev_queue_xmit linux/./include/linux/netdevice.h:3169 neigh_hh_output linux/./include/net/neighbour.h:526 neigh_output linux/./include/net/neighbour.h:540 ip_finish_output2+0x169f/0x2550 linux/net/ipv4/ip_output.c:235 __ip_finish_output linux/net/ipv4/ip_output.c:313 __ip_finish_output+0x49e/0x950 linux/net/ipv4/ip_output.c:295 ip_finish_output+0x31/0x310 linux/net/ipv4/ip_output.c:323 NF_HOOK_COND linux/./include/linux/netfilter.h:303 ip_output+0x13b/0x2a0 linux/net/ipv4/ip_output.c:433 dst_output linux/./include/net/dst.h:451 ip_local_out linux/net/ipv4/ip_output.c:129 ip_send_skb+0x3e5/0x560 linux/net/ipv4/ip_output.c:1492 udp_send_skb+0x73f/0x1530 linux/net/ipv4/udp.c:963 udp_sendmsg+0x1a36/0x2b40 linux/net/ipv4/udp.c:1250 inet_sendmsg+0x105/0x140 linux/net/ipv4/af_inet.c:850 sock_sendmsg_nosec linux/net/socket.c:730 __sock_sendmsg linux/net/socket.c:745 __sys_sendto+0x42c/0x4e0 linux/net/socket.c:2191 __do_sys_sendto linux/net/socket.c:2203 __se_sys_sendto linux/net/socket.c:2199 __x64_sys_sendto+0xe0/0x1c0 linux/net/socket.c:2199 do_syscall_x64 linux/arch/x86/entry/common.c:52 do_syscall_64+0xd8/0x270 linux/arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6f/0x77 linux/arch/x86/entry/entry_64.S:120 RIP: 0033:0x7f3434974f29 Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 37 8f 0d 00 f7 d8 64 89 01 48 RSP: 002b:00007fff9154f2b8 EFLAGS: 00000212 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3434974f29 RDX: 00000000000032c8 RSI: 00007fff9154f300 RDI: 0000000000000003 RBP: 00007fff915532e0 R08: 00007fff91553360 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000212 R12: 000055ed86d261d0 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> In the critical scenario, either the relevant skb is freed or its ownership is transferred into a frag_lists. In both cases, the cleanup code must not free it again: we need to clear the skb reference earlier. Fixes: 1149557d64c9 ("tipc: eliminate unnecessary linearization of incoming buffers") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-23852 Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/752f1ccf762223d109845365d07f55414058e5a3.1714484273.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02net: ti: icssg_prueth: Add SW TX / RX Coalescing based on hrtimersMD Danish Anwar4-8/+155
Add SW IRQ coalescing based on hrtimers for RX and TX data path for ICSSG driver, which can be enabled by ethtool commands: - RX coalescing ethtool -C eth1 rx-usecs 50 - TX coalescing can be enabled per TX queue - by default enables coalescing for TX0 ethtool -C eth1 tx-usecs 50 - configure TX0 ethtool -Q eth0 queue_mask 1 --coalesce tx-usecs 100 - configure TX1 ethtool -Q eth0 queue_mask 2 --coalesce tx-usecs 100 - configure TX0 and TX1 ethtool -Q eth0 queue_mask 3 --coalesce tx-usecs 100 --coalesce tx-usecs 100 Minimum value for both rx-usecs and tx-usecs is 20us. Compared to gro_flush_timeout and napi_defer_hard_irqs this patch allows to enable IRQ coalescing for RX path separately. Benchmarking numbers: =============================================================== | Method | Tput_TX | CPU_TX | Tput_RX | CPU_RX | | ============================================================== | Default Driver 943 Mbps 31% 517 Mbps 38% | | IRQ Coalescing (Patch) 943 Mbps 28% 518 Mbps 25% | =============================================================== Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240430120634.1558998-1-danishanwar@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02Merge branch 'arp-random-clean-up-and-rcu-conversion-for-ioctl-siocgarp'Jakub Kicinski3-84/+147
Kuniyuki Iwashima says: ==================== arp: Random clean up and RCU conversion for ioctl(SIOCGARP). arp_ioctl() holds rtnl_lock() regardless of cmd (SIOCDARP, SIOCSARP, and SIOCGARP) to get net_device by __dev_get_by_name() and copy dev->name safely. In the SIOCGARP path, arp_req_get() calls neigh_lookup(), which looks up a neighbour entry under RCU. This series cleans up ioctl() code a bit and extends the RCU section not to take rtnl_lock() and instead use dev_get_by_name_rcu() and netdev_copy_name() for SIOCGARP. v2: https://lore.kernel.org/netdev/20240425170002.68160-1-kuniyu@amazon.com/ v1: https://lore.kernel.org/netdev/20240422194755.4221-1-kuniyu@amazon.com/ ==================== Link: https://lore.kernel.org/r/20240430015813.71143-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02arp: Convert ioctl(SIOCGARP) to RCU.Kuniyuki Iwashima1-10/+18
ioctl(SIOCGARP) holds rtnl_lock() to get netdev by __dev_get_by_name() and copy dev->name safely and calls neigh_lookup() later, which looks up a neighbour entry under RCU. Let's replace __dev_get_by_name() with dev_get_by_name_rcu() and strscpy() with netdev_copy_name() to avoid locking rtnl_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02net: Protect dev->name by seqlock.Kuniyuki Iwashima2-4/+24
We will convert ioctl(SIOCGARP) to RCU, and then we need to copy dev->name which is currently protected by rtnl_lock(). This patch does the following: 1) Add seqlock netdev_rename_lock to protect dev->name 2) Add netdev_copy_name() that copies dev->name to buffer under netdev_rename_lock 3) Use netdev_copy_name() in netdev_get_name() and drop devnet_rename_sem Suggested-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/CANn89iJEWs7AYSJqGCUABeVqOCTkErponfZdT5kV-iD=-SajnQ@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02arp: Get dev after calling arp_req_(delete|set|get)().Kuniyuki Iwashima1-36/+50
arp_ioctl() holds rtnl_lock() first regardless of cmd (SIOCDARP, SIOCSARP, and SIOCGARP) to get net_device by __dev_get_by_name() and copy dev->name safely. In the SIOCGARP path, arp_req_get() calls neigh_lookup(), which looks up a neighbour entry under RCU. We will extend the RCU section not to take rtnl_lock() and instead use dev_get_by_name_rcu() for SIOCGARP. As a preparation, let's move __dev_get_by_name() into another function and call it from arp_req_delete(), arp_req_set(), and arp_req_get(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02arp: Remove a nest in arp_req_get().Kuniyuki Iwashima1-13/+18
This is a prep patch to make the following changes tidy. No functional change intended. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>