summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2022-01-06ethtool: use phydev variableTom Rix1-4/+4
In ethtool_get_phy_stats(), the phydev varaible is set to dev->phydev but dev->phydev is still used. Replace dev->phydev uses with phydev. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06gro: add ability to control gro max packet sizeCoco Li3-1/+28
Eric Dumazet suggested to allow users to modify max GRO packet size. We have seen GRO being disabled by users of appliances (such as wifi access points) because of claimed bufferbloat issues, or some work arounds in sch_cake, to split GRO/GSO packets. Instead of disabling GRO completely, one can chose to limit the maximum packet size of GRO packets, depending on their latency constraints. This patch adds a per device gro_max_size attribute that can be changed with ip link command. ip link set dev eth0 gro_max_size 16000 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: fix SOF_TIMESTAMPING_BIND_PHC to work with multiple socketsMiroslav Lichvar1-3/+6
When multiple sockets using the SOF_TIMESTAMPING_BIND_PHC flag received a packet with a hardware timestamp (e.g. multiple PTP instances in different PTP domains using the UDPv4/v6 multicast or L2 transport), the timestamps received on some sockets were corrupted due to repeated conversion of the same timestamp (by the same or different vclocks). Fix ptp_convert_timestamp() to not modify the shared skb timestamp and return the converted timestamp as a ktime_t instead. If the conversion fails, return 0 to not confuse the application with timestamps corresponding to an unexpected PHC. Fixes: d7c088265588 ("net: socket: support hardware timestamp conversion to PHC bound") Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Cc: Yangbo Lu <yangbo.lu@nxp.com> Cc: Richard Cochran <richardcochran@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: setup master before portsVladimir Oltean1-10/+13
It is said that as soon as a network interface is registered, all its resources should have already been prepared, so that it is available for sending and receiving traffic. One of the resources needed by a DSA slave interface is the master. dsa_tree_setup -> dsa_tree_setup_ports -> dsa_port_setup -> dsa_slave_create -> register_netdevice -> dsa_tree_setup_master -> dsa_master_setup -> sets up master->dsa_ptr, which enables reception Therefore, there is a short period of time after register_netdevice() during which the master isn't prepared to pass traffic to the DSA layer (master->dsa_ptr is checked by eth_type_trans). Same thing during unregistration, there is a time frame in which packets might be missed. Note that this change opens us to another race: dsa_master_find_slave() will get invoked potentially earlier than the slave creation, and later than the slave deletion. Since dp->slave starts off as a NULL pointer, the earlier calls aren't a problem, but the later calls are. To avoid use-after-free, we should zeroize dp->slave before calling dsa_slave_destroy(). In practice I cannot really test real life improvements brought by this change, since in my systems, netdevice creation races with PHY autoneg which takes a few seconds to complete, and that masks quite a few races. Effects might be noticeable in a setup with fixed links all the way to an external system. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: first set up shared ports, then non-shared portsVladimir Oltean1-13/+37
After commit a57d8c217aad ("net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports"), the port setup and teardown procedure became asymmetric. The fact of the matter is that user ports need the shared ports to be up before they can be used for CPU-initiated termination. And since we register net devices for the user ports, those won't be functional until we also call the setup for the shared (CPU, DSA) ports. But we may do that later, depending on the port numbering scheme of the hardware we are dealing with. It just makes sense that all shared ports are brought up before any user port is. I can't pinpoint any issue due to the current behavior, but let's change it nonetheless, for consistency's sake. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}Vladimir Oltean2-2/+10
DSA needs to simulate master tracking events when a binding is first with a DSA master established and torn down, in order to give drivers the simplifying guarantee that ->master_state_change calls are made only when the master's readiness state to pass traffic changes. master_state_change() provide a operational bool that DSA driver can use to understand if DSA master is operational or not. To avoid races, we need to block the reception of NETDEV_UP/NETDEV_CHANGE/NETDEV_GOING_DOWN events in the netdev notifier chain while we are changing the master's dev->dsa_ptr (this changes what netdev_uses_dsa(dev) reports). The dsa_master_setup() and dsa_master_teardown() functions optionally require the rtnl_mutex to be held, if the tagger needs the master to be promiscuous, these functions call dev_set_promiscuity(). Move the rtnl_lock() from that function and make it top-level. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: stop updating master MTU from master.cVladimir Oltean1-24/+1
At present there are two paths for changing the MTU of the DSA master. The first is: dsa_tree_setup -> dsa_tree_setup_ports -> dsa_port_setup -> dsa_slave_create -> dsa_slave_change_mtu -> dev_set_mtu(master) The second is: dsa_tree_setup -> dsa_tree_setup_master -> dsa_master_setup -> dev_set_mtu(dev) So the dev_set_mtu() call from dsa_master_setup() has been effectively superseded by the dsa_slave_change_mtu(slave_dev, ETH_DATA_LEN) that is done from dsa_slave_create() for each user port. The later function also updates the master MTU according to the largest user port MTU from the tree. Therefore, updating the master MTU through a separate code path isn't needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: merge rtnl_lock sections in dsa_slave_createVladimir Oltean1-3/+1
Currently dsa_slave_create() has two sequences of rtnl_lock/rtnl_unlock in a row. Remove the rtnl_unlock() and rtnl_lock() in between, such that the operation can execute slighly faster. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: reorder PHY initialization with MTU setup in slave.cVladimir Oltean1-7/+7
In dsa_slave_create() there are 2 sections that take rtnl_lock(): MTU change and netdev registration. They are separated by PHY initialization. There isn't any strict ordering requirement except for the fact that netdev registration should be last. Therefore, we can perform the MTU change a bit later, after the PHY setup. A future change will then be able to merge the two rtnl_lock sections into one. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge branch 'master' of ↵David S. Miller8-8/+88
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2022-01-06 1) Fix some clang_analyzer warnings about never read variables. From luo penghao. 2) Check for pols[0] only once in xfrm_expand_policies(). From Jean Sacren. 3) The SA curlft.use_time was updated only on SA cration time. Update whenever the SA is used. From Antony Antony 4) Add support for SM3 secure hash. From Xu Jia. 5) Add support for SM4 symmetric cipher algorithm. From Xu Jia. 6) Add a rate limit for SA mapping change messages. From Antony Antony. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski21-177/+269
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05netlink: do not allocate a device refcount tracker in ethnl_default_notify()Eric Dumazet1-1/+0
As reported by Johannes, the tracker allocated in ethnl_default_notify() is not really needed, as this function is not expected to change a device reference count. Fixes: e4b8954074f6 ("netlink: add net device refcount tracker to struct ethnl_req_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Johannes Berg <johannes@sipsolutions.net> Tested-by: Johannes Berg <johannes@sipsolutions.net> Link: https://lore.kernel.org/r/20220105170849.2610470-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05net/sched: add missing tracker information in qdisc_create()Eric Dumazet1-1/+1
qdisc_create() error path needs to use dev_put_track() because qdisc_alloc() allocated the tracker. Fixes: 606509f27f67 ("net/sched: add net device refcount tracker to struct Qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220104170439.3790052-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05net: dsa: remove cross-chip support for HSRVladimir Oltean3-49/+13
The cross-chip notifiers for HSR are bypass operations, meaning that even though all switches in a tree are notified, only the switch specified in the info structure is targeted. We can eliminate the unnecessary complexity by deleting the cross-chip notifier logic and calling the ds->ops straight from port.c. Cc: George McCollister <george.mccollister@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: George McCollister <george.mccollister@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: remove cross-chip support for MRPVladimir Oltean3-106/+20
The cross-chip notifiers for MRP are bypass operations, meaning that even though all switches in a tree are notified, only the switch specified in the info structure is targeted. We can eliminate the unnecessary complexity by deleting the cross-chip notifier logic and calling the ds->ops straight from port.c. Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: fix incorrect function pointer check for MRP ring rolesVladimir Oltean1-2/+2
The cross-chip notifier boilerplate code meant to check the presence of ds->ops->port_mrp_add_ring_role before calling it, but checked ds->ops->port_mrp_add instead, before calling ds->ops->port_mrp_add_ring_role. Therefore, a driver which implements one operation but not the other would trigger a NULL pointer dereference. There isn't any such driver in DSA yet, so there is no reason to backport the change. Issue found through code inspection. Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Fixes: c595c4330da0 ("net: dsa: add MRP support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: make dsa_switch :: num_ports an unsigned intVladimir Oltean1-1/+1
Currently, num_ports is declared as size_t, which is defined as __kernel_ulong_t, therefore it occupies 8 bytes of memory. Even switches with port numbers in the range of tens are exotic, so there is no need for this amount of storage. Additionally, because the max_num_bridges member right above it is also 4 bytes, it means the compiler needs to add padding between the last 2 fields. By reducing the size, we don't need that padding and can reduce the struct size. Before: pahole -C dsa_switch net/dsa/slave.o struct dsa_switch { struct device * dev; /* 0 8 */ struct dsa_switch_tree * dst; /* 8 8 */ unsigned int index; /* 16 4 */ u32 setup:1; /* 20: 0 4 */ u32 vlan_filtering_is_global:1; /* 20: 1 4 */ u32 needs_standalone_vlan_filtering:1; /* 20: 2 4 */ u32 configure_vlan_while_not_filtering:1; /* 20: 3 4 */ u32 untag_bridge_pvid:1; /* 20: 4 4 */ u32 assisted_learning_on_cpu_port:1; /* 20: 5 4 */ u32 vlan_filtering:1; /* 20: 6 4 */ u32 pcs_poll:1; /* 20: 7 4 */ u32 mtu_enforcement_ingress:1; /* 20: 8 4 */ /* XXX 23 bits hole, try to pack */ struct notifier_block nb; /* 24 24 */ /* XXX last struct has 4 bytes of padding */ void * priv; /* 48 8 */ void * tagger_data; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_chip_data * cd; /* 64 8 */ const struct dsa_switch_ops * ops; /* 72 8 */ u32 phys_mii_mask; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ struct mii_bus * slave_mii_bus; /* 88 8 */ unsigned int ageing_time_min; /* 96 4 */ unsigned int ageing_time_max; /* 100 4 */ struct dsa_8021q_context * tag_8021q_ctx; /* 104 8 */ struct devlink * devlink; /* 112 8 */ unsigned int num_tx_queues; /* 120 4 */ unsigned int num_lag_ids; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int max_num_bridges; /* 128 4 */ /* XXX 4 bytes hole, try to pack */ size_t num_ports; /* 136 8 */ /* size: 144, cachelines: 3, members: 27 */ /* sum members: 132, holes: 2, sum holes: 8 */ /* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 16 bytes */ }; After: pahole -C dsa_switch net/dsa/slave.o struct dsa_switch { struct device * dev; /* 0 8 */ struct dsa_switch_tree * dst; /* 8 8 */ unsigned int index; /* 16 4 */ u32 setup:1; /* 20: 0 4 */ u32 vlan_filtering_is_global:1; /* 20: 1 4 */ u32 needs_standalone_vlan_filtering:1; /* 20: 2 4 */ u32 configure_vlan_while_not_filtering:1; /* 20: 3 4 */ u32 untag_bridge_pvid:1; /* 20: 4 4 */ u32 assisted_learning_on_cpu_port:1; /* 20: 5 4 */ u32 vlan_filtering:1; /* 20: 6 4 */ u32 pcs_poll:1; /* 20: 7 4 */ u32 mtu_enforcement_ingress:1; /* 20: 8 4 */ /* XXX 23 bits hole, try to pack */ struct notifier_block nb; /* 24 24 */ /* XXX last struct has 4 bytes of padding */ void * priv; /* 48 8 */ void * tagger_data; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_chip_data * cd; /* 64 8 */ const struct dsa_switch_ops * ops; /* 72 8 */ u32 phys_mii_mask; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ struct mii_bus * slave_mii_bus; /* 88 8 */ unsigned int ageing_time_min; /* 96 4 */ unsigned int ageing_time_max; /* 100 4 */ struct dsa_8021q_context * tag_8021q_ctx; /* 104 8 */ struct devlink * devlink; /* 112 8 */ unsigned int num_tx_queues; /* 120 4 */ unsigned int num_lag_ids; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int max_num_bridges; /* 128 4 */ unsigned int num_ports; /* 132 4 */ /* size: 136, cachelines: 3, members: 27 */ /* sum members: 128, holes: 1, sum holes: 4 */ /* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 8 bytes */ }; Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04Merge tag 'mac80211-for-net-2022-01-04' of ↵Jakub Kicinski4-82/+55
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Two more changes: - mac80211: initialize a variable to avoid using it uninitialized - mac80211 mesh: put some data structures into the container to fix bugs with and not have to deal with allocation failures * tag 'mac80211-for-net-2022-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211: mac80211: mesh: embedd mesh_paths and mpp_paths into ieee80211_if_mesh mac80211: initialize variable have_higher_than_11mbit ==================== Link: https://lore.kernel.org/r/20220104144449.64937-1-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-04mac80211: use ieee80211_bss_get_elem()Johannes Berg1-7/+7
Instead of ieee80211_bss_get_ie(), use the more typed ieee80211_bss_get_elem(). Link: https://lore.kernel.org/r/20211220113609.56f8e2a70152.Id5a56afb8a4f9b38d10445e5a1874e93e84b5251@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: Add stations iterator where the iterator function may sleepMartin Blumenstingl1-0/+13
ieee80211_iterate_active_interfaces() and ieee80211_iterate_active_interfaces_atomic() already exist, where the former allows the iterator function to sleep. Add ieee80211_iterate_stations() which is similar to ieee80211_iterate_stations_atomic() but allows the iterator to sleep. This is needed for adding SDIO support to the rtw88 driver. Some interators there are reading or writing registers. With the SDIO ops (sdio_readb, sdio_writeb and friends) this means that the iterator function may sleep. Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Link: https://lore.kernel.org/r/20211228211501.468981-2-martin.blumenstingl@googlemail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: allow non-standard VHT MCS-10/11Ping-Ke Shih1-1/+1
Some AP can possibly try non-standard VHT rate and mac80211 warns and drops packets, and leads low TCP throughput. Rate marked as a VHT rate but data is invalid: MCS: 10, NSS: 2 WARNING: CPU: 1 PID: 7817 at net/mac80211/rx.c:4856 ieee80211_rx_list+0x223/0x2f0 [mac8021 Since commit c27aa56a72b8 ("cfg80211: add VHT rate entries for MCS-10 and MCS-11") has added, mac80211 adds this support as well. After this patch, throughput is good and iw can get the bitrate: rx bitrate: 975.1 MBit/s VHT-MCS 10 80MHz short GI VHT-NSS 2 or rx bitrate: 1083.3 MBit/s VHT-MCS 11 80MHz short GI VHT-NSS 2 Buglink: https://bugzilla.suse.com/show_bug.cgi?id=1192891 Reported-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://lore.kernel.org/r/20220103013623.17052-1-pkshih@realtek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: mesh: embedd mesh_paths and mpp_paths into ieee80211_if_meshPavel Skripkin3-81/+54
Syzbot hit NULL deref in rhashtable_free_and_destroy(). The problem was in mesh_paths and mpp_paths being NULL. mesh_pathtbl_init() could fail in case of memory allocation failure, but nobody cared, since ieee80211_mesh_init_sdata() returns void. It led to leaving 2 pointers as NULL. Syzbot has found null deref on exit path, but it could happen anywhere else, because code assumes these pointers are valid. Since all ieee80211_*_setup_sdata functions are void and do not fail, let's embedd mesh_paths and mpp_paths into parent struct to avoid adding error handling on higher levels and follow the pattern of others setup_sdata functions Fixes: 60854fd94573 ("mac80211: mesh: convert path table to rhashtable") Reported-and-tested-by: syzbot+860268315ba86ea6b96b@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Link: https://lore.kernel.org/r/20211230195547.23977-1-paskripkin@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: initialize variable have_higher_than_11mbitTom Rix1-1/+1
Clang static analysis reports this warnings mlme.c:5332:7: warning: Branch condition evaluates to a garbage value have_higher_than_11mbit) ^~~~~~~~~~~~~~~~~~~~~~~ have_higher_than_11mbit is only set to true some of the time in ieee80211_get_rates() but is checked all of the time. So have_higher_than_11mbit needs to be initialized to false. Fixes: 5d6a1b069b7f ("mac80211: set basic rates earlier") Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20211223162848.3243702-1-trix@redhat.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04Namespaceify mtu_expires sysctlxu xin1-10/+11
This patch enables the sysctl mtu_expires to be configured per net namespace. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04Namespaceify min_pmtu sysctlxu xin1-16/+37
This patch enables the sysctl min_pmtu to be configured per net namespace. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04sch_qfq: prevent shift-out-of-bounds in qfq_init_qdiscEric Dumazet1-4/+2
tx_queue_len can be set to ~0U, we need to be more careful about overflows. __fls(0) is undefined, as this report shows: UBSAN: shift-out-of-bounds in net/sched/sch_qfq.c:1430:24 shift exponent 51770272 is too large for 32-bit type 'int' CPU: 0 PID: 25574 Comm: syz-executor.0 Not tainted 5.16.0-rc7-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x201/0x2d8 lib/dump_stack.c:106 ubsan_epilogue lib/ubsan.c:151 [inline] __ubsan_handle_shift_out_of_bounds+0x494/0x530 lib/ubsan.c:330 qfq_init_qdisc+0x43f/0x450 net/sched/sch_qfq.c:1430 qdisc_create+0x895/0x1430 net/sched/sch_api.c:1253 tc_modify_qdisc+0x9d9/0x1e20 net/sched/sch_api.c:1660 rtnetlink_rcv_msg+0x934/0xe60 net/core/rtnetlink.c:5571 netlink_rcv_skb+0x200/0x470 net/netlink/af_netlink.c:2496 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x814/0x9f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0xaea/0xe60 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg net/socket.c:724 [inline] ____sys_sendmsg+0x5b9/0x910 net/socket.c:2409 ___sys_sendmsg net/socket.c:2463 [inline] __sys_sendmsg+0x280/0x370 net/socket.c:2492 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04netrom: fix copying in user data in nr_setsockoptChristoph Hellwig1-1/+1
This code used to copy in an unsigned long worth of data before the sockptr_t conversion, so restore that. Fixes: a7b75c5a8c41 ("net: pass a sockptr_t into ->setsockopt") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04udp6: Use Segment Routing Header for dest address if presentAndrew Lunn1-1/+2
When finding the socket to report an error on, if the invoking packet is using Segment Routing, the IPv6 destination address is that of an intermediate router, not the end destination. Extract the ultimate destination address from the segment address. This change allows traceroute to function in the presence of Segment Routing. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04icmp: ICMPV6: Examine invoking packet for Segment Route Headers.Andrew Lunn2-1/+35
RFC8754 says: ICMP error packets generated within the SR domain are sent to source nodes within the SR domain. The invoking packet in the ICMP error message may contain an SRH. Since the destination address of a packet with an SRH changes as each segment is processed, it may not be the destination used by the socket or application that generated the invoking packet. For the source of an invoking packet to process the ICMP error message, the ultimate destination address of the IPv6 header may be required. The following logic is used to determine the destination address for use by protocol-error handlers. * Walk all extension headers of the invoking IPv6 packet to the routing extension header preceding the upper-layer header. - If routing header is type 4 Segment Routing Header (SRH) o The SID at Segment List[0] may be used as the destination address of the invoking packet. Mangle the skb so the network header points to the invoking packet inside the ICMP packet. The seg6 helpers can then be used on the skb to find any segment routing headers. If found, mark this fact in the IPv6 control block of the skb, and store the offset into the packet of the SRH. Then restore the skb back to its old state. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04seg6: export get_srh() for ICMP handlingAndrew Lunn2-31/+31
An ICMP error message can contain in its message body part of an IPv6 packet which invoked the error. Such a packet might contain a segment router header. Export get_srh() so the ICMP code can make use of it. Since his changes the scope of the function from local to global, add the seg6_ prefix to keep the namespace clean. And move it into seg6.c so it is always available, not just when IPV6_SEG6_LWTUNNEL is enabled. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04net: openvswitch: Fill act ct extensionPaul Blakey1-0/+6
To give drivers the originating device information for optimized connection tracking offload, fill in act ct extension with ifindex from skb. Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04net/sched: act_ct: Fill offloading tuple iifidxPaul Blakey2-1/+32
Driver offloading ct tuples can use the information of which devices received the packets that created the offloaded connections, to more efficiently offload them only to the relevant device. Add new act_ct nf conntrack extension, which is used to store the skb devices before offloading the connection, and then fill in the tuple iifindex so drivers can get the device via metadata dissector match. Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04Merge tag 'batadv-next-pullrequest-20220103' of ↵Jakub Kicinski3-22/+18
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - allow netlink usage in unprivileged containers, by Linus Lüssing - remove unneeded variable, by Minghao Chi * tag 'batadv-next-pullrequest-20220103' of git://git.open-mesh.org/linux-merge: batman-adv: remove unneeded variable in batadv_nc_init batman-adv: allow netlink usage in unprivileged containers batman-adv: Start new development cycle ==================== Link: https://lore.kernel.org/r/20220103171722.1126109-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-04Merge tag 'batadv-net-pullrequest-20220103' of ↵Jakub Kicinski3-11/+21
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here is a batman-adv bugfix: - avoid sending link-local multicast to multicast routers, by Linus Lüssing * tag 'batadv-net-pullrequest-20220103' of git://git.open-mesh.org/linux-merge: batman-adv: mcast: don't send link-local multicast to mcast routers ==================== Link: https://lore.kernel.org/r/20220103171203.1124980-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-03ipv6: Do cleanup if attribute validation fails in multipath routeDavid Ahern1-5/+3
As Nicolas noted, if gateway validation fails walking the multipath attribute the code should jump to the cleanup to free previously allocated memory. Fixes: 1ff15a710a86 ("ipv6: Check attribute length for RTA_GATEWAY when deleting multipath route") Signed-off-by: David Ahern <dsahern@kernel.org> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20220103170555.94638-1-dsahern@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-03ipv6: Continue processing multipath route even if gateway attribute is invalidDavid Ahern1-2/+5
ip6_route_multipath_del loop continues processing the multipath attribute even if delete of a nexthop path fails. For consistency, do the same if the gateway attribute is invalid. Fixes: 1ff15a710a86 ("ipv6: Check attribute length for RTA_GATEWAY when deleting multipath route") Signed-off-by: David Ahern <dsahern@kernel.org> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20220103171911.94739-1-dsahern@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-02net/smc: add comments for smc_link_{usable|sendable}Dust Li1-1/+16
Add comments for both smc_link_sendable() and smc_link_usable() to help better distinguish and use them. No function changes. Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02sctp: hold endpoint before calling cb in sctp_transport_lookup_processXin Long2-32/+36
The same fix in commit 5ec7d18d1813 ("sctp: use call_rcu to free endpoint") is also needed for dumping one asoc and sock after the lookup. Fixes: 86fdb3448cc1 ("sctp: ensure ep is not destroyed before doing the dump") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net: socket.c: style fixHamish MacDonald1-1/+1
Removed spaces and added a tab that was causing an error on checkpatch Signed-off-by: Hamish MacDonald <elusivenode@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02mctp: Remove only static neighbour on RTM_DELNEIGHGagan Kumar1-4/+5
Add neighbour source flag in mctp_neigh_remove(...) to allow removal of only static neighbours. This should be a no-op change and might be useful later when mctp can have MCTP_NEIGH_DISCOVER neighbours. Signed-off-by: Gagan Kumar <gagan1kumar.cs@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02ipv6: ioam: Support for Queue depth data fieldJustin Iurman1-1/+15
v3: - Report 'backlog' (bytes) instead of 'qlen' (number of packets) v2: - Fix sparse warning (use rcu_dereference) This patch adds support for the queue depth in IOAM trace data fields. The draft [1] says the following: The "queue depth" field is a 4-octet unsigned integer field. This field indicates the current length of the egress interface queue of the interface from where the packet is forwarded out. The queue depth is expressed as the current amount of memory buffers used by the queue (a packet could consume one or more memory buffers, depending on its size). An existing function (i.e., qdisc_qstats_qlen_backlog) is used to retrieve the current queue length without reinventing the wheel. Note: it was tested and qlen is increasing when an artificial delay is added on the egress with tc. [1] https://datatracker.ietf.org/doc/html/draft-ietf-ippm-ioam-data#section-5.4.2.7 Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: remove redundant re-assignment of pointer linkColin Ian King1-1/+0
The pointer link is being re-assigned the same value that it was initialized with in the previous declaration statement. The re-assignment is redundant and can be removed. Fixes: 387707fdf486 ("net/smc: convert static link ID to dynamic references") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: Introduce TCP ULP supportTony Lu1-7/+86
This implements TCP ULP for SMC, helps applications to replace TCP with SMC protocol in place. And we use it to implement transparent replacement. This replaces original TCP sockets with SMC, reuse TCP as clcsock when calling setsockopt with TCP_ULP option, and without any overhead. To replace TCP sockets with SMC, there are two approaches: - use setsockopt() syscall with TCP_ULP option, if error, it would fallback to TCP. - use BPF prog with types BPF_CGROUP_INET_SOCK_CREATE or others to replace transparently. BPF hooks some points in create socket, bind and others, users can inject their BPF logics without modifying their applications, and choose which connections should be replaced with SMC by calling setsockopt() in BPF prog, based on rules, such as TCP tuples, PID, cgroup, etc... BPF doesn't support calling setsockopt with TCP_ULP now, I will send the patches after this accepted. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: Add net namespace for tracepointsTony Lu1-7/+16
This prints net namespace ID, helps us to distinguish different net namespaces when using tracepoints. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: Print net namespace in logTony Lu2-9/+14
This adds net namespace ID to the kernel log, net_cookie is unique in the whole system. It is useful in container environment. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: Add netlink net namespace supportTony Lu2-7/+12
This adds net namespace ID to diag of linkgroup, helps us to distinguish different namespaces, and net_cookie is unique in the whole system. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: Introduce net namespace support for linkgroupTony Lu4-12/+42
Currently, rdma device supports exclusive net namespace isolation, however linkgroup doesn't know and support ibdev net namespace. Applications in the containers don't want to share the nics if we enabled rdma exclusive mode. Every net namespaces should have their own linkgroups. This patch introduce a new field net for linkgroup, which is standing for the ibdev net namespace in the linkgroup. The net in linkgroup is initialized with the net namespace of link's ibdev. It compares the net of linkgroup and sock or ibdev before choose it, if no matched, create new one in current net namespace. If rdma net namespace exclusive mode is not enabled, it behaves as before. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02batman-adv: mcast: don't send link-local multicast to mcast routersLinus Lüssing3-11/+21
The addition of routable multicast TX handling introduced a bug/regression for packets with a link-local multicast destination: These packets would be sent to all batman-adv nodes with a multicast router and to all batman-adv nodes with an old version without multicast router detection. This even disregards the batman-adv multicast fanout setting, which can potentially lead to an unwanted, high number of unicast transmissions or even congestion. Fixing this by avoiding to send link-local multicast packets to nodes in the multicast router list. Fixes: 11d458c1cb9b ("batman-adv: mcast: apply optimizations for routable packets, too") Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2022-01-01net ticp:fix a kernel-infoleak in __tipc_sendmsg()Haimin Zhang1-0/+2
struct tipc_socket_addr.ref has a 4-byte hole,and __tipc_getname() currently copying it to user space,causing kernel-infoleak. BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:121 [inline] BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:121 [inline] lib/usercopy.c:33 BUG: KMSAN: kernel-infoleak in _copy_to_user+0x1c9/0x270 lib/usercopy.c:33 lib/usercopy.c:33 instrument_copy_to_user include/linux/instrumented.h:121 [inline] instrument_copy_to_user include/linux/instrumented.h:121 [inline] lib/usercopy.c:33 _copy_to_user+0x1c9/0x270 lib/usercopy.c:33 lib/usercopy.c:33 copy_to_user include/linux/uaccess.h:209 [inline] copy_to_user include/linux/uaccess.h:209 [inline] net/socket.c:287 move_addr_to_user+0x3f6/0x600 net/socket.c:287 net/socket.c:287 __sys_getpeername+0x470/0x6b0 net/socket.c:1987 net/socket.c:1987 __do_sys_getpeername net/socket.c:1997 [inline] __se_sys_getpeername net/socket.c:1994 [inline] __do_sys_getpeername net/socket.c:1997 [inline] net/socket.c:1994 __se_sys_getpeername net/socket.c:1994 [inline] net/socket.c:1994 __x64_sys_getpeername+0xda/0x120 net/socket.c:1994 net/socket.c:1994 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_x64 arch/x86/entry/common.c:51 [inline] arch/x86/entry/common.c:82 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was stored to memory at: tipc_getname+0x575/0x5e0 net/tipc/socket.c:757 net/tipc/socket.c:757 __sys_getpeername+0x3b3/0x6b0 net/socket.c:1984 net/socket.c:1984 __do_sys_getpeername net/socket.c:1997 [inline] __se_sys_getpeername net/socket.c:1994 [inline] __do_sys_getpeername net/socket.c:1997 [inline] net/socket.c:1994 __se_sys_getpeername net/socket.c:1994 [inline] net/socket.c:1994 __x64_sys_getpeername+0xda/0x120 net/socket.c:1994 net/socket.c:1994 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_x64 arch/x86/entry/common.c:51 [inline] arch/x86/entry/common.c:82 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was stored to memory at: msg_set_word net/tipc/msg.h:212 [inline] msg_set_destport net/tipc/msg.h:619 [inline] msg_set_word net/tipc/msg.h:212 [inline] net/tipc/socket.c:1486 msg_set_destport net/tipc/msg.h:619 [inline] net/tipc/socket.c:1486 __tipc_sendmsg+0x44fa/0x5890 net/tipc/socket.c:1486 net/tipc/socket.c:1486 tipc_sendmsg+0xeb/0x140 net/tipc/socket.c:1402 net/tipc/socket.c:1402 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg net/socket.c:724 [inline] sock_sendmsg_nosec net/socket.c:704 [inline] net/socket.c:2409 sock_sendmsg net/socket.c:724 [inline] net/socket.c:2409 ____sys_sendmsg+0xe11/0x12c0 net/socket.c:2409 net/socket.c:2409 ___sys_sendmsg net/socket.c:2463 [inline] ___sys_sendmsg net/socket.c:2463 [inline] net/socket.c:2492 __sys_sendmsg+0x704/0x840 net/socket.c:2492 net/socket.c:2492 __do_sys_sendmsg net/socket.c:2501 [inline] __se_sys_sendmsg net/socket.c:2499 [inline] __do_sys_sendmsg net/socket.c:2501 [inline] net/socket.c:2499 __se_sys_sendmsg net/socket.c:2499 [inline] net/socket.c:2499 __x64_sys_sendmsg+0xe2/0x120 net/socket.c:2499 net/socket.c:2499 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_x64 arch/x86/entry/common.c:51 [inline] arch/x86/entry/common.c:82 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Local variable skaddr created at: __tipc_sendmsg+0x2d0/0x5890 net/tipc/socket.c:1419 net/tipc/socket.c:1419 tipc_sendmsg+0xeb/0x140 net/tipc/socket.c:1402 net/tipc/socket.c:1402 Bytes 4-7 of 16 are uninitialized Memory access of size 16 starts at ffff888113753e00 Data copied to user address 0000000020000280 Reported-by: syzbot+cdbd40e0c3ca02cae3b7@syzkaller.appspotmail.com Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Link: https://lore.kernel.org/r/1640918123-14547-1-git-send-email-tcs.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-01Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski1-2/+2
Daniel Borkmann says: ==================== pull-request: bpf 2021-12-31 We've added 2 non-merge commits during the last 14 day(s) which contain a total of 2 files changed, 3 insertions(+), 3 deletions(-). The main changes are: 1) Revert of an earlier attempt to fix xsk's poll() behavior where it turned out that the fix for a rare problem made it much worse in general, from Magnus Karlsson. (Fyi, Magnus mentioned that a proper fix is coming early next year, so the revert is mainly to avoid slipping the behavior into 5.16.) 2) Minor misc spell fix in BPF selftests, from Colin Ian King. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, selftests: Fix spelling mistake "tained" -> "tainted" Revert "xsk: Do not sleep in poll() when need_wakeup set" ==================== Link: https://lore.kernel.org/r/20211231160050.16105-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>