summaryrefslogtreecommitdiff
path: root/net/core/rtnetlink.c
AgeCommit message (Collapse)AuthorFilesLines
2023-06-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-4/+4
Merge in late fixes to prepare for the 6.5 net-next PR. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-23netlink: do not hard code device address lenth in fdb dumpsEric Dumazet1-4/+4
syzbot reports that some netdev devices do not have a six bytes address [1] Replace ETH_ALEN by dev->addr_len. [1] (Case of a device where dev->addr_len = 4) BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline] BUG: KMSAN: kernel-infoleak in copyout+0xb8/0x100 lib/iov_iter.c:169 instrument_copy_to_user include/linux/instrumented.h:114 [inline] copyout+0xb8/0x100 lib/iov_iter.c:169 _copy_to_iter+0x6d8/0x1d00 lib/iov_iter.c:536 copy_to_iter include/linux/uio.h:206 [inline] simple_copy_to_iter+0x68/0xa0 net/core/datagram.c:513 __skb_datagram_iter+0x123/0xdc0 net/core/datagram.c:419 skb_copy_datagram_iter+0x5c/0x200 net/core/datagram.c:527 skb_copy_datagram_msg include/linux/skbuff.h:3960 [inline] netlink_recvmsg+0x4ae/0x15a0 net/netlink/af_netlink.c:1970 sock_recvmsg_nosec net/socket.c:1019 [inline] sock_recvmsg net/socket.c:1040 [inline] ____sys_recvmsg+0x283/0x7f0 net/socket.c:2722 ___sys_recvmsg+0x223/0x840 net/socket.c:2764 do_recvmmsg+0x4f9/0xfd0 net/socket.c:2858 __sys_recvmmsg net/socket.c:2937 [inline] __do_sys_recvmmsg net/socket.c:2960 [inline] __se_sys_recvmmsg net/socket.c:2953 [inline] __x64_sys_recvmmsg+0x397/0x490 net/socket.c:2953 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Uninit was stored to memory at: __nla_put lib/nlattr.c:1009 [inline] nla_put+0x1c6/0x230 lib/nlattr.c:1067 nlmsg_populate_fdb_fill+0x2b8/0x600 net/core/rtnetlink.c:4071 nlmsg_populate_fdb net/core/rtnetlink.c:4418 [inline] ndo_dflt_fdb_dump+0x616/0x840 net/core/rtnetlink.c:4456 rtnl_fdb_dump+0x14ff/0x1fc0 net/core/rtnetlink.c:4629 netlink_dump+0x9d1/0x1310 net/netlink/af_netlink.c:2268 netlink_recvmsg+0xc5c/0x15a0 net/netlink/af_netlink.c:1995 sock_recvmsg_nosec+0x7a/0x120 net/socket.c:1019 ____sys_recvmsg+0x664/0x7f0 net/socket.c:2720 ___sys_recvmsg+0x223/0x840 net/socket.c:2764 do_recvmmsg+0x4f9/0xfd0 net/socket.c:2858 __sys_recvmmsg net/socket.c:2937 [inline] __do_sys_recvmmsg net/socket.c:2960 [inline] __se_sys_recvmmsg net/socket.c:2953 [inline] __x64_sys_recvmmsg+0x397/0x490 net/socket.c:2953 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Uninit was created at: slab_post_alloc_hook+0x12d/0xb60 mm/slab.h:716 slab_alloc_node mm/slub.c:3451 [inline] __kmem_cache_alloc_node+0x4ff/0x8b0 mm/slub.c:3490 kmalloc_trace+0x51/0x200 mm/slab_common.c:1057 kmalloc include/linux/slab.h:559 [inline] __hw_addr_create net/core/dev_addr_lists.c:60 [inline] __hw_addr_add_ex+0x2e5/0x9e0 net/core/dev_addr_lists.c:118 __dev_mc_add net/core/dev_addr_lists.c:867 [inline] dev_mc_add+0x9a/0x130 net/core/dev_addr_lists.c:885 igmp6_group_added+0x267/0xbc0 net/ipv6/mcast.c:680 ipv6_mc_up+0x296/0x3b0 net/ipv6/mcast.c:2754 ipv6_mc_remap+0x1e/0x30 net/ipv6/mcast.c:2708 addrconf_type_change net/ipv6/addrconf.c:3731 [inline] addrconf_notify+0x4d3/0x1d90 net/ipv6/addrconf.c:3699 notifier_call_chain kernel/notifier.c:93 [inline] raw_notifier_call_chain+0xe4/0x430 kernel/notifier.c:461 call_netdevice_notifiers_info net/core/dev.c:1935 [inline] call_netdevice_notifiers_extack net/core/dev.c:1973 [inline] call_netdevice_notifiers+0x1ee/0x2d0 net/core/dev.c:1987 bond_enslave+0xccd/0x53f0 drivers/net/bonding/bond_main.c:1906 do_set_master net/core/rtnetlink.c:2626 [inline] rtnl_newlink_create net/core/rtnetlink.c:3460 [inline] __rtnl_newlink net/core/rtnetlink.c:3660 [inline] rtnl_newlink+0x378c/0x40e0 net/core/rtnetlink.c:3673 rtnetlink_rcv_msg+0x16a6/0x1840 net/core/rtnetlink.c:6395 netlink_rcv_skb+0x371/0x650 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x34/0x40 net/core/rtnetlink.c:6413 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0xf28/0x1230 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x122f/0x13d0 net/netlink/af_netlink.c:1913 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x999/0xd50 net/socket.c:2503 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2557 __sys_sendmsg net/socket.c:2586 [inline] __do_sys_sendmsg net/socket.c:2595 [inline] __se_sys_sendmsg net/socket.c:2593 [inline] __x64_sys_sendmsg+0x304/0x490 net/socket.c:2593 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Bytes 2856-2857 of 3500 are uninitialized Memory access of size 3500 starts at ffff888018d99104 Data copied to user address 0000000020000480 Fixes: d83b06036048 ("net: add fdb generic dump routine") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230621174720.1845040-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-15rtnetlink: move validate_linkmsg out of do_setlinkXin Long1-41/+42
This patch moves validate_linkmsg() out of do_setlink() to its callers and deletes the early validate_linkmsg() call in __rtnl_newlink(), so that it will not call validate_linkmsg() twice in either of the paths: - __rtnl_newlink() -> do_setlink() - __rtnl_newlink() -> rtnl_newlink_create() -> rtnl_create_link() Additionally, as validate_linkmsg() is now only called with a real dev, we can remove the NULL check for dev in validate_linkmsg(). Note that we moved validate_linkmsg() check to the places where it has not done any changes to the dev, as Jakub suggested. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/cf2ef061e08251faf9e8be25ff0d61150c030475.1686585334.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-14rtnetlink: extend RTEXT_FILTER_SKIP_STATS to IFLA_VF_INFOEdwin Peer1-45/+51
This filter already exists for excluding IPv6 SNMP stats. Extend its definition to also exclude IFLA_VF_INFO stats in RTM_GETLINK. This patch constitutes a partial fix for a netlink attribute nesting overflow bug in IFLA_VFINFO_LIST. By excluding the stats when the requester doesn't need them, the truncation of the VF list is avoided. While it was technically only the stats added in commit c5a9f6f0ab40 ("net/core: Add drop counters to VF statistics") breaking the camel's back, the appreciable size of the stats data should never have been included without due consideration for the maximum number of VFs supported by PCI. Fixes: 3b766cd83232 ("net/core: Add reading VF statistics through the PF netdevice") Fixes: c5a9f6f0ab40 ("net/core: Add drop counters to VF statistics") Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Cc: Edwin Peer <espeer@gmail.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://lore.kernel.org/r/20230611105108.122586-1-gal@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-01rtnetlink: add the missing IFLA_GRO_ tb check in validate_linkmsgXin Long1-0/+12
This fixes the issue that dev gro_max_size and gso_ipv4_max_size can be set to a huge value: # ip link add dummy1 type dummy # ip link set dummy1 gro_max_size 4294967295 # ip -d link show dummy1 dummy addrgenmode eui64 ... gro_max_size 4294967295 Fixes: 0fe79f28bfaf ("net: allow gro_max_size to exceed 65536") Fixes: 9eefedd58ae1 ("net: add gso_ipv4_max_size and gro_ipv4_max_size per device") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01rtnetlink: move IFLA_GSO_ tb check to validate_linkmsgXin Long1-15/+19
These IFLA_GSO_* tb check should also be done for the new created link, otherwise, they can be set to a huge value when creating links: # ip link add dummy1 gso_max_size 4294967295 type dummy # ip -d link show dummy1 dummy addrgenmode eui64 ... gso_max_size 4294967295 Fixes: 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device creation") Fixes: 9eefedd58ae1 ("net: add gso_ipv4_max_size and gro_ipv4_max_size per device") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01rtnetlink: call validate_linkmsg in rtnl_create_linkXin Long1-1/+7
validate_linkmsg() was introduced by commit 1840bb13c22f5b ("[RTNL]: Validate hardware and broadcast address attribute for RTM_NEWLINK") to validate tb[IFLA_ADDRESS/BROADCAST] for existing links. The same check should also be done for newly created links. This patch adds validate_linkmsg() call in rtnl_create_link(), to avoid the invalid address set when creating some devices like: # ip link add dummy0 type dummy # ip link add link dummy0 name mac0 address 01:02 type macsec Fixes: 0e06877c6fdb ("[RTNETLINK]: rtnl_link: allow specifying initial device address") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21bridge: Allow setting per-{Port, VLAN} neighbor suppression stateIdo Schimmel1-1/+1
Add a new bridge port attribute that allows user space to enable per-{Port, VLAN} neighbor suppression. Example: # bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]' false # bridge link set dev swp1 neigh_vlan_suppress on # bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]' true # bridge link set dev swp1 neigh_vlan_suppress off # bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]' false Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+9
Conflicts: tools/testing/selftests/net/config 62199e3f1658 ("selftests: net: Add VXLAN MDB test") 3a0385be133e ("selftests: add the missing CONFIG_IP_SCTP in net config") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13rtnetlink: Restore RTM_NEW/DELLINK notification behaviorMartin Willi1-2/+9
The commits referenced below allows userspace to use the NLM_F_ECHO flag for RTM_NEW/DELLINK operations to receive unicast notifications for the affected link. Prior to these changes, applications may have relied on multicast notifications to learn the same information without specifying the NLM_F_ECHO flag. For such applications, the mentioned commits changed the behavior for requests not using NLM_F_ECHO. Multicast notifications are still received, but now use the portid of the requester and the sequence number of the request instead of zero values used previously. For the application, this message may be unexpected and likely handled as a response to the NLM_F_ACKed request, especially if it uses the same socket to handle requests and notifications. To fix existing applications relying on the old notification behavior, set the portid and sequence number in the notification only if the request included the NLM_F_ECHO flag. This restores the old behavior for applications not using it, but allows unicasted notifications for others. Fixes: f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link") Fixes: d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create") Signed-off-by: Martin Willi <martin@strongswan.org> Acked-by: Guillaume Nault <gnault@redhat.com> Acked-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20230411074319.24133-1-martin@strongswan.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-29net: dst: Switch to rcuref_t reference countingThomas Gleixner1-1/+1
Under high contention dst_entry::__refcnt becomes a significant bottleneck. atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into high retry rates on contention. Switch the reference count to rcuref_t which results in a significant performance gain. Rename the reference count member to __rcuref to reflect the change. The gain depends on the micro-architecture and the number of concurrent operations and has been measured in the range of +25% to +130% with a localhost memtier/memcached benchmark which amplifies the problem massively. Running the memtier/memcached benchmark over a real (1Gb) network connection the conversion on top of the false sharing fix for struct dst_entry::__refcnt results in a total gain in the 2%-5% range over the upstream baseline. Reported-by: Wangyang Guo <wangyang.guo@intel.com> Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de Link: https://lore.kernel.org/r/20230323102800.215027837@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-17rtnetlink: bridge: mcast: Relax group address validation in common codeIdo Schimmel1-2/+3
In the upcoming VXLAN MDB implementation, the 0.0.0.0 and :: MDB entries will act as catchall entries for unregistered IP multicast traffic in a similar fashion to the 00:00:00:00:00:00 VXLAN FDB entry that is used to transmit BUM traffic. In deployments where inter-subnet multicast forwarding is used, not all the VTEPs in a tenant domain are members in all the broadcast domains. It is therefore advantageous to transmit BULL (broadcast, unknown unicast and link-local multicast) and unregistered IP multicast traffic on different tunnels. If the same tunnel was used, a VTEP only interested in IP multicast traffic would also pull all the BULL traffic and drop it as it is not a member in the originating broadcast domain [1]. Prepare for this change by allowing the 0.0.0.0 group address in the common rtnetlink MDB code and forbid it in the bridge driver. A similar change is not needed for IPv6 because the common code only validates that the group address is not the all-nodes address. [1] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast#section-2.6 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17rtnetlink: bridge: mcast: Move MDB handlers out of bridge driverIdo Schimmel1-0/+217
Currently, the bridge driver registers handlers for MDB netlink messages, making it impossible for other drivers to implement MDB support. As a preparation for VXLAN MDB support, move the MDB handlers out of the bridge driver to the core rtnetlink code. The rtnetlink code will call into individual drivers by invoking their previously added MDB net device operations. Note that while the diffstat is large, the change is mechanical. It moves code out of the bridge driver to rtnetlink code. Also note that a similar change was made in 2012 with commit 77162022ab26 ("net: add generic PF_BRIDGE:RTM_ FDB hooks") that moved FDB handlers out of the bridge driver to the core rtnetlink code. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-06net: bridge: Add netlink knobs for number / maximum MDB entriesPetr Machata1-1/+1
The previous patch added accounting for number of MDB entries per port and per port-VLAN, and the logic to verify that these values stay within configured bounds. However it didn't provide means to actually configure those bounds or read the occupancy. This patch does that. Two new netlink attributes are added for the MDB occupancy: IFLA_BRPORT_MCAST_N_GROUPS for the per-port occupancy and BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS for the per-port-VLAN occupancy. And another two for the maximum number of MDB entries: IFLA_BRPORT_MCAST_MAX_GROUPS for the per-port maximum, and BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS for the per-port-VLAN one. Note that the two new IFLA_BRPORT_ attributes prompt bumping of RTNL_SLAVE_MAX_TYPE to size the slave attribute tables large enough. The new attributes are used like this: # ip link add name br up type bridge vlan_filtering 1 mcast_snooping 1 \ mcast_vlan_snooping 1 mcast_querier 1 # ip link set dev v1 master br # bridge vlan add dev v1 vid 2 # bridge vlan set dev v1 vid 1 mcast_max_groups 1 # bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1 # bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1 Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1. # bridge link set dev v1 mcast_max_groups 1 # bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 2 Error: bridge: Port is already in 1 groups, and mcast_max_groups=1. # bridge -d link show 5: v1@v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br [...] [...] mcast_n_groups 1 mcast_max_groups 1 # bridge -d vlan show port vlan-id br 1 PVID Egress Untagged state forwarding mcast_router 1 v1 1 PVID Egress Untagged [...] mcast_n_groups 1 mcast_max_groups 1 2 [...] mcast_n_groups 0 mcast_max_groups 0 Signed-off-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-02net: add gso_ipv4_max_size and gro_ipv4_max_size per deviceXin Long1-0/+33
This patch introduces gso_ipv4_max_size and gro_ipv4_max_size per device and adds netlink attributes for them, so that IPV4 BIG TCP can be guarded by a separate tunable in the next patch. To not break the old application using "gso/gro_max_size" for IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size" in netif_set_gso/gro_max_size() if the new size isn't greater than GSO_LEGACY_MAX_SIZE, so that nothing will change even if userspace doesn't realize the new netlink attributes. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-04net: expose devlink port over rtnetlinkJiri Pirko1-0/+39
Expose devlink port handle related to netdev over rtnetlink. Introduce a new nested IFLA attribute to carry the info. Call into devlink code to fill-up the nest with existing devlink attributes that are used over devlink netlink. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-04bridge: Add MAC Authentication Bypass (MAB) supportHans J. Schultz1-0/+5
Hosts that support 802.1X authentication are able to authenticate themselves by exchanging EAPOL frames with an authenticator (Ethernet bridge, in this case) and an authentication server. Access to the network is only granted by the authenticator to successfully authenticated hosts. The above is implemented in the bridge using the "locked" bridge port option. When enabled, link-local frames (e.g., EAPOL) can be locally received by the bridge, but all other frames are dropped unless the host is authenticated. That is, unless the user space control plane installed an FDB entry according to which the source address of the frame is located behind the locked ingress port. The entry can be dynamic, in which case learning needs to be enabled so that the entry will be refreshed by incoming traffic. There are deployments in which not all the devices connected to the authenticator (the bridge) support 802.1X. Such devices can include printers and cameras. One option to support such deployments is to unlock the bridge ports connecting these devices, but a slightly more secure option is to use MAB. When MAB is enabled, the MAC address of the connected device is used as the user name and password for the authentication. For MAB to work, the user space control plane needs to be notified about MAC addresses that are trying to gain access so that they will be compared against an allow list. This can be implemented via the regular learning process with the sole difference that learned FDB entries are installed with a new "locked" flag indicating that the entry cannot be used to authenticate the device. The flag cannot be set by user space, but user space can clear the flag by replacing the entry, thereby authenticating the device. Locked FDB entries implement the following semantics with regards to roaming, aging and forwarding: 1. Roaming: Locked FDB entries can roam to unlocked (authorized) ports, in which case the "locked" flag is cleared. FDB entries cannot roam to locked ports regardless of MAB being enabled or not. Therefore, locked FDB entries are only created if an FDB entry with the given {MAC, VID} does not already exist. This behavior prevents unauthenticated devices from disrupting traffic destined to already authenticated devices. 2. Aging: Locked FDB entries age and refresh by incoming traffic like regular entries. 3. Forwarding: Locked FDB entries forward traffic like regular entries. If user space detects an unauthorized MAC behind a locked port and wishes to prevent traffic with this MAC DA from reaching the host, it can do so using tc or a different mechanism. Enable the above behavior using a new bridge port option called "mab". It can only be enabled on a bridge port that is both locked and has learning enabled. Locked FDB entries are flushed from the port once MAB is disabled. A new option is added because there are pure 802.1X deployments that are not interested in notifications about locked FDB entries. Signed-off-by: Hans J. Schultz <netdev@kapio-technology.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-01rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_linkHangbin Liu1-3/+4
This patch use the new helper unregister_netdevice_many_notify() for rtnl_delete_link(), so that the kernel could reply unicast when userspace set NLM_F_ECHO flag to request the new created interface info. At the same time, the parameters of rtnl_delete_link() need to be updated since we need nlmsghdr and portid info. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-01rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_createHangbin Liu1-2/+4
This patch pass the netlink header message in rtnl_newlink_create() to the new updated rtnl_configure_link(), so that the kernel could reply unicast when userspace set NLM_F_ECHO flag to request the new created interface info. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-01rtnetlink: pass netlink message header and portid to rtnl_configure_link()Hangbin Liu1-16/+19
This patch pass netlink message header and portid to rtnl_configure_link() All the functions in this call chain need to add the parameters so we can use them in the last call rtnl_notify(), and notify the userspace about the new link info if NLM_F_ECHO flag is set. - rtnl_configure_link() - __dev_notify_flags() - rtmsg_ifinfo() - rtmsg_ifinfo_event() - rtmsg_ifinfo_build_skb() - rtmsg_ifinfo_send() - rtnl_notify() Also move __dev_notify_flags() declaration to net/core/dev.h, as Jakub suggested. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-20net: rtnetlink: Enslave device before bringing it upPhil Sutter1-7/+7
Unlike with bridges, one can't add an interface to a bond and set it up at the same time: | # ip link set dummy0 down | # ip link set dummy0 master bond0 up | Error: Device can not be enslaved while up. Of all drivers with ndo_add_slave callback, bond and team decline if IFF_UP flag is set, vrf cycles the interface (i.e., sets it down and immediately up again) and the others just don't care. Support the common notion of setting the interface up after enslaving it by sorting the operations accordingly. Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220914150623.24152-1-phil@nwl.cc Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-16rtnetlink: advertise allmulti counterNicolas Dichtel1-0/+3
Like what was done with IFLA_PROMISCUITY, add IFLA_ALLMULTI to advertise the allmulti counter. The flag IFF_ALLMULTI is advertised only if it was directly set by a userland app. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-02net: rtnetlink: use netif_oper_up instead of open codeJuhee Kang1-4/+2
The open code is defined as a new helper function(netif_oper_up) on netdev.h, the code is dev->operstate == IF_OPER_UP || dev->operstate == IF_OPER_UNKNOWN. Thus, replace the open code to netif_oper_up. This patch doesn't change logic. Signed-off-by: Juhee Kang <claudiajkang@gmail.com> Link: https://lore.kernel.org/r/20220831125845.1333-1-claudiajkang@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-16net: rtnetlink: fix module reference count leak issue in rtnetlink_rcv_msgZhengchao Shao1-0/+1
When bulk delete command is received in the rtnetlink_rcv_msg function, if bulk delete is not supported, module_put is not called to release the reference counting. As a result, module reference count is leaked. Fixes: a6cec0bcd342 ("net: rtnetlink: add bulk delete support flag") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20220815024629.240367-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-16net: allow gro_max_size to exceed 65536Alexander Duyck1-8/+0
Allow the gro_max_size to exceed a value larger than 65536. There weren't really any external limitations that prevented this other than the fact that IPv4 only supports a 16 bit length field. Since we have the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to exceed this value and for IPv4 and non-TCP flows we can cap things at 65536 via a constant rather than relying on gro_max_size. [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: allow gso_max_size to exceed 65536Alexander Duyck1-1/+1
The code for gso_max_size was added originally to allow for debugging and workaround of buggy devices that couldn't support TSO with blocks 64K in size. The original reason for limiting it to 64K was because that was the existing limits of IPv4 and non-jumbogram IPv6 length fields. With the addition of Big TCP we can remove this limit and allow the value to potentially go up to UINT_MAX and instead be limited by the tso_max_size value. So in order to support this we need to go through and clean up the remaining users of the gso_max_size value so that the values will cap at 64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper limit for GSO_MAX_SIZE. v6: (edumazet) fixed a compile error if CONFIG_IPV6=n, in a new sk_trim_gso_size() helper. netif_set_tso_max_size() caps the requested TSO size with GSO_MAX_SIZE. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: add IFLA_TSO_{MAX_SIZE|SEGS} attributesEric Dumazet1-0/+6
New netlink attributes IFLA_TSO_MAX_SIZE and IFLA_TSO_MAX_SEGS are used to report to user-space the device TSO limits. ip -d link sh dev eth1 ... tso_max_size 65536 tso_max_segs 65535 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-12rtnetlink: verify rate parameters for calls to ndo_set_vf_rateBin Chen1-10/+18
When calling ndo_set_vf_rate() the max_tx_rate parameter may be zero, in which case the setting is cleared, or it must be greater or equal to min_tx_rate. Enforce this requirement on all calls to ndo_set_vf_rate via a wrapper which also only calls ndo_set_vf_rate() if defined by the driver. Based on work by Jakub Kicinski <kuba@kernel.org> Signed-off-by: Bin Chen <bin.chen@corigine.com> Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-09rtnetlink: add extack support in fdb del handlersAlaa Mohamed1-2/+2
Add extack support to .ndo_fdb_del in netdevice.h and all related methods. Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-06net: don't allow user space to lift the device limitsJakub Kicinski1-2/+2
Up until commit 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device creation") the gso_max_segs and gso_max_size of a device were not controlled from user space. The quoted commit added the ability to control them because of the following setup: netns A | netns B veth<->veth eth0 If eth0 has TSO limitations and user wants to efficiently forward traffic between eth0 and the veths they should copy the TSO limitations of eth0 onto the veths. This would happen automatically for macvlans or ipvlan but veth users are not so lucky (given the loose coupling). Unfortunately the commit in question allowed users to also override the limits on real HW devices. It may be useful to control the max GSO size and someone may be using that ability (not that I know of any user), so create a separate set of knobs to reliably record the TSO limitations. Validate the user requests. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-02rtnl: move rtnl_newlink_create()Jakub Kicinski1-91/+86
Pure code move. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-02rtnl: split __rtnl_newlink() into two functionsJakub Kicinski1-3/+20
__rtnl_newlink() is 250LoC, but has a few clear sections. Move the part which creates a new netdev to a separate function. For ease of review code will be moved in the next change. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-02rtnl: allocate more attr tables on the heapJakub Kicinski1-12/+18
Commit a293974590cf ("rtnetlink: avoid frame size warning in rtnl_newlink()") moved to allocating the largest attribute array of rtnl_newlink() on the heap. Kalle reports the stack has grown above 1k again: net/core/rtnetlink.c:3557:1: error: the frame size of 1104 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] Move more attrs to the heap, wrap them in a struct. Don't bother with linkinfo, it's referenced a lot and we take its size so it's awkward to move, plus it's small (6 elements). Reported-by: Kalle Valo <kvalo@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Kalle Valo <kvalo@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-22Revert "rtnetlink: return EINVAL when request cannot succeed"Florent Fourcot1-1/+1
This reverts commit b6177d3240a4 ip-link command is testing kernel capability by sending a RTM_NEWLINK request, without any argument. It accepts everything in reply, except EOPNOTSUPP and EINVAL (functions iplink_have_newlink / accept_msg) So we must keep compatiblity here, invalid empty message should not return EINVAL Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Tested-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-19rtnetlink: return EINVAL when request cannot succeedFlorent Fourcot1-1/+1
A request without interface name/interface index/interface group cannot work. We should return EINVAL Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-19rtnetlink: return ENODEV when IFLA_ALT_IFNAME is used in dellinkFlorent Fourcot1-1/+1
If IFLA_ALT_IFNAME is set and given interface is not found, we should return ENODEV and be consistent with IFLA_IFNAME behaviour This commit extends feature of commit 76c9ac0ee878, "net: rtnetlink: add possibility to use alternative names as message handle" CC: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-19rtnetlink: enable alt_ifname for setlink/newlinkFlorent Fourcot1-40/+29
buffer called "ifname" given in function rtnl_dev_get is always valid when called by setlink/newlink, but contains only empty string when IFLA_IFNAME is not given. So IFLA_ALT_IFNAME is always ignored This patch fixes rtnl_dev_get function with a remove of ifname argument, and move ifname copy in do_setlink when required. It extends feature of commit 76c9ac0ee878, "net: rtnetlink: add possibility to use alternative names as message handle"" CC: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-19rtnetlink: return ENODEV when ifname does not exist and group is givenFlorent Fourcot1-4/+14
When the interface does not exist, and a group is given, the given parameters are being set to all interfaces of the given group. The given IFNAME/ALT_IF_NAME are being ignored in that case. That can be dangerous since a typo (or a deleted interface) can produce weird side effects for caller: Case 1: IFLA_IFNAME=valid_interface IFLA_GROUP=1 MTU=1234 Case 1 will update MTU and group of the given interface "valid_interface". Case 2: IFLA_IFNAME=doesnotexist IFLA_GROUP=1 MTU=1234 Case 2 will update MTU of all interfaces in group 1. IFLA_IFNAME is ignored in this case This behaviour is not consistent and dangerous. In order to fix this issue, we now return ENODEV when the given IFNAME does not exist. Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni1-0/+3
2022-04-14rtnetlink: Fix handling of disabled L3 stats in RTM_GETSTATS repliesPetr Machata1-0/+3
When L3 stats are disabled, rtnl_offload_xstats_get_size_stats() returns size of 0, which is supposed to be an indication that the corresponding attribute should not be emitted. However, instead, the current code reserves a 0-byte attribute. The reason this does not show up as a citation on a kasan kernel is that netdev_offload_xstats_get(), which is supposed to fill in the data, never ends up getting called, because rtnl_offload_xstats_get_stats() notices that the stats are not actually used and skips the call. Thus a zero-length IFLA_OFFLOAD_XSTATS_L3_STATS attribute ends up in a response, confusing the userspace. Fix by skipping the L3-stats related block in rtnl_offload_xstats_fill(). Fixes: 0e7788fd7622 ("net: rtnetlink: Add UAPI for obtaining L3 offload xstats") Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/591b58e7623edc3eb66dd1fcfa8c8f133d090974.1649794741.git.petrm@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-13net: rtnetlink: add ndm flags and state mask attributesNikolay Aleksandrov1-0/+2
Add ndm flags/state masks which will be used for bulk delete filtering. All of these are used by the bridge and vxlan drivers. Also minimal attr policy validation is added, it is up to ndo_fdb_del_bulk implementers to further validate them. Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-13net: rtnetlink: add NLM_F_BULK support to rtnl_fdb_delNikolay Aleksandrov1-19/+48
When NLM_F_BULK is specified in a fdb del message we need to handle it differently. First since this is a new call we can strictly validate the passed attributes, at first only ifindex and vlan are allowed as these will be the initially supported filter attributes, any other attribute is rejected. The mac address is no longer mandatory, but we use it to error out in older kernels because it cannot be specified with bulk request (the attribute is not allowed) and then we have to dispatch the call to ndo_fdb_del_bulk if the device supports it. The del bulk callback can do further validation of the attributes if necessary. Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-13net: rtnetlink: add bulk delete support flagNikolay Aleksandrov1-0/+8
Add a new rtnl flag (RTNL_FLAG_BULK_DEL_SUPPORTED) which is used to verify that the delete operation allows bulk object deletion. Also emit a warning if anyone tries to set it for non-delete kind. Suggested-by: David Ahern <dsahern@kernel.org> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-13net: rtnetlink: add helper to extract msg type's kindNikolay Aleksandrov1-1/+1
Add a helper which extracts the msg type's kind using the kind mask (0x3). Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-13net: rtnetlink: add msg kind namesNikolay Aleksandrov1-3/+3
Add rtnl kind names instead of using raw values. We'll need to check for DEL kind later to validate bulk flag support. Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-08net: extract a few internals from netdevice.hJakub Kicinski1-0/+2
There's a number of functions and static variables used under net/core/ but not from the outside. We currently dump most of them into netdevice.h. That bad for many reasons: - netdevice.h is very cluttered, hard to figure out what the APIs are; - netdevice.h is very long; - we have to touch netdevice.h more which causes expensive incremental builds. Create a header under net/core/ and move some declarations. The new header is also a bit of a catch-all but that's fine, if we create more specific headers people will likely over-think where their declaration fit best. And end up putting them in netdevice.h, again. More work should be done on splitting netdevice.h into more targeted headers, but that'd be more time consuming so small steps. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-06net: ensure net_todo_list is processed quicklyJohannes Berg1-0/+33
In [1], Will raised a potential issue that the cfg80211 code, which does (from a locking perspective) rtnl_lock() wiphy_lock() rtnl_unlock() might be suspectible to ABBA deadlocks, because rtnl_unlock() calls netdev_run_todo(), which might end up calling rtnl_lock() again, which could then deadlock (see the comment in the code added here for the scenario). Some back and forth and thinking ensued, but clearly this can't happen if the net_todo_list is empty at the rtnl_unlock() here. Clearly, the code here cannot actually put an entry on it, and all other users of rtnl_unlock() will empty it since that will always go through netdev_run_todo(), emptying the list. So the only other way to get there would be to add to the list and then unlock the RTNL without going through rtnl_unlock(), which is only possible through __rtnl_unlock(). However, this isn't exported and not used in many places, and none of them seem to be able to unregister before using it. Therefore, add a WARN_ON() in the code to ensure this invariant won't be broken, so that the cfg80211 (or any similar) code stays safe. [1] https://lore.kernel.org/r/Yjzpo3TfZxtKPMAG@google.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20220404113847.0ee02e4a70da.Ic73d206e217db20fd22dcec14fe5442ca732804b@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-11net: limit altnames to 64k totalJakub Kicinski1-0/+11
Property list (altname is a link "property") is wrapped in a nlattr. nlattrs length is 16bit so practically speaking the list of properties can't be longer than that, otherwise user space would have to interpret broken netlink messages. Prevent the problem from occurring by checking the length of the property list before adding new entries. Reported-by: George Shuklin <george.shuklin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-11net: account alternate interface name memoryJakub Kicinski1-1/+1
George reports that altnames can eat up kernel memory. We should charge that memory appropriately. Reported-by: George Shuklin <george.shuklin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-07net: rtnetlink: fix error handling in rtnl_fill_statsinfo()Tom Rix1-1/+1
The clang static analyzer reports this issue rtnetlink.c:5481:2: warning: Undefined or garbage value returned to caller return err; ^~~~~~~~~~ There is a function level err variable, in the list_for_each_entry_rcu block there is a shadow err. Remove the shadow. In the same block, the call to nla_nest_start_noflag() can fail without setting an err. Set the err to -EMSGSIZE. Fixes: 216e690631f5 ("net: rtnetlink: rtnl_fill_statsinfo(): Permit non-EMSGSIZE error returns") Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>