summaryrefslogtreecommitdiff
path: root/net/bridge
AgeCommit message (Collapse)AuthorFilesLines
2022-12-13bridge: mcast: Add a centralized error pathIdo Schimmel1-4/+6
Subsequent patches will add memory allocations in br_mdb_config_init() as the MDB configuration structure will include a linked list of source entries. This memory will need to be freed regardless if br_mdb_add() succeeded or failed. As a preparation for this change, add a centralized error path where the memory will be freed. Note that br_mdb_del() already has one error path and therefore does not require any changes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-13bridge: mcast: Place netlink policy before validation functionsIdo Schimmel1-6/+6
Subsequent patches are going to add additional validation functions and netlink policies. Some of these functions will need to perform parsing using nla_parse_nested() and the new policies. In order to keep all the policies next to each other, move the current policy to before the validation functions. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-13bridge: mcast: Split (*, G) and (S, G) addition into different functionsIdo Schimmel1-49/+96
When the bridge is using IGMP version 3 or MLD version 2, it handles the addition of (*, G) and (S, G) entries differently. When a new (S, G) port group entry is added, all the (*, G) EXCLUDE ports need to be added to the port group of the new entry. Similarly, when a new (*, G) EXCLUDE port group entry is added, the port needs to be added to the port group of all the matching (S, G) entries. Subsequent patches will create more differences between both entry types. Namely, filter mode and source list can only be specified for (*, G) entries. Given the current and future differences between both entry types, handle the addition of each entry type in a different function, thereby avoiding the creation of one complex function. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-13bridge: mcast: Do not derive entry type from its filter modeIdo Schimmel1-6/+3
Currently, the filter mode (i.e., INCLUDE / EXCLUDE) of MDB entries cannot be set from user space. Instead, it is set by the kernel according to the entry type: (*, G) entries are treated as EXCLUDE and (S, G) entries are treated as INCLUDE. This allows the kernel to derive the entry type from its filter mode. Subsequent patches will allow user space to set the filter mode of (*, G) entries, making the current assumption incorrect. As a preparation, remove the current assumption and instead determine the entry type from its key, which is a more direct way. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski1-31/+1
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next 1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter. 2) Add DATA_SENT state to SCTP connection tracking helper, from Sriram Yagnaraman. 3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal. 4) Add bitmask support for ipset, from Vishwanath Pai. 5) Handle icmpv6 redirects as RELATED, from Florian Westphal. 6) Add WARN_ON_ONCE() to impossible case in flowtable datapath, from Li Qiong. 7) A large batch of IPVS updates to replace timer-based estimators by kthreads to scale up wrt. CPUs and workload (millions of estimators). Julian Anastasov says: This patchset implements stats estimation in kthread context. It replaces the code that runs on single CPU in timer context every 2 seconds and causing latency splats as shown in reports [1], [2], [3]. The solution targets setups with thousands of IPVS services, destinations and multi-CPU boxes. Spread the estimation on multiple (configured) CPUs and multiple time slots (timer ticks) by using multiple chains organized under RCU rules. When stats are not needed, it is recommended to use run_estimation=0 as already implemented before this change. RCU Locking: - As stats are now RCU-locked, tot_stats, svc and dest which hold estimator structures are now always freed from RCU callback. This ensures RCU grace period after the ip_vs_stop_estimator() call. Kthread data: - every kthread works over its own data structure and all such structures are attached to array. For now we limit kthreads depending on the number of CPUs. - even while there can be a kthread structure, its task may not be running, eg. before first service is added or while the sysctl var is set to an empty cpulist or when run_estimation is set to 0 to disable the estimation. - the allocated kthread context may grow from 1 to 50 allocated structures for timer ticks which saves memory for setups with small number of estimators - a task and its structure may be released if all estimators are unlinked from its chains, leaving the slot in the array empty - every kthread data structure allows limited number of estimators. Kthread 0 is also used to initially calculate the max number of estimators to allow in every chain considering a sub-100 microsecond cond_resched rate. This number can be from 1 to hundreds. - kthread 0 has an additional job of optimizing the adding of estimators: they are first added in temp list (est_temp_list) and later kthread 0 distributes them to other kthreads. The optimization is based on the fact that newly added estimator should be estimated after 2 seconds, so we have the time to offload the adding to chain from controlling process to kthread 0. - to add new estimators we use the last added kthread context (est_add_ktid). The new estimators are linked to the chains just before the estimated one, based on add_row. This ensures their estimation will start after 2 seconds. If estimators are added in bursts, common case if all services and dests are initially configured, we may spread the estimators to more chains and as result, reducing the initial delay below 2 seconds. Many thanks to Jiri Wiesner for his valuable comments and for spending a lot of time reviewing and testing the changes on different platforms with 48-256 CPUs and 1-8 NUMA nodes under different cpufreq governors. The new IPVS estimators do not use workqueue infrastructure because: - The estimation can take long time when using multiple IPVS rules (eg. millions estimator structures) and especially when box has multiple CPUs due to the for_each_possible_cpu usage that expects packets from any CPU. With est_nice sysctl we have more control how to prioritize the estimation kthreads compared to other processes/kthreads that have latency requirements (such as servers). As a benefit, we can see these kthreads in top and decide if we will need some further control to limit their CPU usage (max number of structure to estimate per kthread). - with kthreads we run code that is read-mostly, no write/lock operations to process the estimators in 2-second intervals. - work items are one-shot: as estimators are processed every 2 seconds, they need to be re-added every time. This again loads the timers (add_timer) if we use delayed works, as there are no kthreads to do the timings. [1] Report from Yunhong Jiang: https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/ [2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2 [3] Report from Dust: https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: ipvs: run_estimation should control the kthread tasks ipvs: add est_cpulist and est_nice sysctl vars ipvs: use kthreads for stats estimation ipvs: use u64_stats_t for the per-cpu counters ipvs: use common functions for stats allocation ipvs: add rcu protection to stats netfilter: flowtable: add a 'default' case to flowtable datapath netfilter: conntrack: set icmpv6 redirects as RELATED netfilter: ipset: Add support for new bitmask parameter netfilter: conntrack: merge ipv4+ipv6 confirm functions netfilter: conntrack: add sctp DATA_SENT state netfilter: nft_inner: fix IS_ERR() vs NULL check ==================== Link: https://lore.kernel.org/r/20221211101204.1751-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08bridge: mcast: Constify 'group' argument in br_multicast_new_port_group()Ido Schimmel2-2/+3
The 'group' argument is not modified, so mark it as 'const'. It will allow us to constify arguments of the callers of this function in future patches. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08bridge: mcast: Remove redundant function argumentsIdo Schimmel1-4/+5
Drop the first three arguments and instead extract them from the MDB configuration structure. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08bridge: mcast: Move checks out of critical sectionIdo Schimmel1-18/+18
The checks only require information parsed from the RTM_NEWMDB netlink message and do not rely on any state stored in the bridge driver. Therefore, there is no need to perform the checks in the critical section under the multicast lock. Move the checks out of the critical section. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08bridge: mcast: Remove br_mdb_parse()Ido Schimmel1-88/+5
The parsing of the netlink messages and the validity checks are now performed in br_mdb_config_init() so we can remove br_mdb_parse(). This finally allows us to stop passing netlink attributes deep in the MDB control path and only use the MDB configuration structure. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08bridge: mcast: Use MDB group key from configuration structureIdo Schimmel1-8/+7
The MDB group key (i.e., {source, destination, protocol, VID}) is currently determined under the multicast lock from the netlink attributes. Instead, use the group key from the MDB configuration structure that was prepared before acquiring the lock. No functional changes intended. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08bridge: mcast: Propagate MDB configuration structure furtherIdo Schimmel1-13/+11
As an intermediate step towards only using the new MDB configuration structure, pass it further in the control path instead of passing individual attributes. No functional changes intended. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08bridge: mcast: Use MDB configuration structure where possibleIdo Schimmel1-19/+15
The MDB configuration structure (i.e., struct br_mdb_config) now includes all the necessary information from the parsed RTM_{NEW,DEL}MDB netlink messages, so use it. This will later allow us to delete the calls to br_mdb_parse() from br_mdb_add() and br_mdb_del(). No functional changes intended. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08bridge: mcast: Remove redundant checksIdo Schimmel1-54/+9
These checks are now redundant as they are performed by br_mdb_config_init() while parsing the RTM_{NEW,DEL}MDB messages. Remove them. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08bridge: mcast: Centralize netlink attribute parsingIdo Schimmel2-0/+127
Netlink attributes are currently passed deep in the MDB creation call chain, making it difficult to add new attributes. In addition, some validity checks are performed under the multicast lock although they can be performed before it is ever acquired. As a first step towards solving these issues, parse the RTM_{NEW,DEL}MDB messages into a configuration structure, relieving other functions from the need to handle raw netlink attributes. Subsequent patches will convert the MDB code to use this configuration structure. This is consistent with how other rtnetlink objects are handled, such as routes and nexthops. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-30netfilter: conntrack: merge ipv4+ipv6 confirm functionsFlorian Westphal1-31/+1
No need to have distinct functions. After merge, ipv6 can avoid protooff computation if the connection neither needs sequence adjustment nor helper invocation -- this is the normal case. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-22kobject: make kobject_get_ownership() take a constant kobject *Greg Kroah-Hartman1-1/+1
The call, kobject_get_ownership(), does not modify the kobject passed into it, so make it const. This propagates down into the kobj_type function callbacks so make the kobject passed into them also const, ensuring that nothing in the kobject is being changed here. This helps make it more obvious what calls and callbacks do, and do not, modify structures passed to them. Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Roopa Prabhu <roopa@nvidia.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: linux-nfs@vger.kernel.org Cc: bridge@lists.linux-foundation.org Cc: netdev@vger.kernel.org Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20221121094649.1556002-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-3/+14
include/linux/bpf.h 1f6e04a1c7b8 ("bpf: Fix offset calculation error in __copy_map_value and zero_map_value") aa3496accc41 ("bpf: Refactor kptr_off_tab into btf_record") f71b2f64177a ("bpf: Refactor map->off_arr handling") https://lore.kernel.org/all/20221114095000.67a73239@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-15bridge: switchdev: Fix memory leaks when changing VLAN protocolIdo Schimmel1-3/+14
The bridge driver can offload VLANs to the underlying hardware either via switchdev or the 8021q driver. When the former is used, the VLAN is marked in the bridge driver with the 'BR_VLFLAG_ADDED_BY_SWITCHDEV' private flag. To avoid the memory leaks mentioned in the cited commit, the bridge driver will try to delete a VLAN via the 8021q driver if the VLAN is not marked with the previously mentioned flag. When the VLAN protocol of the bridge changes, switchdev drivers are notified via the 'SWITCHDEV_ATTR_ID_BRIDGE_VLAN_PROTOCOL' attribute, but the 8021q driver is also called to add the existing VLANs with the new protocol and delete them with the old protocol. In case the VLANs were offloaded via switchdev, the above behavior is both redundant and buggy. Redundant because the VLANs are already programmed in hardware and drivers that support VLAN protocol change (currently only mlx5) change the protocol upon the switchdev attribute notification. Buggy because the 8021q driver is called despite these VLANs being marked with 'BR_VLFLAG_ADDED_BY_SWITCHDEV'. This leads to memory leaks [1] when the VLANs are deleted. Fix by not calling the 8021q driver for VLANs that were already programmed via switchdev. [1] unreferenced object 0xffff8881f6771200 (size 256): comm "ip", pid 446855, jiffies 4298238841 (age 55.240s) hex dump (first 32 bytes): 00 00 7f 0e 83 88 ff ff 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000012819ac>] vlan_vid_add+0x437/0x750 [<00000000f2281fad>] __br_vlan_set_proto+0x289/0x920 [<000000000632b56f>] br_changelink+0x3d6/0x13f0 [<0000000089d25f04>] __rtnl_newlink+0x8ae/0x14c0 [<00000000f6276baf>] rtnl_newlink+0x5f/0x90 [<00000000746dc902>] rtnetlink_rcv_msg+0x336/0xa00 [<000000001c2241c0>] netlink_rcv_skb+0x11d/0x340 [<0000000010588814>] netlink_unicast+0x438/0x710 [<00000000e1a4cd5c>] netlink_sendmsg+0x788/0xc40 [<00000000e8992d4e>] sock_sendmsg+0xb0/0xe0 [<00000000621b8f91>] ____sys_sendmsg+0x4ff/0x6d0 [<000000000ea26996>] ___sys_sendmsg+0x12e/0x1b0 [<00000000684f7e25>] __sys_sendmsg+0xab/0x130 [<000000004538b104>] do_syscall_64+0x3d/0x90 [<0000000091ed9678>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: 279737939a81 ("net: bridge: Fix VLANs memory leak") Reported-by: Vlad Buslov <vladbu@nvidia.com> Tested-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20221114084509.860831-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-12bridge: Add missing parenthesesIdo Schimmel1-1/+1
No changes in generated code. Reported-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20221110085422.521059-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-10bridge: switchdev: Reflect MAB bridge port flag to device driversIdo Schimmel1-1/+1
Reflect the 'BR_PORT_MAB' flag to device drivers so that: * Drivers that support MAB could act upon the flag being toggled. * Drivers that do not support MAB will prevent MAB from being enabled. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-10bridge: switchdev: Allow device drivers to install locked FDB entriesHans J. Schultz4-4/+27
When the bridge is offloaded to hardware, FDB entries are learned and aged-out by the hardware. Some device drivers synchronize the hardware and software FDBs by generating switchdev events towards the bridge. When a port is locked, the hardware must not learn autonomously, as otherwise any host will blindly gain authorization. Instead, the hardware should generate events regarding hosts that are trying to gain authorization and their MAC addresses should be notified by the device driver as locked FDB entries towards the bridge driver. Allow device drivers to notify the bridge driver about such entries by extending the 'switchdev_notifier_fdb_info' structure with the 'locked' bit. The bit can only be set by device drivers and not by the bridge driver. Prevent a locked entry from being installed if MAB is not enabled on the bridge port. If an entry already exists in the bridge driver, reject the locked entry if the current entry does not have the "locked" flag set or if it points to a different port. The same semantics are implemented in the software data path. Signed-off-by: Hans J. Schultz <netdev@kapio-technology.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-10bridge: switchdev: Let device drivers determine FDB offload indicationIdo Schimmel1-1/+1
Currently, FDB entries that are notified to the bridge via 'SWITCHDEV_FDB_ADD_TO_BRIDGE' are always marked as offloaded. With MAB enabled, this will no longer be universally true. Device drivers will report locked FDB entries to the bridge to let it know that the corresponding hosts required authorization, but it does not mean that these entries are necessarily programmed in the underlying hardware. Solve this by determining the offload indication based of the 'offloaded' bit in the FDB notification. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-04bridge: Add MAC Authentication Bypass (MAB) supportHans J. Schultz4-4/+65
Hosts that support 802.1X authentication are able to authenticate themselves by exchanging EAPOL frames with an authenticator (Ethernet bridge, in this case) and an authentication server. Access to the network is only granted by the authenticator to successfully authenticated hosts. The above is implemented in the bridge using the "locked" bridge port option. When enabled, link-local frames (e.g., EAPOL) can be locally received by the bridge, but all other frames are dropped unless the host is authenticated. That is, unless the user space control plane installed an FDB entry according to which the source address of the frame is located behind the locked ingress port. The entry can be dynamic, in which case learning needs to be enabled so that the entry will be refreshed by incoming traffic. There are deployments in which not all the devices connected to the authenticator (the bridge) support 802.1X. Such devices can include printers and cameras. One option to support such deployments is to unlock the bridge ports connecting these devices, but a slightly more secure option is to use MAB. When MAB is enabled, the MAC address of the connected device is used as the user name and password for the authentication. For MAB to work, the user space control plane needs to be notified about MAC addresses that are trying to gain access so that they will be compared against an allow list. This can be implemented via the regular learning process with the sole difference that learned FDB entries are installed with a new "locked" flag indicating that the entry cannot be used to authenticate the device. The flag cannot be set by user space, but user space can clear the flag by replacing the entry, thereby authenticating the device. Locked FDB entries implement the following semantics with regards to roaming, aging and forwarding: 1. Roaming: Locked FDB entries can roam to unlocked (authorized) ports, in which case the "locked" flag is cleared. FDB entries cannot roam to locked ports regardless of MAB being enabled or not. Therefore, locked FDB entries are only created if an FDB entry with the given {MAC, VID} does not already exist. This behavior prevents unauthenticated devices from disrupting traffic destined to already authenticated devices. 2. Aging: Locked FDB entries age and refresh by incoming traffic like regular entries. 3. Forwarding: Locked FDB entries forward traffic like regular entries. If user space detects an unauthorized MAC behind a locked port and wishes to prevent traffic with this MAC DA from reaching the host, it can do so using tc or a different mechanism. Enable the above behavior using a new bridge port option called "mab". It can only be enabled on a bridge port that is both locked and has learning enabled. Locked FDB entries are flushed from the port once MAB is disabled. A new option is added because there are pure 802.1X deployments that are not interested in notifications about locked FDB entries. Signed-off-by: Hans J. Schultz <netdev@kapio-technology.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-2/+2
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-03bridge: Fix flushing of dynamic FDB entriesIdo Schimmel2-2/+2
The following commands should result in all the dynamic FDB entries being flushed, but instead all the non-local (non-permanent) entries are flushed: # bridge fdb add 00:aa:bb:cc:dd:ee dev dummy1 master static # bridge fdb add 00:11:22:33:44:55 dev dummy1 master dynamic # ip link set dev br0 type bridge fdb_flush # bridge fdb show brport dummy1 00:00:00:00:00:01 master br0 permanent 33:33:00:00:00:01 self permanent 01:00:5e:00:00:01 self permanent This is because br_fdb_flush() works with FDB flags and not the corresponding enumerator values. Fix by passing the FDB flag instead. After the fix: # bridge fdb add 00:aa:bb:cc:dd:ee dev dummy1 master static # bridge fdb add 00:11:22:33:44:55 dev dummy1 master dynamic # ip link set dev br0 type bridge fdb_flush # bridge fdb show brport dummy1 00:aa:bb:cc:dd:ee master br0 static 00:00:00:00:00:01 master br0 permanent 33:33:00:00:00:01 self permanent 01:00:5e:00:00:01 self permanent Fixes: 1f78ee14eeac ("net: bridge: fdb: add support for fine-grained flushing") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20221101185753.2120691-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-29net: Remove the obsolte u64_stats_fetch_*_irq() users (net).Thomas Gleixner2-4/+4
Now that the 32bit UP oddity is gone and 32bit uses always a sequence count, there is no need for the fetch_irq() variants anymore. Convert to the regular interface. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-19bridge: mcast: Simplify MDB entry creationIdo Schimmel1-8/+3
Before creating a new MDB entry, br_multicast_new_group() will call br_mdb_ip_get() to see if one exists and return it if so. Therefore, simply call br_multicast_new_group() and omit the call to br_mdb_ip_get(). Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-19bridge: mcast: Use spin_lock() instead of spin_lock_bh()Ido Schimmel1-4/+4
IGMPv3 / MLDv2 Membership Reports are only processed from the data path with softIRQ disabled, so there is no need to call spin_lock_bh(). Use spin_lock() instead. This is consistent with how other IGMP / MLD packets are processed. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net: bridge: assign path_cost for 2.5G and 5G link speedSteven Hsieh1-1/+10
As 2.5G, 5G ethernet ports are more common and affordable, these ports are being used in LAN bridge devices. STP port_cost() is missing path_cost assignment for these link speeds, causes highest cost 100 being used. This result in lower speed port being picked when there is loop between 5G and 1G ports. Original path_cost: 10G=2, 1G=4, 100m=19, 10m=100 Adjusted path_cost: 10G=2, 5G=3, 2.5G=4, 1G=5, 100m=19, 10m=100 speed greater than 10G = 1 Signed-off-by: Steven Hsieh <steven.hsieh@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+3
drivers/net/ethernet/freescale/fec.h 7b15515fc1ca ("Revert "fec: Restart PPS after link state change"") 40c79ce13b03 ("net: fec: add stop mode support for imx8 platform") https://lore.kernel.org/all/20220921105337.62b41047@canb.auug.org.au/ drivers/pinctrl/pinctrl-ocelot.c c297561bc98a ("pinctrl: ocelot: Fix interrupt controller") 181f604b33cd ("pinctrl: ocelot: add ability to be used in a non-mmio configuration") https://lore.kernel.org/all/20220921110032.7cd28114@canb.auug.org.au/ tools/testing/selftests/drivers/net/bonding/Makefile bbb774d921e2 ("net: Add tests for bonding and team address list management") 152e8ec77640 ("selftests/bonding: add a test for bonding lladdr target") https://lore.kernel.org/all/20220921110437.5b7dbd82@canb.auug.org.au/ drivers/net/can/usb/gs_usb.c 5440428b3da6 ("can: gs_usb: gs_can_open(): fix race dev->can.state condition") 45dfa45f52e6 ("can: gs_usb: add RX and TX hardware timestamp support") https://lore.kernel.org/all/84f45a7d-92b6-4dc5-d7a1-072152fab6ff@tessares.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-21netfilter: ebtables: fix memory leak when blob is malformedFlorian Westphal1-1/+3
The bug fix was incomplete, it "replaced" crash with a memory leak. The old code had an assignment to "ret" embedded into the conditional, restore this. Fixes: 7997eff82828 ("netfilter: ebtables: reject blobs that don't provide all entry points") Reported-and-tested-by: syzbot+a24c5252f3e3ab733464@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de>
2022-09-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni2-0/+3
drivers/net/ethernet/freescale/fec.h 7d650df99d52 ("net: fec: add pm_qos support on imx6q platform") 40c79ce13b03 ("net: fec: add stop mode support for imx8 platform") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-31netfilter: br_netfilter: Drop dst references before setting.Harsh Modi2-0/+3
The IPv6 path already drops dst in the daddr changed case, but the IPv4 path does not. This change makes the two code paths consistent. Further, it is possible that there is already a metadata_dst allocated from ingress that might already be attached to skbuff->dst while following the bridge path. If it is not released before setting a new metadata_dst, it will be leaked. This is similar to what is done in bpf_set_tunnel_key() or ip6_route_input(). It is important to note that the memory being leaked is not the dst being set in the bridge code, but rather memory allocated from some other code path that is not being freed correctly before the skb dst is overwritten. An example of the leakage fixed by this commit found using kmemleak: unreferenced object 0xffff888010112b00 (size 256): comm "softirq", pid 0, jiffies 4294762496 (age 32.012s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 80 16 f1 83 ff ff ff ff ................ e1 4e f6 82 ff ff ff ff 00 00 00 00 00 00 00 00 .N.............. backtrace: [<00000000d79567ea>] metadata_dst_alloc+0x1b/0xe0 [<00000000be113e13>] udp_tun_rx_dst+0x174/0x1f0 [<00000000a36848f4>] geneve_udp_encap_recv+0x350/0x7b0 [<00000000d4afb476>] udp_queue_rcv_one_skb+0x380/0x560 [<00000000ac064aea>] udp_unicast_rcv_skb+0x75/0x90 [<000000009a8ee8c5>] ip_protocol_deliver_rcu+0xd8/0x230 [<00000000ef4980bb>] ip_local_deliver_finish+0x7a/0xa0 [<00000000d7533c8c>] __netif_receive_skb_one_core+0x89/0xa0 [<00000000a879497d>] process_backlog+0x93/0x190 [<00000000e41ade9f>] __napi_poll+0x28/0x170 [<00000000b4c0906b>] net_rx_action+0x14f/0x2a0 [<00000000b20dd5d4>] __do_softirq+0xf4/0x305 [<000000003a7d7e15>] __irq_exit_rcu+0xc3/0x140 [<00000000968d39a2>] sysvec_apic_timer_interrupt+0x9e/0xc0 [<000000009e920794>] asm_sysvec_apic_timer_interrupt+0x16/0x20 [<000000008942add0>] native_safe_halt+0x13/0x20 Florian Westphal says: "Original code was likely fine because nothing ever did set a skb->dst entry earlier than bridge in those days." Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Harsh Modi <harshmodi@google.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-31/+1
drivers/net/ethernet/mellanox/mlx5/core/en_fs.c 21234e3a84c7 ("net/mlx5e: Fix use after free in mlx5e_fs_init()") c7eafc5ed068 ("net/mlx5e: Convert ethtool_steering member of flow_steering struct to pointer") https://lore.kernel.org/all/20220825104410.67d4709c@canb.auug.org.au/ https://lore.kernel.org/all/20220823055533.334471-1-saeed@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23netfilter: ebtables: reject blobs that don't provide all entry pointsFlorian Westphal4-31/+1
Harshit Mogalapalli says: In ebt_do_table() function dereferencing 'private->hook_entry[hook]' can lead to NULL pointer dereference. [..] Kernel panic: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f] [..] RIP: 0010:ebt_do_table+0x1dc/0x1ce0 Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 5c 16 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 6c df 08 48 8d 7d 2c 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 88 [..] Call Trace: nf_hook_slow+0xb1/0x170 __br_forward+0x289/0x730 maybe_deliver+0x24b/0x380 br_flood+0xc6/0x390 br_dev_xmit+0xa2e/0x12c0 For some reason ebtables rejects blobs that provide entry points that are not supported by the table, but what it should instead reject is the opposite: blobs that DO NOT provide an entry point supported by the table. t->valid_hooks is the bitmask of hooks (input, forward ...) that will see packets. Providing an entry point that is not support is harmless (never called/used), but the inverse isn't: it results in a crash because the ebtables traverser doesn't expect a NULL blob for a location its receiving packets for. Instead of fixing all the individual checks, do what iptables is doing and reject all blobs that differ from the expected hooks. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-08-23net: bridge: move DSA master bridging restriction to DSAVladimir Oltean1-20/+0
When DSA gains support for multiple CPU ports in a LAG, it will become mandatory to monitor the changeupper events for the DSA master. In fact, there are already some restrictions to be imposed in that area, namely that a DSA master cannot be a bridge port except in some special circumstances. Centralize the restrictions at the level of the DSA layer as a preliminary step. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23bridge: move from strlcpy with unused retval to strscpyWolfram Sang3-7/+7
Follow the advice of the below link and prefer 'strscpy' in this subsystem. Conversion is 1:1 because the return value is not used. Generated by a coccinelle script. Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/ Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20220818210212.8347-1-wsa+renesas@sang-engineering.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+6
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-26bridge: Do not send empty IFLA_AF_SPEC attributeBenjamin Poirier1-2/+6
After commit b6c02ef54913 ("bridge: Netlink interface fix."), br_fill_ifinfo() started to send an empty IFLA_AF_SPEC attribute when a bridge vlan dump is requested but an interface does not have any vlans configured. iproute2 ignores such an empty attribute since commit b262a9becbcb ("bridge: Fix output with empty vlan lists") but older iproute2 versions as well as other utilities have their output changed by the cited kernel commit, resulting in failed test cases. Regardless, emitting an empty attribute is pointless and inefficient. Avoid this change by canceling the attribute if no AF_SPEC data was added. Fixes: b6c02ef54913 ("bridge: Netlink interface fix.") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20220725001236.95062-1-bpoirier@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-07-11netfilter: nf_tables: add and use BE register load-store helpersFlorian Westphal1-1/+1
Same as the existing ones, no conversions. This is just for sparse sake only so that we no longer mix be16/u16 and be32/u32 types. Alternative is to add __force __beX in various places, but this seems nicer. objdiff shows no changes. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-07-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-3/+18
drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c 9c5de246c1db ("net: sparx5: mdb add/del handle non-sparx5 devices") fbb89d02e33a ("net: sparx5: Allow mdb entries to both CPU and ports") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-27netfilter: br_netfilter: do not skip all hooks with 0 priorityFlorian Westphal1-3/+18
When br_netfilter module is loaded, skbs may be diverted to the ipv4/ipv6 hooks, just like as if we were routing. Unfortunately, bridge filter hooks with priority 0 may be skipped in this case. Example: 1. an nftables bridge ruleset is loaded, with a prerouting hook that has priority 0. 2. interface is added to the bridge. 3. no tcp packet is ever seen by the bridge prerouting hook. 4. flush the ruleset 5. load the bridge ruleset again. 6. tcp packets are processed as expected. After 1) the only registered hook is the bridge prerouting hook, but its not called yet because the bridge hasn't been brought up yet. After 2), hook order is: 0 br_nf_pre_routing // br_netfilter internal hook 0 chain bridge f prerouting // nftables bridge ruleset The packet is diverted to br_nf_pre_routing. If call-iptables is off, the nftables bridge ruleset is called as expected. But if its enabled, br_nf_hook_thresh() will skip it because it assumes that all 0-priority hooks had been called previously in bridge context. To avoid this, check for the br_nf_pre_routing hook itself, we need to resume directly after it, even if this hook has a priority of 0. Unfortunately, this still results in different packet flow. With this fix, the eval order after in 3) is: 1. br_nf_pre_routing 2. ip(6)tables (if enabled) 3. nftables bridge but after 5 its the much saner: 1. nftables bridge 2. br_nf_pre_routing 3. ip(6)tables (if enabled) Unfortunately I don't see a solution here: It would be possible to move br_nf_pre_routing to a higher priority so that it will be called later in the pipeline, but this also impacts ebtables evaluation order, and would still result in this very ordering problem for all nftables-bridge hooks with the same priority as the br_nf_pre_routing one. Searching back through the git history I don't think this has ever behaved in any other way, hence, no fixes-tag. Reported-by: Radim Hrazdil <rhrazdil@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-06-15net: bridge: allow add/remove permanent mdb entries on disabled portsCasper Andersson1-6/+9
Adding mdb entries on disabled ports allows you to do setup before accepting any traffic, avoiding any time where the port is not in the multicast group. Signed-off-by: Casper Andersson <casper.casan@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-10net: adopt u64_stats_t in struct pcpu_sw_netstatsEric Dumazet2-20/+24
As explained in commit 316580b69d0a ("u64_stats: provide u64_stats_t type") we should use u64_stats_t and related accessors to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-10net: rename reference+tracking helpersJakub Kicinski1-5/+5
Netdev reference helpers have a dev_ prefix for historic reasons. Renaming the old helpers would be too much churn but we can rename the tracking ones which are relatively recent and should be the default for new code. Rename: dev_hold_track() -> netdev_hold() dev_put_track() -> netdev_put() dev_replace_track() -> netdev_ref_replace() Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+7
drivers/net/ethernet/mellanox/mlx5/core/main.c b33886971dbc ("net/mlx5: Initialize flow steering during driver probe") 40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support") f2b41b32cde8 ("net/mlx5: Remove ipsec_ops function table") https://lore.kernel.org/all/20220519040345.6yrjromcdistu7vh@sx1/ 16d42d313350 ("net/mlx5: Drain fw_reset when removing device") 8324a02c342a ("net/mlx5: Add exit route when waiting for FW") https://lore.kernel.org/all/20220519114119.060ce014@canb.auug.org.au/ tools/testing/selftests/net/mptcp/mptcp_join.sh e274f7154008 ("selftests: mptcp: add subflow limits test-cases") b6e074e171bc ("selftests: mptcp: add infinite map testcase") 5ac1d2d63451 ("selftests: mptcp: Add tests for userspace PM type") https://lore.kernel.org/all/20220516111918.366d747f@canb.auug.org.au/ net/mptcp/options.c ba2c89e0ea74 ("mptcp: fix checksum byte order") 1e39e5a32ad7 ("mptcp: infinite mapping sending") ea66758c1795 ("tcp: allow MPTCP to update the announced window") https://lore.kernel.org/all/20220519115146.751c3a37@canb.auug.org.au/ net/mptcp/pm.c 95d686517884 ("mptcp: fix subflow accounting on close") 4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs") https://lore.kernel.org/all/20220516111435.72f35dca@canb.auug.org.au/ net/mptcp/subflow.c ae66fb2ba6c3 ("mptcp: Do TCP fallback on early DSS checksum failure") 0348c690ed37 ("mptcp: add the fallback check") f8d4bcacff3b ("mptcp: infinite mapping receiving") https://lore.kernel.org/all/20220519115837.380bb8d4@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.Andrew Lunn1-0/+7
It is possible to stack bridges on top of each other. Consider the following which makes use of an Ethernet switch: br1 / \ / \ / \ br0.11 wlan0 | br0 / | \ p1 p2 p3 br0 is offloaded to the switch. Above br0 is a vlan interface, for vlan 11. This vlan interface is then a slave of br1. br1 also has a wireless interface as a slave. This setup trunks wireless lan traffic over the copper network inside a VLAN. A frame received on p1 which is passed up to the bridge has the skb->offload_fwd_mark flag set to true, indicating that the switch has dealt with forwarding the frame out ports p2 and p3 as needed. This flag instructs the software bridge it does not need to pass the frame back down again. However, the flag is not getting reset when the frame is passed upwards. As a result br1 sees the flag, wrongly interprets it, and fails to forward the frame to wlan0. When passing a frame upwards, clear the flag. This is the Rx equivalent of br_switchdev_frame_unmark() in br_dev_xmit(). Fixes: f1c2eddf4cb6 ("bridge: switchdev: Use an helper to clear forward mark") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20220518005840.771575-1-andrew@lunn.ch Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-09rtnetlink: add extack support in fdb del handlersAlaa Mohamed2-2/+4
Add extack support to .ndo_fdb_del in netdevice.h and all related methods. Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-06net: make drivers set the TSO limit not the GSO limitJakub Kicinski1-6/+6
Drivers should call the TSO setting helper, GSO is controllable by user space. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+2
include/linux/netdevice.h net/core/dev.c 6510ea973d8d ("net: Use this_cpu_inc() to increment net->core_stats") 794c24e9921f ("net-core: rx_otherhost_dropped to core_stats") https://lore.kernel.org/all/20220428111903.5f4304e0@canb.auug.org.au/ drivers/net/wan/cosa.c d48fea8401cf ("net: cosa: fix error check return value of register_chrdev()") 89fbca3307d4 ("net: wan: remove support for COSA and SRP synchronous serial boards") https://lore.kernel.org/all/20220428112130.1f689e5e@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>