Age | Commit message (Collapse) | Author | Files | Lines |
|
Jonathan reports crashes when running net-next in Meta's fleet.
Stats collection uses ethtool -I which does a per-op policy dump
to check if stats are supported. We don't initialize the dumpit
information if doit succeeds due to evaluation short-circuiting.
The crash may look like this:
BUG: kernel NULL pointer dereference, address: 0000000000000cc0
RIP: 0010:netlink_policy_dump_add_policy+0x174/0x2a0
ctrl_dumppolicy_start+0x19f/0x2f0
genl_start+0xe7/0x140
Or we may trigger a warning:
WARNING: CPU: 1 PID: 785 at net/netlink/policy.c:87 netlink_policy_dump_get_policy_idx+0x79/0x80
RIP: 0010:netlink_policy_dump_get_policy_idx+0x79/0x80
ctrl_dumppolicy_put_op+0x214/0x360
depending on what garbage we pick up from the stack.
Reported-by: Jonathan Lemon <bsd@meta.com>
Fixes: 26588edbef60 ("genetlink: support split policies in ctrl_dumppolicy_put_op()")
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20221109183254.554051-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a devlink port is unregistered, its type is expected to be unset or
otherwise a WARNING is generated [1]. This was supposed to be handled by
cited commit by clearing the type upon 'NETDEV_PRE_UNINIT'.
The assumption was that no other events can be generated for the netdev
after this event, but this proved to be wrong. After the event is
generated, netdev_wait_allrefs_any() will rebroadcast a
'NETDEV_UNREGISTER' until the netdev's reference count drops to 1. This
causes devlink to set the port type back to Ethernet.
Fix by only setting and clearing the port type upon 'NETDEV_POST_INIT'
and 'NETDEV_PRE_UNINIT', respectively. For all other events, preserve
the port type.
[1]
WARNING: CPU: 0 PID: 11 at net/core/devlink.c:9998 devl_port_unregister+0x2f6/0x390 net/core/devlink.c:9998
Modules linked in:
CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc3-next-20221107-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Workqueue: netns cleanup_net
RIP: 0010:devl_port_unregister+0x2f6/0x390 net/core/devlink.c:9998
[...]
Call Trace:
<TASK>
__nsim_dev_port_del+0x1bb/0x240 drivers/net/netdevsim/dev.c:1433
nsim_dev_port_del_all drivers/net/netdevsim/dev.c:1443 [inline]
nsim_dev_reload_destroy+0x171/0x510 drivers/net/netdevsim/dev.c:1660
nsim_dev_reload_down+0x6b/0xd0 drivers/net/netdevsim/dev.c:968
devlink_reload+0x1c2/0x6b0 net/core/devlink.c:4501
devlink_pernet_pre_exit+0x104/0x1c0 net/core/devlink.c:12609
ops_pre_exit_list net/core/net_namespace.c:159 [inline]
cleanup_net+0x451/0xb10 net/core/net_namespace.c:594
process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
worker_thread+0x665/0x1080 kernel/workqueue.c:2436
kthread+0x2e4/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>
Fixes: 02a68a47eade ("net: devlink: track netdev with devlink_port assigned")
Reported-by: syzbot+85e47e1a08b3e159b159@syzkaller.appspotmail.com
Reported-by: syzbot+c2ca18f0fccdd1f09c66@syzkaller.appspotmail.com
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20221110085150.520800-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Long Li says:
====================
Introduce Microsoft Azure Network Adapter (MANA) RDMA driver [netdev prep]
The first 11 patches which modify the MANA Ethernet driver to support
RDMA driver.
* 'mana-shared-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
net: mana: Define data structures for protection domain and memory registration
net: mana: Define data structures for allocating doorbell page from GDMA
net: mana: Define and process GDMA response code GDMA_STATUS_MORE_ENTRIES
net: mana: Define max values for SGL entries
net: mana: Move header files to a common location
net: mana: Record port number in netdev
net: mana: Export Work Queue functions for use by RDMA driver
net: mana: Set the DMA device max segment size
net: mana: Handle vport sharing between devices
net: mana: Record the physical address for doorbell page region
net: mana: Add support for auxiliary device
====================
Link: https://lore.kernel.org/all/1667502990-2559-1-git-send-email-longli@linuxonhyperv.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If ethtool_ops::get_drvinfo() callback isn't set,
ethtool_get_drvinfo() will fill the ethtool_drvinfo::name and
ethtool_drvinfo::bus_info fields.
However, if the driver provides the callback function, those two
fields are not touched. This means that the driver has to fill these
itself.
Allow the driver to leave those two fields empty and populate them in
such case. This way, the driver can rely on the default values for the
name and the bus_info. If the driver provides values, do nothing.
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Link: https://lore.kernel.org/r/20221108035754.2143-1-mailhol.vincent@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The ndo_start_xmit field in net_device_ops is expected to be of type
netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb, struct net_device *dev).
The mismatched return type breaks forward edge kCFI since the underlying
function definition does not match the function hook definition. A new
warning in clang will catch this at compile time:
drivers/net/ethernet/microsoft/mana/mana_en.c:382:21: error: incompatible function pointer types initializing 'netdev_tx_t (*)(struct sk_buff *, struct net_device *)' (aka 'enum netdev_tx (*)(struct sk_buff *, struct net_device *)') with an expression of type 'int (struct sk_buff *, struct net_device *)' [-Werror,-Wincompatible-function-pointer-types-strict]
.ndo_start_xmit = mana_start_xmit,
^~~~~~~~~~~~~~~
1 error generated.
The return type of mana_start_xmit should be changed from int to
netdev_tx_t.
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1703
Link: https://github.com/ClangBuiltLinux/linux/issues/1750
Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
[nathan: Rebase on net-next and resolve conflicts
Add note about new clang warning]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20221109002629.1446680-1-nathan@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
After searching for a protocol handler in dev_gro_receive, checking for
failure is redundant. Skip the failure code after finding the
corresponding handler.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221108123320.GA59373@debian
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The MANA hardware support protection domain and memory registration for use
in RDMA environment. Add those definitions and expose them for use by the
RDMA driver.
Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-12-git-send-email-longli@linuxonhyperv.com
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
The RDMA device needs to allocate doorbell pages for each user context.
Define the GDMA data structures for use by the RDMA driver.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-11-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
When doing memory registration, the PF may respond with
GDMA_STATUS_MORE_ENTRIES to indicate a follow request is needed. This is
not an error and should be processed as expected.
Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-10-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
The number of maximum SGl entries should be computed from the maximum
WQE size for the intended queue type and the corresponding OOB data
size. This guarantees the hardware queue can successfully queue requests
up to the queue depth exposed to the upper layer.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-9-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
In preparation to add MANA RDMA driver, move all the required header files
to a common location for use by both Ethernet and RDMA drivers.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-8-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
The port number is useful for user-mode application to identify this
net device based on port index. Set to the correct value in ndev.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-7-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
RDMA device may need to create Ethernet device queues for use by Queue
Pair type RAW. This allows a user-mode context accesses Ethernet hardware
queues. Export the supporting functions for use by the RDMA driver.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-6-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
MANA hardware doesn't have any restrictions on the DMA segment size, set it
to the max allowed value.
Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-5-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
For outgoing packets, the PF requires the VF to configure the vport with
corresponding protection domain and doorbell ID for the kernel or user
context. The vport can't be shared between different contexts.
Implement the logic to exclusively take over the vport by either the
Ethernet device or RDMA device.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-4-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
For supporting RDMA device with multiple user contexts with their
individual doorbell pages, record the start address of doorbell page
region for use by the RDMA driver to allocate user context doorbell IDs.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-3-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
In preparation for supporting MANA RDMA driver, add support for auxiliary
device in the Ethernet driver. The RDMA device is modeled as an auxiliary
device to the Ethernet device.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-2-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
Russell King says:
====================
Clean up pcs-xpcs accessors
This series cleans up the pcs-xpcs code to use mdiodev accessors for
read/write just like xpcs_modify_changed() does. In order to do this,
we need to introduce the mdiodev clause 45 accessors.
====================
Link: https://lore.kernel.org/r/Y2pm13+SDg6N/IVx@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use mdiodev accessors rather than accessing the bus and address in
the mdio_device structure and using the mdiobus accessors.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The strlen() may go too far when estimating the length of
the given string. In some cases it may go over the boundary
and crash the system which is the case according to the commit
13a55372b64e ("ARM: orion5x: Revert commit 4904dbda41c8.").
Rectify this by switching to strnlen() for the expected
maximum length of the string.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20221108141108.62974-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The existing genphy_loopback() is not working for TI DP83867 PHY as it
will disable autoneg support while another side is still enabling autoneg.
This is causing the link is not established and results in timeout error
in genphy_loopback() function.
Thus, based on TI PHY datasheet, introduce a TI PHY loopback function by
just configuring BMCR_LOOPBACK(Bit-9) in MII_BMCR register (0x0).
Tested working on TI DP83867 PHY for all speeds (10/100/1000Mbps).
Signed-off-by: Tan Tee Min <tee.min.tan@linux.intel.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20221108101527.612723-1-michael.wei.hong.sit@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Raju Lakkaraju says:
====================
net: lan743x: PCI11010 / PCI11414 devices Enhancements
This patch series continues with the addition of supported features for the
Ethernet function of the PCI11010 / PCI11414 devices to the LAN743x driver.
====================
Link: https://lore.kernel.org/r/20221107085650.991470-1-Raju.Lakkaraju@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add support to SGMII register dump
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the unused argument (i.e. struct ethtool_regs *regs) in
lan743x_common_regs( ) function arguments.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Petr Machata says:
====================
mlxsw: Add 802.1X and MAB offload support
This patchset adds 802.1X [1] and MAB [2] offload support in mlxsw.
Patches #1-#3 add the required switchdev interfaces.
Patches #4-#5 add the required packet traps for 802.1X.
Patches #6-#10 are small preparations in mlxsw.
Patch #11 adds locked bridge port support in mlxsw.
Patches #12-#15 add mlxsw selftests. The patchset was also tested with
the generic forwarding selftest ('bridge_locked_port.sh').
[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=a21d9a670d81103db7f788de1a4a4a6e4b891a0b
[2] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=a35ec8e38cdd1766f29924ca391a01de20163931
====================
Link: https://lore.kernel.org/r/cover.1667902754.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Test that locked bridge port configurations that are not supported by
mlxsw are rejected.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Test that packets received via a locked bridge port whose {SMAC, VID}
does not appear in the bridge's FDB or appears with a different port,
trigger the "locked_port" packet trap.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Test that packets with a destination MAC of 01:80:C2:00:00:03 trigger
the "eapol" packet trap.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Merely checking whether a trap counter incremented or not without
logging a test result is useful on its own. Split this functionality to
a helper which will be used by subsequent patches.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add locked bridge port support by reacting to changes in the
'BR_PORT_LOCKED' flag. When set, enable security checks on the local
port via the previously added SPFSR register.
When security checks are enabled, an incoming packet will trigger an FDB
lookup with the packet's source MAC and the FID it was classified to. If
an FDB entry was not found or was found to be pointing to a different
port, the packet will be dropped. Such packets increment the
"discard_ingress_general" ethtool counter. For added visibility, user
space can trap such packets to the CPU by enabling the "locked_port"
trap. Example:
# devlink trap set pci/0000:06:00.0 trap locked_port action trap
Unlike other configurations done via bridge port flags (e.g., learning,
flooding), security checks are enabled in the device on a per-port basis
and not on a per-{port, VLAN} basis. As such, scenarios where user space
can configure different locking settings for different VLANs configured
on a port need to be vetoed. To that end, veto the following scenarios:
1. Locking is set on a bridge port that is a VLAN upper
2. Locking is set on a bridge port that has VLAN uppers
3. VLAN upper is configured on a locked bridge port
Examples:
# bridge link set dev swp1.10 locked on
Error: mlxsw_spectrum: Locked flag cannot be set on a VLAN upper.
# ip link add link swp1 name swp1.10 type vlan id 10
# bridge link set dev swp1 locked on
Error: mlxsw_spectrum: Locked flag cannot be set on a bridge port that has VLAN uppers.
# bridge link set dev swp1 locked on
# ip link add link swp1 name swp1.10 type vlan id 10
Error: mlxsw_spectrum: VLAN uppers are not supported on a locked port.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Propagate extack to mlxsw_sp_port_attr_br_pre_flags_set() in order to
communicate error messages related to bridge port flag validation.
Example:
# bridge link set dev swp1 locked on
Error: mlxsw_spectrum: Unsupported bridge port flag.
More error messages will be added in subsequent patches.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In Spectrum, learning happens in parallel to the security checks.
Therefore, regardless of the result of the security checks, a learning
notification will be generated by the device and polled later on by the
driver.
Currently, the driver reacts to learning notifications by programming
corresponding FDB entries to the device. When a port is locked (i.e.,
has security checks enabled), this can no longer happen, as otherwise
any host will blindly gain authorization.
Instead, notify the learned entry as a locked entry to the bridge driver
that will in turn notify it to user space, in case MAB is enabled. User
space can then decide to authorize the host by clearing the "locked"
flag, which will cause the entry to be programmed to the device.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Subsequent patches will need to report locked FDB entries to the bridge
driver. Prepare for that by adding a 'locked' argument to
mlxsw_sp_fdb_call_notifiers() according to which the 'locked' bit is set
in the FDB notification info. For now, always pass 'false'.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add an API to enable or disable security checks on a local port. It will
be used by subsequent patches when the 'BR_PORT_LOCKED' flag is toggled.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add the Switch Port FDB Security Register (SPFSR) that allows enabling
and disabling security checks on a given local port. In Linux terms, it
allows locking / unlocking a port.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Register the previously added packet traps with devlink. This allows
user space to tune their policers and in the case of the locked port
trap, user space can set its action to "trap" in order to gain
visibility into packets that were discarded by the device due to the
locked port check failure.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add packet traps for 802.1X operation. The "eapol" control trap is used
to trap EAPOL packets and is required for the correct operation of the
control plane. The "locked_port" drop trap can be enabled to gain
visibility into packets that were dropped by the device due to the
locked bridge port check.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Reflect the 'BR_PORT_MAB' flag to device drivers so that:
* Drivers that support MAB could act upon the flag being toggled.
* Drivers that do not support MAB will prevent MAB from being enabled.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When the bridge is offloaded to hardware, FDB entries are learned and
aged-out by the hardware. Some device drivers synchronize the hardware
and software FDBs by generating switchdev events towards the bridge.
When a port is locked, the hardware must not learn autonomously, as
otherwise any host will blindly gain authorization. Instead, the
hardware should generate events regarding hosts that are trying to gain
authorization and their MAC addresses should be notified by the device
driver as locked FDB entries towards the bridge driver.
Allow device drivers to notify the bridge driver about such entries by
extending the 'switchdev_notifier_fdb_info' structure with the 'locked'
bit. The bit can only be set by device drivers and not by the bridge
driver.
Prevent a locked entry from being installed if MAB is not enabled on the
bridge port.
If an entry already exists in the bridge driver, reject the locked entry
if the current entry does not have the "locked" flag set or if it points
to a different port. The same semantics are implemented in the software
data path.
Signed-off-by: Hans J. Schultz <netdev@kapio-technology.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, FDB entries that are notified to the bridge via
'SWITCHDEV_FDB_ADD_TO_BRIDGE' are always marked as offloaded. With MAB
enabled, this will no longer be universally true. Device drivers will
report locked FDB entries to the bridge to let it know that the
corresponding hosts required authorization, but it does not mean that
these entries are necessarily programmed in the underlying hardware.
Solve this by determining the offload indication based of the
'offloaded' bit in the FDB notification.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
'net-devlink-move-netdev-notifier-block-to-dest-namespace-during-reload'
Jiri Pirko says:
====================
net: devlink: move netdev notifier block to dest namespace during reload
Patch #1 is just a dependency of patch #2, which is the actual fix.
====================
Link: https://lore.kernel.org/r/20221108132208.938676-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The notifier block tracking netdev changes in devlink is registered
during devlink_alloc() per-net, it is then unregistered
in devlink_free(). When devlink moves from net namespace to another one,
the notifier block needs to move along.
Fix this by adding forgotten call to move the block.
Reported-by: Ido Schimmel <idosch@idosch.org>
Fixes: 02a68a47eade ("net: devlink: track netdev with devlink_port assigned")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, net_dev() netdev notifier variant follows the netdev with
per-net notifier from namespace to namespace. This is implemented
by move_netdevice_notifiers_dev_net() helper.
For devlink it is needed to re-register per-net notifier during
devlink reload. Introduce a new helper called
move_netdevice_notifier_net() and share the unregister/register code
with existing move_netdevice_notifiers_dev_net() helper.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The return value from genl_op_iter_init() only tells us if
there are any policies but to begin the iteration (and therefore
load the first entry) we need to call genl_op_iter_next().
Note that it's safe to call genl_op_iter_next() on a family
with no ops, it will just return false.
This may lead to various crashes, a warning in
netlink_policy_dump_get_policy_idx() when policy is not found
or.. no problem at all if the kmalloc'ed memory happens to be
zeroed.
Fixes: b502b3185cd6 ("genetlink: use iterator in the op to policy map dumping")
Link: https://lore.kernel.org/r/20221108204128.330287-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
rxrpc changes
David Howells says:
====================
rxrpc: Increasing SACK size and moving away from softirq, part 1
AF_RXRPC has some issues that need addressing:
(1) The SACK table has a maximum capacity of 255, but for modern networks
that isn't sufficient. This is hard to increase in the upstream code
because of the way the application thread is coupled to the softirq
and retransmission side through a ring buffer. Adjustments to the rx
protocol allows a capacity of up to 8192, and having a ring
sufficiently large to accommodate that would use an excessive amount
of memory as this is per-call.
(2) Processing ACKs in softirq mode causes the ACKs get conflated, with
only the most recent being considered. Whilst this has the upside
that the retransmission algorithm only needs to deal with the most
recent ACK, it causes DATA transmission for a call to be very bursty
because DATA packets cannot be transmitted in softirq mode. Rather
transmission must be delegated to either the application thread or a
workqueue, so there tend to be sudden bursts of traffic for any
particular call due to scheduling delays.
(3) All crypto in a single call is done in series; however, each DATA
packet is individually encrypted so encryption and decryption of large
calls could be parallelised if spare CPU resources are available.
This is the first of a number of sets of patches that try and address them.
The overall aims of these changes include:
(1) To get rid of the TxRx ring and instead pass the packets round in
queues (eg. sk_buff_head). On the Tx side, each ACK packet comes with
a SACK table that can be parsed as-is, so there's no particular need
to maintain our own; we just have to refer to the ACK.
On the Rx side, we do need to maintain a SACK table with one bit per
entry - but only if packets go missing - and we don't want to have to
perform a complex transformation to get the information into an ACK
packet.
(2) To try and move almost all processing of received packets out of the
softirq handler and into a high-priority kernel I/O thread. Only the
transferral of packets would be left there. I would still use the
encap_rcv hook to receive packets as there's a noticeable performance
drop from letting the UDP socket put the packets into its own queue
and then getting them out of there.
(3) To make the I/O thread also do all the transmission. The app thread
would be responsible for packaging the data into packets and then
buffering them for the I/O thread to transmit. This would make it
easier for the app thread to run ahead of the I/O thread, and would
mean the I/O thread is less likely to have to wait around for a new
packet to come available for transmission.
(4) To logically partition the socket/UAPI/KAPI side of things from the
I/O side of things. The local endpoint, connection, peer and call
objects would belong to the I/O side. The socket side would not then
touch the private internals of calls and suchlike and would not change
their states. It would only look at the send queue, receive queue and
a way to pass a message to cause an abort.
(5) To remove as much locking, synchronisation, barriering and atomic ops
as possible from the I/O side. Exclusion would be achieved by
limiting modification of state to the I/O thread only. Locks would
still need to be used in communication with the UDP socket and the
AF_RXRPC socket API.
(6) To provide crypto offload kernel threads that, when there's slack in
the system, can see packets that need crypting and provide
parallelisation in dealing with them.
(7) To remove the use of system timers. Since each timer would then send
a poke to the I/O thread, which would then deal with it when it had
the opportunity, there seems no point in using system timers if,
instead, a list of timeouts can be sensibly consulted. An I/O thread
only then needs to schedule with a timeout when it is idle.
(8) To use zero-copy sendmsg to send packets. This would make use of the
I/O thread being the sole transmitter on the socket to manage the
dead-reckoning sequencing of the completion notifications. There is a
problem with zero-copy, though: the UDP socket doesn't handle running
out of option memory very gracefully.
With regard to this first patchset, the changes made include:
(1) Some fixes, including a fallback for proc_create_net_single_write(),
setting ack.bufferSize to 0 in ACK packets and a fix for rxrpc
congestion management, which shouldn't be saving the cwnd value
between calls.
(2) Improvements in rxrpc tracepoints, including splitting the timer
tracepoint into a set-timer and a timer-expired trace.
(3) Addition of a new proc file to display some stats.
(4) Some code cleanups, including removing some unused bits and
unnecessary header inclusions.
(5) A change to the recently added UDP encap_err_rcv hook so that it has
the same signature as {ip,ipv6}_icmp_error(), and then just have rxrpc
point its UDP socket's hook directly at those.
(6) Definition of a new struct, rxrpc_txbuf, that is used to hold
transmissible packets of DATA and ACK type in a single 2KiB block
rather than using an sk_buff. This allows the buffer to be on a
number of queues simultaneously more easily, and also guarantees that
the entire block is in a single unit for zerocopy purposes and that
the data payload is aligned for in-place crypto purposes.
(7) ACK txbufs are allocated at proposal and queued for later transmission
rather than being stored in a single place in the rxrpc_call struct,
which means only a single ACK can be pending transmission at a time.
The queue is then drained at various points. This allows the ACK
generation code to be simplified.
(8) The Rx ring buffer is removed. When a jumbo packet is received (which
comprises a number of ordinary DATA packets glued together), it used
to be pointed to by the ring multiple times, with an annotation in a
side ring indicating which subpacket was in that slot - but this is no
longer possible. Instead, the packet is cloned once for each
subpacket, barring the last, and the range of data is set in the skb
private area. This makes it easier for the subpackets in a jumbo
packet to be decrypted in parallel.
(9) The Tx ring buffer is removed. The side annotation ring that held the
SACK information is also removed. Instead, in the event of packet
loss, the SACK data attached an ACK packet is parsed.
(10) Allocate an skcipher request when needed in the rxkad security class
rather than caching one in the rxrpc_call struct. This deals with a
race between externally-driven call disconnection getting rid of the
skcipher request and sendmsg/recvmsg trying to use it because they
haven't seen the completion yet. This is also needed to support
parallelisation as the skcipher request cannot be used by two or more
threads simultaneously.
(11) Call udp_sendmsg() and udpv6_sendmsg() directly rather than going
through kernel_sendmsg() so that we can provide our own iterator
(zerocopy explicitly doesn't work with a KVEC iterator). This also
lets us avoid the overhead of the security hook.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow a network interface to be renamed when the interface
is up.
As described in the netconsole documentation [1], when netconsole is
used as a built-in, it will bring up the specified interface as soon as
possible. As a result, user space will not be able to rename the
interface since the kernel disallows renaming of interfaces that are
administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set
by the kernel.
The original solution [2] to this problem was to add a new parameter to
the netconsole configuration parameters that allows renaming of
the interface used by netconsole while it is administratively up.
However, during the discussion that followed, it became apparent that we
have no reason to keep the current restriction and instead we should
allow user space to rename interfaces regardless of their administrative
state:
1. The restriction was put in place over 20 years ago when renaming was
only possible via IOCTL and before rtnetlink started notifying user
space about such changes like it does today.
2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version
5.2 and no regressions were reported.
3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about
the administrative state of interface.
Therefore, allow user space to rename running interfaces by removing the
restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in
possible triage by emitting a message to the kernel log that an
interface was renamed while UP.
[1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst
[2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/
Signed-off-by: Andy Ren <andy.ren@getcruise.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rakesh Sankaranarayanan says:
====================
net: dsa: microchip: ksz_pwrite status check for lan937x and irq and error checking updates for ksz series
This patch series include following changes,
- Add KSZ9563 inside ksz_switch_chips. As per current structure,
KSZ9893 is reused inside ksz_switch_chips structure, but since
there is a mismatch in number of irq's, new member added for KSZ9563
and sku detected based on Global Chip ID 4 Register. Compatible
string from device tree mapped to KSZ9563 for spi and i2c mode
probes.
- Assign device interrupt during i2c probe operation.
- Add error checking for ksz_pwrite inside lan937x_change_mtu. After v6.0,
ksz_pwrite updated to have return type int instead of void, and
lan937x_change_mtu still uses ksz_pwrite without status verification.
- Add port_nirq as 3 for KSZ8563 switch family.
- Use dev_err_probe() instead of dev_err() to have more standardized error
formatting and logging.
v1 -> v2:
- Removed regmap validation patch from the series, planning to take
up in future after checking for any better approach and studying
the actual need for this change.
- Resolved error reported in ksz8863_smi.c file.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Probe functions uses normal dev_err() to check error conditions
and print messages. Replace dev_err() with dev_err_probe() to
have more standardized format and error logging.
Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
KSZ8563 have three port interrupts: PTP, PHY and ACL. Add
port_nirq as 3 for KSZ8563 inside ksz_chip_data.
Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|