summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2021-07-20net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICEVladimir Oltean1-0/+56
Currently DSA has an issue with FDB entries pointing towards the bridge in the presence of br_fdb_replay() being called at port join and leave time. In particular, each bridge port will ask for a replay for the FDB entries pointing towards the bridge when it joins, and for another replay when it leaves. This means that for example, a bridge with 4 switch ports will notify DSA 4 times of the bridge MAC address. But if the MAC address of the bridge changes during the normal runtime of the system, the bridge notifies switchdev [ once ] of the deletion of the old MAC address as a local FDB towards the bridge, and of the insertion [ again once ] of the new MAC address as a local FDB. This is a problem, because DSA keeps the old MAC address as a host FDB entry with refcount 4 (4 ports asked for it using br_fdb_replay). So the old MAC address will not be deleted. Additionally, the new MAC address will only be installed with refcount 1, and when the first switch port leaves the bridge (leaving 3 others as still members), it will delete with it the new MAC address of the bridge from the local FDB entries kept by DSA (because the br_fdb_replay call on deletion will bring the entry's refcount from 1 to 0). So the problem, really, is that the number of br_fdb_replay() calls is not matched with the refcount that a host FDB is offloaded to DSA during normal runtime. An elegant way to solve the problem would be to make the switchdev notification emitted by br_fdb_change_mac_address() result in a host FDB kept by DSA which has a refcount exactly equal to the number of ports under that bridge. Then, no matter how many DSA ports join or leave that bridge, the host FDB entry will always be deleted when there are exactly zero remaining DSA switch ports members of the bridge. To implement the proposed solution, we remember that the switchdev objects and port attributes have some helpers provided by switchdev, which can be optionally called by drivers: switchdev_handle_port_obj_{add,del} and switchdev_handle_port_attr_set. These helpers: - fan out a switchdev object/attribute emitted for the bridge towards all the lower interfaces that pass the check_cb(). - fan out a switchdev object/attribute emitted for a bridge port that is a LAG towards all the lower interfaces that pass the check_cb(). In other words, this is the model we need for the FDB events too: something that will keep an FDB entry emitted towards a physical port as it is, but translate an FDB entry emitted towards the bridge into N FDB entries, one per physical port. Of course, there are many differences between fanning out a switchdev object (VLAN) on 3 lower interfaces of a LAG and fanning out an FDB entry on 3 lower interfaces of a LAG. Intuitively, an FDB entry towards a LAG should be treated specially, because FDB entries are unicast, we can't just install the same address towards 3 destinations. It is imaginable that drivers might want to treat this case specifically, so create some methods for this case and do not recurse into the LAG lower ports, just the bridge ports. DSA also listens for FDB entries on "foreign" interfaces, aka interfaces bridged with us which are not part of our hardware domain: think an Ethernet switch bridged with a Wi-Fi AP. For those addresses, DSA installs host FDB entries. However, there we have the same problem (those host FDB entries are installed with a refcount of only 1) and an even bigger one which we did not have with FDB entries towards the bridge: br_fdb_replay() is currently not called for FDB entries on foreign interfaces, just for the physical port and for the bridge itself. So when DSA sniffs an address learned by the software bridge towards a foreign interface like an e1000 port, and then that e1000 leaves the bridge, DSA remains with the dangling host FDB address. That will be fixed separately by replaying all FDB entries and not just the ones towards the port and the bridge. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: switchdev: introduce helper for checking dynamically learned FDB entriesVladimir Oltean1-0/+6
It is a bit difficult to understand what DSA checks when it tries to avoid installing dynamically learned addresses on foreign interfaces as local host addresses, so create a generic switchdev helper that can be reused and is generally more readable. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: phy: add API to read 802.3-c45 IDsXu Liang1-0/+1
Add API to read 802.3-c45 IDs so that C22/C45 mixed device can use C45 APIs without failing ID checks. Signed-off-by: Xu Liang <lxu@maxlinear.com> Acked-by: Hauke Mehrtens <hmehrtens@maxlinear.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: dsa: tag_8021q: add proper cross-chip notifier supportVladimir Oltean1-13/+3
The big problem which mandates cross-chip notifiers for tag_8021q is this: | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] | +---------+ | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] When the user runs: ip link add br0 type bridge ip link set sw0p0 master br0 ip link set sw2p0 master br0 It doesn't work. This is because dsa_8021q_crosschip_bridge_join() assumes that "ds" and "other_ds" are at most 1 hop away from each other, so it is sufficient to add the RX VLAN of {ds, port} into {other_ds, other_port} and vice versa and presto, the cross-chip link works. When there is another switch in the middle, such as in this case switch 1 with its DSA links sw1p3 and sw1p4, somebody needs to tell it about these VLANs too. Which is exactly why the problem is quadratic: when a port joins a bridge, for each port in the tree that's already in that same bridge we notify a tag_8021q VLAN addition of that port's RX VLAN to the entire tree. It is a very complicated web of VLANs. It must be mentioned that currently we install tag_8021q VLANs on too many ports (DSA links - to be precise, on all of them). For example, when sw2p0 joins br0, and assuming sw1p0 was part of br0 too, we add the RX VLAN of sw2p0 on the DSA links of switch 0 too, even though there isn't any port of switch 0 that is a member of br0 (at least yet). In theory we could notify only the switches which sit in between the port joining the bridge and the port reacting to that bridge_join event. But in practice that is impossible, because of the way 'link' properties are described in the device tree. The DSA bindings require DT writers to list out not only the real/physical DSA links, but in fact the entire routing table, like for example switch 0 above will have: sw0p3: port@3 { link = <&sw1p4 &sw2p4>; }; This was done because: /* TODO: ideally DSA ports would have a single dp->link_dp member, * and no dst->rtable nor this struct dsa_link would be needed, * but this would require some more complex tree walking, * so keep it stupid at the moment and list them all. */ but it is a perfect example of a situation where too much information is actively detrimential, because we are now in the position where we cannot distinguish a real DSA link from one that is put there to avoid the 'complex tree walking'. And because DT is ABI, there is not much we can change. And because we do not know which DSA links are real and which ones aren't, we can't really know if DSA switch A is in the data path between switches B and C, in the general case. So this is why tag_8021q RX VLANs are added on all DSA links, and probably why it will never change. On the other hand, at least the number of additions/deletions is well balanced, and this means that once we implement reference counting at the cross-chip notifier level a la fdb/mdb, there is absolutely zero need for a struct dsa_8021q_crosschip_link, it's all self-managing. In fact, with the tag_8021q notifiers emitted from the bridge join notifiers, it becomes so generic that sja1105 does not need to do anything anymore, we can just delete its implementation of the .crosschip_bridge_{join,leave} methods. Among other things we can simply delete is the home-grown implementation of sja1105_notify_crosschip_switches(). The reason why that is wrong is because it is not quadratic - it only covers remote switches to which we have a cross-chip bridging link and that does not cover in-between switches. This deletion is part of the same patch because sja1105 used to poke deep inside the guts of the tag_8021q context in order to do that. Because the cross-chip links went away, so needs the sja1105 code. Last but not least, dsa_8021q_setup_port() is simplified (and also renamed). Because our TAG_8021Q_VLAN_ADD notifier is designed to react on the CPU port too, the four dsa_8021q_vid_apply() calls: - 1 for RX VLAN on user port - 1 for the user port's RX VLAN on the CPU port - 1 for TX VLAN on user port - 1 for the user port's TX VLAN on the CPU port now get squashed into only 2 notifier calls via dsa_port_tag_8021q_vlan_add. And because the notifiers to add and to delete a tag_8021q VLAN are distinct, now we finally break up the port setup and teardown into separate functions instead of relying on a "bool enabled" flag which tells us what to do. Arguably it should have been this way from the get go. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}registerVladimir Oltean1-2/+0
Right now, setting up tag_8021q is a 2-step operation for a driver, first the context structure needs to be created, then the VLANs need to be installed on the ports. A similar thing is true for teardown. Merge the 2 steps into the register/unregister methods, to be as transparent as possible for the driver as to what tag_8021q does behind the scenes. This also gets rid of the funny "bool setup == true means setup, == false means teardown" API that tag_8021q used to expose. Note that dsa_tag_8021q_register() must be called at least in the .setup() driver method and never earlier (like in the driver probe function). This is because the DSA switch tree is not initialized at probe time, and the cross-chip notifiers will not work. For symmetry with .setup(), the unregister method should be put in .teardown(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: dsa: make tag_8021q operations part of the coreVladimir Oltean2-9/+8
Make tag_8021q a more central element of DSA and move the 2 driver specific operations outside of struct dsa_8021q_context (which is supposed to hold dynamic data and not really constant function pointers). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: dsa: let the core manage the tag_8021q contextVladimir Oltean2-9/+12
The basic problem description is as follows: Be there 3 switches in a daisy chain topology: | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] | +---------+ | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] The CPU will not be able to ping through the user ports of the bottom-most switch (like for example sw2p0), simply because tag_8021q was not coded up for this scenario - it has always assumed DSA switch trees with a single switch. To add support for the topology above, we must admit that the RX VLAN of sw2p0 must be added on some ports of switches 0 and 1 as well. This is in fact a textbook example of thing that can use the cross-chip notifier framework that DSA has set up in switch.c. There is only one problem: core DSA (switch.c) is not able right now to make the connection between a struct dsa_switch *ds and a struct dsa_8021q_context *ctx. Right now, it is drivers who call into tag_8021q.c and always provide a struct dsa_8021q_context *ctx pointer, and tag_8021q.c calls them back with the .tag_8021q_vlan_{add,del} methods. But with cross-chip notifiers, it is possible for tag_8021q to call drivers without drivers having ever asked for anything. A good example is right above: when sw2p0 wants to set itself up for tag_8021q, the .tag_8021q_vlan_add method needs to be called for switches 1 and 0, so that they transport sw2p0's VLANs towards the CPU without dropping them. So instead of letting drivers manage the tag_8021q context, add a tag_8021q_ctx pointer inside of struct dsa_switch, which will be populated when dsa_tag_8021q_register() returns success. The patch is fairly long-winded because we are partly reverting commit 5899ee367ab3 ("net: dsa: tag_8021q: add a context structure") which made the driver-facing tag_8021q API use "ctx" instead of "ds". Now that we can access "ctx" directly from "ds", this is no longer needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: dsa: tag_8021q: create dsa_tag_8021q_{register,unregister} helpersVladimir Oltean1-0/+6
In preparation of moving tag_8021q to core DSA, move all initialization and teardown related to tag_8021q which is currently done by drivers in 2 functions called "register" and "unregister". These will gather more functionality in future patches, which will better justify the chosen naming scheme. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: dsa: tag_8021q: remove struct packet_type declarationVladimir Oltean1-1/+0
This is no longer necessary since tag_8021q doesn't register itself as a full-blown tagger anymore. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: dsa: sja1105: delete the best_effort_vlan_filtering modeVladimir Oltean2-9/+1
Simply put, the best-effort VLAN filtering mode relied on VLAN retagging from a bridge VLAN towards a tag_8021q sub-VLAN in order to be able to decode the source port in the tagger, but the VLAN retagging implementation inside the sja1105 chips is not the best and we were relying on marginal operating conditions. The most notable limitation of the best-effort VLAN filtering mode is its incapacity to treat this case properly: ip link add br0 type bridge vlan_filtering 1 ip link set swp2 master br0 ip link set swp4 master br0 bridge vlan del dev swp4 vid 1 bridge vlan add dev swp4 vid 1 pvid When sending an untagged packet through swp2, the expectation is for it to be forwarded to swp4 as egress-tagged (so it will contain VLAN ID 1 on egress). But the switch will send it as egress-untagged. There was an attempt to fix this here: https://patchwork.kernel.org/project/netdevbpf/patch/20210407201452.1703261-2-olteanv@gmail.com/ but it failed miserably because it broke PTP RX timestamping, in a way that cannot be corrected due to hardware issues related to VLAN retagging. So with either PTP broken or pushing VLAN headers on egress for untagged packets being broken, the sad reality is that the best-effort VLAN filtering code is broken. Delete it. Note that this means there will be a temporary loss of functionality in this driver until it is replaced with something better (network stack RX/TX capability for "mode 2" as described in Documentation/networking/dsa/sja1105.rst, the "port under VLAN-aware bridge" case). We simply cannot keep this code until that driver rework is done, it is super bloated and tangled with tag_8021q. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20Merge branch 'bridge-vlan-multicast'David S. Miller1-0/+18
Nikolay Aleksandrov says: ==================== net: bridge: multicast: add vlan support This patchset adds initial per-vlan multicast support, most of the code deals with moving to multicast context pointers from bridge/port pointers. That allows us to switch them with the per-vlan contexts when a multicast packet is being processed and vlan multicast snooping has been enabled. That is controlled by a global bridge option added in patch 06 which is off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note that this option can change only under RTNL and doesn't require multicast_lock, so we need to be careful when retrieving mcast contexts in parallel. For packet processing they are switched only once in br_multicast_rcv() and then used until the packet has been processed. For the most part we need these contexts only to read config values and check if they are disabled. The global mcast state which is maintained consists of querier and router timers, the rest are config options. The port mcast state which is maintained consists of query timer and link to router port list if it's ever marked as a router port. Port multicast contexts _must_ be used only with their respective global contexts, that is a bridge port's mcast context must be used only with bridge's global mcast context and a vlan/port's mcast context must be used only with that vlan's global mcast context due to the router port lists. This way a bridge port can be marked as a router in multiple vlans, but might not be a router in some other vlan. Also this allows us to have per-vlan querier elections, per-vlan queries and basically the whole multicast state becomes per-vlan when the option is enabled. One of the hardest parts is synchronization with vlan's memory management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED which is changed only under multicast_lock. When a vlan is being destroyed first that flag is removed under the lock, then the multicast context is torn down which includes waiting for any outstanding context timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED it must be checked first if the contexts are vlan and the multicast_lock has been acquired. That is done by all IGMP/MLD packet processing functions and timers. When processing a packet we have RCU so the vlan memory won't be freed, but if the flag is missing we must not process it. The timers are synchronized in the same way with the addition of waiting for them to finish in case they are running after removing the flag under multicast_lock (i.e. they were waiting for the lock). Multicast vlan snooping requires vlan filtering to be enabled, if it's disabled then snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED controls if a vlan has BR_VLFLAG_MCAST_ENABLED set which is used in all vlan disabled checks. We need both flags because one is controlled by user-space globally (BR_VLFLAG_GLOBAL_MCAST_ENABLED) and the other is for a particular bridge/vlan or port/vlan entry (BR_VLFLAG_MCAST_ENABLED). Since the latter is also used for synchronization between the multicast and vlan code, and also controlled by BR_VLFLAG_GLOBAL_MCAST_ENABLED we rely on it when checking if a vlan context is disabled. The multicast fast-path has 3 new bit tests on the cache-hot bridge flags field, I didn't observe any measurable difference. I haven't forced either context options to be always disabled when the other type is enabled because the state consists of timers which either expire (router) or don't affect the normal operation. Some options, like the mcast querier one, won't be allowed to change for the disabled context type, that will come with a future patch-set which adds per-vlan querier control. Another important addition is the global vlan options, so far we had only per bridge/port vlan options but in order to control vlan multicast snooping globally we need to add a new type of global vlan options. They can be changed only on the bridge device and are dumped only when a special flag is set in the dump request. The first global option is vlan mcast snooping control, it controls the vlan BR_VLFLAG_GLOBAL_MCAST_ENABLED private flag. It can be set only on master vlan entries. There will be many more global vlan options in the future both for multicast config and other per-vlan options (e.g. STP). There's a lot of room for improvements, I'll do some of the initial ones but splitting the state to different contexts opens the door for a lot more. Also any new multicast options become vlan-supported with very little to no effort by using the same contexts. Short patch description: patches 01-04: initial mcast context add, no functional changes patch 05: adds vlan mcast init and control helpers and uses them on vlan create/destroy patch 06: adds a global bridge mcast vlan snooping knob (default off) patches 07-08: add a helper for users which must derive the contexts based on current bridge and vlan options (e.g. timers) patch 09: adds checks for disabled vlan contexts in packet processing and timers patch 10: adds support for per-vlan querier and tagged queries patch 11: adds router port vlan id in the notifications patches 12-14: add global vlan options support (change, dump, notify) patch 15: adds per-vlan global mcast snooping control Future patch-sets which build on this one (in order): - vlan state mcast handling - user-space mdb contexts (currently only the bridge contexts are used there) - all bridge multicast config options added per-vlan global and per vlan/port - iproute2 support for all the new uAPIs - selftests This set has been stress-tested (deleting/adding ports/vlans while changing vlan mcast snooping while processing IGMP/MLD packets), and also has passed all bridge self-tests. I'm sending this set as early as possible since there're a few more related sets that should go in the same release to get proper and full mcast vlan snooping support. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: bridge: vlan: add mcast snooping controlNikolay Aleksandrov1-0/+1
Add a new global vlan option which controls whether multicast snooping is enabled or disabled for a single vlan. It controls the vlan private flag: BR_VLFLAG_GLOBAL_MCAST_ENABLED. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: bridge: vlan: add support for dumping global vlan optionsNikolay Aleksandrov1-0/+1
Add a new vlan options dump flag which causes only global vlan options to be dumped. The dumps are done only with bridge devices, ports are ignored. They support vlan compression if the options in sequential vlans are equal (currently always true). Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: bridge: vlan: add support for global optionsNikolay Aleksandrov1-0/+13
We can have two types of vlan options depending on context: - per-device vlan options (split in per-bridge and per-port) - global vlan options The second type wasn't supported in the bridge until now, but we need them for per-vlan multicast support, per-vlan STP support and other options which require global vlan context. They are contained in the global bridge vlan context even if the vlan is not configured on the bridge device itself. This patch adds initial netlink attributes and support for setting these global vlan options, they can only be set (RTM_NEWVLAN) and the operation must use the bridge device. Since there are no such options yet it shouldn't have any functional effect. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: bridge: multicast: include router port vlan id in notificationsNikolay Aleksandrov1-0/+1
Use the port multicast context to check if the router port is a vlan and in case it is include its vlan id in the notification. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: bridge: add vlan mcast snooping knobNikolay Aleksandrov1-0/+2
Add a global knob that controls if vlan multicast snooping is enabled. The proper contexts (vlan or bridge-wide) will be chosen based on the knob when processing packets and changing bridge device state. Note that vlans have their individual mcast snooping enabled by default, but this knob is needed to turn on bridge vlan snooping. It is disabled by default. To enable the knob vlan filtering must also be enabled, it doesn't make sense to have vlan mcast snooping without vlan filtering since that would lead to inconsistencies. Disabling vlan filtering will also automatically disable vlan mcast snooping. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net/tcp_fastopen: remove tcp_fastopen_ctx_lockEric Dumazet1-1/+0
Remove the (per netns) spinlock in favor of xchg() atomic operations. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Wei Wang <weiwan@google.com> Link: https://lore.kernel.org/r/20210719101107.3203943-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-16openvswitch: Introduce per-cpu upcall dispatchMark Gray1-0/+8
The Open vSwitch kernel module uses the upcall mechanism to send packets from kernel space to user space when it misses in the kernel space flow table. The upcall sends packets via a Netlink socket. Currently, a Netlink socket is created for every vport. In this way, there is a 1:1 mapping between a vport and a Netlink socket. When a packet is received by a vport, if it needs to be sent to user space, it is sent via the corresponding Netlink socket. This mechanism, with various iterations of the corresponding user space code, has seen some limitations and issues: * On systems with a large number of vports, there is a correspondingly large number of Netlink sockets which can limit scaling. (https://bugzilla.redhat.com/show_bug.cgi?id=1526306) * Packet reordering on upcalls. (https://bugzilla.redhat.com/show_bug.cgi?id=1844576) * A thundering herd issue. (https://bugzilla.redhat.com/show_bug.cgi?id=1834444) This patch introduces an alternative, feature-negotiated, upcall mode using a per-cpu dispatch rather than a per-vport dispatch. In this mode, the Netlink socket to be used for the upcall is selected based on the CPU of the thread that is executing the upcall. In this way, it resolves the issues above as: a) The number of Netlink sockets scales with the number of CPUs rather than the number of vports. b) Ordering per-flow is maintained as packets are distributed to CPUs based on mechanisms such as RSS and flows are distributed to a single user space thread. c) Packets from a flow can only wake up one user space thread. The corresponding user space code can be found at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385139.html Bugzilla: https://bugzilla.redhat.com/1844576 Signed-off-by: Mark Gray <mark.d.gray@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller11-46/+250
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-07-15 The following pull-request contains BPF updates for your *net-next* tree. We've added 45 non-merge commits during the last 15 day(s) which contain a total of 52 files changed, 3122 insertions(+), 384 deletions(-). The main changes are: 1) Introduce bpf timers, from Alexei. 2) Add sockmap support for unix datagram socket, from Cong. 3) Fix potential memleak and UAF in the verifier, from He. 4) Add bpf_get_func_ip helper, from Jiri. 5) Improvements to generic XDP mode, from Kumar. 6) Support for passing xdp_md to XDP programs in bpf_prog_run, from Zvi. =================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16af_unix: Implement unix_dgram_bpf_recvmsg()Cong Wang1-0/+2
We have to implement unix_dgram_bpf_recvmsg() to replace the original ->recvmsg() to retrieve skmsg from ingress_msg. AF_UNIX is again special here because the lack of sk_prot->recvmsg(). I simply add a special case inside unix_dgram_recvmsg() to call sk->sk_prot->recvmsg() directly. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210704190252.11866-8-xiyou.wangcong@gmail.com
2021-07-16af_unix: Implement ->psock_update_sk_prot()Cong Wang1-0/+10
Now we can implement unix_bpf_update_proto() to update sk_prot, especially prot->close(). Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210704190252.11866-7-xiyou.wangcong@gmail.com
2021-07-16sock_map: Relax config dependency to CONFIG_NETCong Wang1-18/+20
Currently sock_map still has Kconfig dependency on CONFIG_INET, but there is no actual functional dependency on it after we introduce ->psock_update_sk_prot(). We have to extend it to CONFIG_NET now as we are going to support AF_UNIX. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210704190252.11866-2-xiyou.wangcong@gmail.com
2021-07-16bpf: Add bpf_get_func_ip helper for kprobe programsJiri Olsa1-1/+1
Adding bpf_get_func_ip helper for BPF_PROG_TYPE_KPROBE programs, so it's now possible to call bpf_get_func_ip from both kprobe and kretprobe programs. Taking the caller's address from 'struct kprobe::addr', which is defined for both kprobe and kretprobe. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/bpf/20210714094400.396467-5-jolsa@kernel.org
2021-07-16bpf: Add bpf_get_func_ip helper for tracing programsJiri Olsa1-0/+7
Adding bpf_get_func_ip helper for BPF_PROG_TYPE_TRACING programs, specifically for all trampoline attach types. The trampoline's caller IP address is stored in (ctx - 8) address. so there's no reason to actually call the helper, but rather fixup the call instruction and return [ctx - 8] value directly. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210714094400.396467-4-jolsa@kernel.org
2021-07-16bpf: Enable BPF_TRAMP_F_IP_ARG for trampolines with call_get_func_ipJiri Olsa1-1/+2
Enabling BPF_TRAMP_F_IP_ARG for trampolines that actually need it. The BPF_TRAMP_F_IP_ARG adds extra 3 instructions to trampoline code and is used only by programs with bpf_get_func_ip helper, which is added in following patch and sets call_get_func_ip bit. This patch ensures that BPF_TRAMP_F_IP_ARG flag is used only for trampolines that have programs with call_get_func_ip set. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210714094400.396467-3-jolsa@kernel.org
2021-07-16bpf, x86: Store caller's ip in trampoline stackJiri Olsa1-0/+5
Storing caller's ip in trampoline's stack. Trampoline programs can reach the IP in (ctx - 8) address, so there's no change in program's arguments interface. The IP address is takes from [fp + 8], which is return address from the initial 'call fentry' call to trampoline. This IP address will be returned via bpf_get_func_ip helper helper, which is added in following patches. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210714094400.396467-2-jolsa@kernel.org
2021-07-15bpf: Teach stack depth check about async callbacks.Alexei Starovoitov1-0/+1
Teach max stack depth checking algorithm about async callbacks that don't increase bpf program stack size. Also add sanity check that bpf_tail_call didn't sneak into async cb. It's impossible, since PTR_TO_CTX is not available in async cb, hence the program cannot contain bpf_tail_call(ctx,...); Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210715005417.78572-10-alexei.starovoitov@gmail.com
2021-07-15bpf: Implement verifier support for validation of async callbacks.Alexei Starovoitov1-1/+8
bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on PTR_TO_FUNC infra in the verifier to validate addresses to subprograms and pass them into the helpers as function callbacks. In case of bpf_for_each_map_elem() the callback is invoked synchronously and the verifier treats it as a normal subprogram call by adding another bpf_func_state and new frame in __check_func_call(). bpf_timer_set_callback() doesn't invoke the callback directly. The subprogram will be called asynchronously from bpf_timer_cb(). Teach the verifier to validate such async callbacks as special kind of jump by pushing verifier state into stack and let pop_stack() process it. Special care needs to be taken during state pruning. The call insn doing bpf_timer_set_callback has to be a prune_point. Otherwise short timer callbacks might not have prune points in front of bpf_timer_set_callback() which means is_state_visited() will be called after this call insn is processed in __check_func_call(). Which means that another async_cb state will be pushed to be walked later and the verifier will eventually hit BPF_COMPLEXITY_LIMIT_JMP_SEQ limit. Since push_async_cb() looks like another push_stack() branch the infinite loop detection will trigger false positive. To recognize this case mark such states as in_async_callback_fn. To distinguish infinite loop in async callback vs the same callback called with different arguments for different map and timer add async_entry_cnt to bpf_func_state. Enforce return zero from async callbacks. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com
2021-07-15bpf: Prevent pointer mismatch in bpf_timer_init.Alexei Starovoitov1-1/+8
bpf_timer_init() arguments are: 1. pointer to a timer (which is embedded in map element). 2. pointer to a map. Make sure that pointer to a timer actually belongs to that map. Use map_uid (which is unique id of inner map) to reject: inner_map1 = bpf_map_lookup_elem(outer_map, key1) inner_map2 = bpf_map_lookup_elem(outer_map, key2) if (inner_map1 && inner_map2) { timer = bpf_map_lookup_elem(inner_map1); if (timer) // mismatch would have been allowed bpf_timer_init(timer, inner_map2); } Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210715005417.78572-6-alexei.starovoitov@gmail.com
2021-07-15bpf: Add map side support for bpf timers.Alexei Starovoitov2-11/+34
Restrict bpf timers to array, hash (both preallocated and kmalloced), and lru map types. The per-cpu maps with timers don't make sense, since 'struct bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that the number of timers depends on number of possible cpus and timers would not be accessible from all cpus. lpm map support can be added in the future. The timers in inner maps are supported. The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free bpf_timer in a given map element. Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate that map element indeed contains 'struct bpf_timer'. Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when map element data is reused in preallocated htab and lru maps. Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single map element. There could be one of each, but not more than one. Due to 'one bpf_timer in one element' restriction do not support timers in global data, since global data is a map of single element, but from bpf program side it's seen as many global variables and restriction of single global timer would be odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with timers, since user space could have corrupted mmap element and crashed the kernel. The maps with timers cannot be readonly. Due to these restrictions search for bpf_timer in datasec BTF in case it was placed in the global data to report clear error. The previous patch allowed 'struct bpf_timer' as a first field in a map element only. Relax this restriction. Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free the timer when lru map deletes an element as a part of it eviction algorithm. Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store. The timer operation are done through helpers only. This is similar to 'struct bpf_spin_lock'. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com
2021-07-15bpf: Introduce bpf timers.Alexei Starovoitov2-0/+76
Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded in hash/array/lru maps as a regular field and helpers to operate on it: // Initialize the timer. // First 4 bits of 'flags' specify clockid. // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed. long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags); // Configure the timer to call 'callback_fn' static function. long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn); // Arm the timer to expire 'nsec' nanoseconds from the current time. long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags); // Cancel the timer and wait for callback_fn to finish if it was running. long bpf_timer_cancel(struct bpf_timer *timer); Here is how BPF program might look like: struct map_elem { int counter; struct bpf_timer timer; }; struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 1000); __type(key, int); __type(value, struct map_elem); } hmap SEC(".maps"); static int timer_cb(void *map, int *key, struct map_elem *val); /* val points to particular map element that contains bpf_timer. */ SEC("fentry/bpf_fentry_test1") int BPF_PROG(test1, int a) { struct map_elem *val; int key = 0; val = bpf_map_lookup_elem(&hmap, &key); if (val) { bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME); bpf_timer_set_callback(&val->timer, timer_cb); bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */, 0); } } This patch adds helper implementations that rely on hrtimers to call bpf functions as timers expire. The following patches add necessary safety checks. Only programs with CAP_BPF are allowed to use bpf_timer. The amount of timers used by the program is constrained by the memcg recorded at map creation time. The bpf_timer_init() helper needs explicit 'map' argument because inner maps are dynamic and not known at load time. While the bpf_timer_set_callback() is receiving hidden 'aux->prog' argument supplied by the verifier. The prog pointer is needed to do refcnting of bpf program to make sure that program doesn't get freed while the timer is armed. This approach relies on "user refcnt" scheme used in prog_array that stores bpf programs for bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is paired with bpf_timer_cancel() that will drop the prog refcnt. The ops->map_release_uref is responsible for cancelling the timers and dropping prog refcnt when user space reference to a map reaches zero. This uref approach is done to make sure that Ctrl-C of user space process will not leave timers running forever unless the user space explicitly pinned a map that contained timers in bpffs. bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't have user references (is not held by open file descriptor from user space and not pinned in bpffs). The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel and free the timer if given map element had it allocated. "bpftool map update" command can be used to cancel timers. The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because '__u64 :64' has 1 byte alignment of 8 byte padding. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com
2021-07-15bus: mhi: pci-generic: configurable network interface MRURichard Laing1-0/+2
The MRU value used by the MHI MBIM network interface affects the throughput performance of the interface. Different modem models use different default MRU sizes based on their bandwidth capabilities. Large values generally result in higher throughput for larger packet sizes. In addition if the MRU used by the MHI device is larger than that specified in the MHI net device the data is fragmented and needs to be re-assembled which generates a (single) warning message about the fragmented packets. Setting the MRU on both ends avoids the extra processing to re-assemble the packets. This patch allows the documented MRU for a modem to be automatically set as the MHI net device MRU avoiding fragmentation and improving throughput performance. Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15bpf: Fix a typo of reuseport map in bpf.h.Kuniyuki Iwashima1-1/+1
Fix s/BPF_MAP_TYPE_REUSEPORT_ARRAY/BPF_MAP_TYPE_REUSEPORT_SOCKARRAY/ typo in bpf.h. Fixes: 2dbb9b9e6df6 ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210714124317.67526-1-kuniyu@amazon.co.jp
2021-07-14Merge tag 'net-5.14-rc2' of ↵Linus Torvalds20-223/+105
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski. "Including fixes from bpf and netfilter. Current release - regressions: - sock: fix parameter order in sock_setsockopt() Current release - new code bugs: - netfilter: nft_last: - fix incorrect arithmetic when restoring last used - honor NFTA_LAST_SET on restoration Previous releases - regressions: - udp: properly flush normal packet at GRO time - sfc: ensure correct number of XDP queues; don't allow enabling the feature if there isn't sufficient resources to Tx from any CPU - dsa: sja1105: fix address learning getting disabled on the CPU port - mptcp: addresses a rmem accounting issue that could keep packets in subflow receive buffers longer than necessary, delaying MPTCP-level ACKs - ip_tunnel: fix mtu calculation for ETHER tunnel devices - do not reuse skbs allocated from skbuff_fclone_cache in the napi skb cache, we'd try to return them to the wrong slab cache - tcp: consistently disable header prediction for mptcp Previous releases - always broken: - bpf: fix subprog poke descriptor tracking use-after-free - ipv6: - allocate enough headroom in ip6_finish_output2() in case iptables TEE is used - tcp: drop silly ICMPv6 packet too big messages to avoid expensive and pointless lookups (which may serve as a DDOS vector) - make sure fwmark is copied in SYNACK packets - fix 'disable_policy' for forwarded packets (align with IPv4) - netfilter: conntrack: - do not renew entry stuck in tcp SYN_SENT state - do not mark RST in the reply direction coming after SYN packet for an out-of-sync entry - mptcp: cleanly handle error conditions with MP_JOIN and syncookies - mptcp: fix double free when rejecting a join due to port mismatch - validate lwtstate->data before returning from skb_tunnel_info() - tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path - mt76: mt7921: continue to probe driver when fw already downloaded - bonding: fix multiple issues with offloading IPsec to (thru?) bond - stmmac: ptp: fix issues around Qbv support and setting time back - bcmgenet: always clear wake-up based on energy detection Misc: - sctp: move 198 addresses from unusable to private scope - ptp: support virtual clocks and timestamping - openvswitch: optimize operation for key comparison" * tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits) net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave() sfc: add logs explaining XDP_TX/REDIRECT is not available sfc: ensure correct number of XDP queues sfc: fix lack of XDP TX queues - error XDP TX failed (-22) net: fddi: fix UAF in fza_probe net: dsa: sja1105: fix address learning getting disabled on the CPU port net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload net: Use nlmsg_unicast() instead of netlink_unicast() octeontx2-pf: Fix uninitialized boolean variable pps ipv6: allocate enough headroom in ip6_finish_output2() net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific net: bridge: multicast: fix MRD advertisement router port marking race net: bridge: multicast: fix PIM hello router port marking race net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340 dsa: fix for_each_child.cocci warnings virtio_net: check virtqueue_add_sgs() return value mptcp: properly account bulk freed memory selftests: mptcp: fix case multiple subflows limited by server mptcp: avoid processing packet if a subflow reset mptcp: fix syncookie process if mptcp can not_accept new subflow ...
2021-07-14fs: add vfs_parse_fs_param_source() helperChristian Brauner1-0/+2
Add a simple helper that filesystems can use in their parameter parser to parse the "source" parameter. A few places open-coded this function and that already caused a bug in the cgroup v1 parser that we fixed. Let's make it harder to get this wrong by introducing a helper which performs all necessary checks. Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12 Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-12mm: Make copy_huge_page() always availableMatthew Wilcox (Oracle)2-5/+1
Rewrite copy_huge_page() and move it into mm/util.c so it's always available. Fixes an exposure of uninitialised memory on configurations with HUGETLB and UFFD enabled and MIGRATION disabled. Fixes: 8cc5fcbb5be8 ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-11net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340Marek Behún1-5/+1
It seems that we cannot differentiate 88X3310 from 88X3340 by simply looking at bit 3 of revision ID. This only works on revisions A0 and A1. On revision B0, this bit is always 1. Instead use the 3.d00d register for differentiation, since this register contains information about number of ports on the device. Fixes: 9885d016ffa9 ("net: phy: marvell10g: add separate structure for 88X3340") Signed-off-by: Marek Behún <kabel@kernel.org> Reported-by: Matteo Croce <mcroce@linux.microsoft.com> Tested-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-10Merge tag 'kbuild-v5.14' of ↵Linus Torvalds1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Increase the -falign-functions alignment for the debug option. - Remove ugly libelf checks from the top Makefile. - Make the silent build (-s) more silent. - Re-compile the kernel if KBUILD_BUILD_TIMESTAMP is specified. - Various script cleanups * tag 'kbuild-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (27 commits) scripts: add generic syscallnr.sh scripts: check duplicated syscall number in syscall table sparc: syscalls: use pattern rules to generate syscall headers parisc: syscalls: use pattern rules to generate syscall headers nds32: add arch/nds32/boot/.gitignore kbuild: mkcompile_h: consider timestamp if KBUILD_BUILD_TIMESTAMP is set kbuild: modpost: Explicitly warn about unprototyped symbols kbuild: remove trailing slashes from $(KBUILD_EXTMOD) kconfig.h: explain IS_MODULE(), IS_ENABLED() kconfig: constify long_opts scripts/setlocalversion: simplify the short version part scripts/setlocalversion: factor out 12-chars hash construction scripts/setlocalversion: add more comments to -dirty flag detection scripts/setlocalversion: remove workaround for old make-kpkg scripts/setlocalversion: remove mercurial, svn and git-svn supports kbuild: clean up ${quiet} checks in shell scripts kbuild: sink stdout from cmd for silent build init: use $(call cmd,) for generating include/generated/compile.h kbuild: merge scripts/mkmakefile to top Makefile sh: move core-y in arch/sh/Makefile to arch/sh/Kbuild ...
2021-07-10Merge tag 's390-5.14-2' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Vasily Gorbik: - Fix preempt_count initialization. - Rework call_on_stack() macro to add proper type handling and avoid possible register corruption. - More error prone "register asm" removal and fixes. - Fix syscall restarting when multiple signals are coming in. This adds minimalistic trampolines to vdso so we can return from signal without using the stack which requires pgm check handler hacks when NX is enabled. - Remove HAVE_IRQ_EXIT_ON_IRQ_STACK since this is no longer true after switch to generic entry. - Fix protected virtualization secure storage access exception handling. - Make machine check C handler always enter with DAT enabled and move register validation to C code. - Fix tinyconfig boot problem by avoiding MONITOR CALL without CONFIG_BUG. - Increase asm symbols alignment to 16 to make it consistent with compilers. - Enable concurrent access to the CPU Measurement Counter Facility. - Add support for dynamic AP bus size limit and rework ap_dqap to deal with messages greater than recv buffer. * tag 's390-5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits) s390: preempt: Fix preempt_count initialization s390/linkage: increase asm symbols alignment to 16 s390: rename CALL_ON_STACK_NORETURN() to call_on_stack_noreturn() s390: add type checking to CALL_ON_STACK_NORETURN() macro s390: remove old CALL_ON_STACK() macro s390/softirq: use call_on_stack() macro s390/lib: use call_on_stack() macro s390/smp: use call_on_stack() macro s390/kexec: use call_on_stack() macro s390/irq: use call_on_stack() macro s390/mm: use call_on_stack() macro s390: introduce proper type handling call_on_stack() macro s390/irq: simplify on_async_stack() s390/irq: inline do_softirq_own_stack() s390/irq: simplify do_softirq_own_stack() s390/ap: get rid of register asm in ap_dqap() s390: rename PIF_SYSCALL_RESTART to PIF_EXECVE_PGSTE_RESTART s390: move restart of execve() syscall s390/signal: remove sigreturn on stack s390/signal: switch to using vdso for sigreturn and syscall restart ...
2021-07-10Merge tag 'arm-drivers-5.14' of ↵Linus Torvalds8-21/+429
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM driver updates from Olof Johansson: - Reset controllers: Adding support for Microchip Sparx5 Switch. - Memory controllers: ARM Primecell PL35x SMC memory controller driver cleanups and improvements. - i.MX SoC drivers: Power domain support for i.MX8MM and i.MX8MN. - Rockchip: RK3568 power domains support + DT binding updates, cleanups. - Qualcomm SoC drivers: Amend socinfo with more SoC/PMIC details, including support for MSM8226, MDM9607, SM6125 and SC8180X. - ARM FFA driver: "Firmware Framework for ARMv8-A", defining management interfaces and communication (including bus model) between partitions both in Normal and Secure Worlds. - Tegra Memory controller changes, including major rework to deal with identity mappings at boot and integration with ARM SMMU pieces. * tag 'arm-drivers-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (120 commits) firmware: turris-mox-rwtm: add marvell,armada-3700-rwtm-firmware compatible string firmware: turris-mox-rwtm: show message about HWRNG registration firmware: turris-mox-rwtm: fail probing when firmware does not support hwrng firmware: turris-mox-rwtm: report failures better firmware: turris-mox-rwtm: fix reply status decoding function soc: imx: gpcv2: add support for i.MX8MN power domains dt-bindings: add defines for i.MX8MN power domains firmware: tegra: bpmp: Fix Tegra234-only builds iommu/arm-smmu: Use Tegra implementation on Tegra186 iommu/arm-smmu: tegra: Implement SID override programming iommu/arm-smmu: tegra: Detect number of instances at runtime dt-bindings: arm-smmu: Add Tegra186 compatible string firmware: qcom_scm: Add MDM9607 compatible soc: qcom: rpmpd: Add MDM9607 RPM Power Domains soc: renesas: Add support to read LSI DEVID register of RZ/G2{L,LC} SoC's soc: renesas: Add ARCH_R9A07G044 for the new RZ/G2L SoC's dt-bindings: soc: rockchip: drop unnecessary #phy-cells from grf.yaml memory: emif: remove unused frequency and voltage notifiers memory: fsl_ifc: fix leak of private memory on probe failure memory: fsl_ifc: fix leak of IO mapping on probe failure ...
2021-07-10Merge tag 'arm-dt-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/socLinus Torvalds2-1/+2
Pull ARM devicetree updates from Olof Johansson: "Like always, the DT branch is sizable. There are numerous additions and fixes to existing platforms, but also a handful of new ones introduced. Less than some other releases, but there's been significant work on cleanups, refactorings and device enabling on existing platforms. A non-exhaustive list of new material: - Refactoring of BCM2711 dtsi structure to add support for the Raspberry Pi 400 - Rockchip: RK3568 SoC and EVB, video codecs for rk3036/3066/3188/322x - Qualcomm: SA8155p Automotive platform (SM8150 derivative), SM8150/8250 enhancements and support for Sony Xperia 1/1II and 5/5II - TI K3: PCI/USB3 support on AM64-sk boards, R5 remoteproc definitions - TI OMAP: Various cleanups - Tegra: Audio support for Jetson Xavier NX, SMMU support on Tegra194 - Qualcomm: lots of additions for peripherals across several SoCs, and new support for Microsoft Surface Duo (SM8150-based), Huawei Ascend G7. - i.MX: Numerous additions of features across SoCs and boards. - Allwinner: More device bindings for V3s, Forlinx OKA40i-C and NanoPi R1S H5 boards - MediaTek: More device bindings for mt8167, new Chromebook system variants for mt8183 - Renesas: RZ/G2L SoC and EVK added - Amlogic: BananaPi BPI-M5 board added" * tag 'arm-dt-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (511 commits) arm64: dts: rockchip: add basic dts for RK3568 EVB arm64: dts: rockchip: add core dtsi for RK3568 SoC arm64: dts: rockchip: add generic pinconfig settings used by most Rockchip socs ARM: dts: rockchip: add vpu and vdec node for RK322x ARM: dts: rockchip: add vpu nodes for RK3066 and RK3188 ARM: dts: rockchip: add vpu node for RK3036 arm64: dts: ipq8074: Add QUP6 I2C node arm64: dts: rockchip: Re-add regulator-always-on for vcc_sdio for rk3399-roc-pc arm64: dts: rockchip: Re-add regulator-boot-on, regulator-always-on for vdd_gpu on rk3399-roc-pc arm64: dts: rockchip: add ir-receiver for rk3399-roc-pc arm64: dts: rockchip: Add USB-C port details for rk3399 Firefly arm64: dts: rockchip: Sort rk3399 firefly pinmux entries arm64: dts: rockchip: add infrared receiver node to RK3399 Firefly arm64: dts: rockchip: add SPDIF node for rk3399-firefly arm64: dts: rockchip: Add Rotation Property for OGA Panel arm64: dts: qcom: sc7180: bus votes for eMMC and SD card arm64: dts: qcom: sm8250-edo: Add Samsung touchscreen arm64: dts: qcom: sm8250-edo: Enable GPI DMA arm64: dts: qcom: sm8250-edo: Enable ADSP/CDSP/SLPI arm64: dts: qcom: sm8250-edo: Enable PCIe ...
2021-07-10Merge tag 'arm-soc-5.14' of ↵Linus Torvalds7-9/+183
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC updates from Olof Johansson: "A few SoC (code) changes have queued up this cycle, mostly for minor changes and some refactoring and cleanup of legacy platforms. This branch also contains a few of the fixes that weren't sent in by the end of the release (all fairly minor). - Adding an additional maintainer for the TEE subsystem (Sumit Garg) - Quite a significant modernization of the IXP4xx platforms by Linus Walleij, revisiting with a new PCI host driver/binding, removing legacy mach/* include dependencies and moving platform detection/config to drivers/soc. Also some updates/cleanup of platform data. - Core power domain support for Tegra platforms, and some improvements in build test coverage by adding stubs for compile test targets. - A handful of updates to i.MX platforms, adding legacy (non-PSCI) SMP support on i.MX7D, SoC ID setup for i.MX50, removal of platform data and board fixups for iMX6/7. ... and a few smaller changes and fixes for Samsung, OMAP, Allwinner, Rockchip" * tag 'arm-soc-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (53 commits) MAINTAINERS: Add myself as TEE subsystem reviewer ixp4xx: fix spelling mistake in Kconfig "Devce" -> "Device" hw_random: ixp4xx: Add OF support hw_random: ixp4xx: Add DT bindings hw_random: ixp4xx: Turn into a module hw_random: ixp4xx: Use SPDX license tag hw_random: ixp4xx: enable compile-testing pata: ixp4xx: split platform data to its own header soc: ixp4xx: move cpu detection to linux/soc/ixp4xx/cpu.h PCI: ixp4xx: Add a new driver for IXP4xx PCI: ixp4xx: Add device tree bindings for IXP4xx ARM/ixp4xx: Make NEED_MACH_IO_H optional ARM/ixp4xx: Move the virtual IObases MAINTAINERS: ARM/MStar/Sigmastar SoCs: Add a link to the MStar tree ARM: debug: add UART early console support for MSTAR SoCs ARM: dts: ux500: Fix LED probing ARM: imx: add smp support for imx7d ARM: imx6q: drop of_platform_default_populate() from init_machine arm64: dts: rockchip: Update RK3399 PCI host bridge window to 32-bit address memory soc/tegra: fuse: Fix Tegra234-only builds ...
2021-07-10mptcp: avoid processing packet if a subflow resetJianguo Wu1-2/+3
If check_fully_established() causes a subflow reset, it should not continue to process the packet in tcp_data_queue(). Add a return value to mptcp_incoming_options(), and return false if a subflow has been reset, else return true. Then drop the packet in tcp_data_queue()/tcp_rcv_state_process() if mptcp_incoming_options() return false. Fixes: d582484726c4 ("mptcp: fix fallback for MP_JOIN subflows") Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-0/+1
Daniel Borkmann says: ==================== pull-request: bpf 2021-07-09 The following pull-request contains BPF updates for your *net* tree. We've added 9 non-merge commits during the last 9 day(s) which contain a total of 13 files changed, 118 insertions(+), 62 deletions(-). The main changes are: 1) Fix runqslower task->state access from BPF, from SanjayKumar Jeyakumar. 2) Fix subprog poke descriptor tracking use-after-free, from John Fastabend. 3) Fix sparse complaint from prior devmap RCU conversion, from Toke Høiland-Jørgensen. 4) Fix missing va_end in bpftool JIT json dump's error path, from Gu Shengxian. 5) Fix tools/bpf install target from missing runqslower install, from Wei Li. 6) Fix xdpsock BPF sample to unload program on shared umem option, from Wang Hai. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net: validate lwtstate->data before returning from skb_tunnel_info()Taehee Yoo1-1/+3
skb_tunnel_info() returns pointer of lwtstate->data as ip_tunnel_info type without validation. lwtstate->data can have various types such as mpls_iptunnel_encap, etc and these are not compatible. So skb_tunnel_info() should validate before returning that pointer. Splat looks like: BUG: KASAN: slab-out-of-bounds in vxlan_get_route+0x418/0x4b0 [vxlan] Read of size 2 at addr ffff888106ec2698 by task ping/811 CPU: 1 PID: 811 Comm: ping Not tainted 5.13.0+ #1195 Call Trace: dump_stack_lvl+0x56/0x7b print_address_description.constprop.8.cold.13+0x13/0x2ee ? vxlan_get_route+0x418/0x4b0 [vxlan] ? vxlan_get_route+0x418/0x4b0 [vxlan] kasan_report.cold.14+0x83/0xdf ? vxlan_get_route+0x418/0x4b0 [vxlan] vxlan_get_route+0x418/0x4b0 [vxlan] [ ... ] vxlan_xmit_one+0x148b/0x32b0 [vxlan] [ ... ] vxlan_xmit+0x25c5/0x4780 [vxlan] [ ... ] dev_hard_start_xmit+0x1ae/0x6e0 __dev_queue_xmit+0x1f39/0x31a0 [ ... ] neigh_xmit+0x2f9/0x940 mpls_xmit+0x911/0x1600 [mpls_iptunnel] lwtunnel_xmit+0x18f/0x450 ip_finish_output2+0x867/0x2040 [ ... ] Fixes: 61adedf3e3f1 ("route: move lwtunnel state to dst_entry") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09Merge tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-blockLinus Torvalds4-36/+7
Pull more block updates from Jens Axboe: "A combination of changes that ended up depending on both the driver and core branch (and/or the IDE removal), and a few late arriving fixes. In detail: - Fix io ticks wrap-around issue (Chunguang) - nvme-tcp sock locking fix (Maurizio) - s390-dasd fixes (Kees, Christoph) - blk_execute_rq polling support (Keith) - blk-cgroup RCU iteration fix (Yu) - nbd backend ID addition (Prasanna) - Partition deletion fix (Yufen) - Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph) - Removal of now dead block request types due to IDE removal (Christoph) - Loop probing and control device cleanups (Christoph) - Device uevent fix (Christoph) - Misc cleanups/fixes (Tetsuo, Christoph)" * tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits) blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs block: fix the problem of io_ticks becoming smaller nvme-tcp: can't set sk_user_data without write_lock loop: remove unused variable in loop_set_status() block: remove the bdgrab in blk_drop_partitions block: grab a device refcount in disk_uevent s390/dasd: Avoid field over-reading memcpy() dasd: unexport dasd_set_target_state block: check disk exist before trying to add partition ubd: remove dead code in ubd_setup_common nvme: use return value from blk_execute_rq() block: return errors from blk_execute_rq() nvme: use blk_execute_rq() for passthrough commands block: support polling through blk_execute_rq block: remove REQ_OP_SCSI_{IN,OUT} block: mark blk_mq_init_queue_data static loop: rewrite loop_exit using idr_for_each_entry loop: split loop_lookup loop: don't allow deleting an unspecified loop device loop: move loop_ctl_mutex locking into loop_add ...
2021-07-09Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds4-3/+39
Pull virtio,vhost,vdpa updates from Michael Tsirkin: - Doorbell remapping for ifcvf, mlx5 - virtio_vdpa support for mlx5 - Validate device input in several drivers (for SEV and friends) - ZONE_MOVABLE aware handling in virtio-mem - Misc fixes, cleanups * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-mem: prioritize unplug from ZONE_MOVABLE in Big Block Mode virtio-mem: simplify high-level unplug handling in Big Block Mode virtio-mem: prioritize unplug from ZONE_MOVABLE in Sub Block Mode virtio-mem: simplify high-level unplug handling in Sub Block Mode virtio-mem: simplify high-level plug handling in Sub Block Mode virtio-mem: use page_zonenum() in virtio_mem_fake_offline() virtio-mem: don't read big block size in Sub Block Mode virtio/vdpa: clear the virtqueue state during probe vp_vdpa: allow set vq state to initial state after reset virtio-pci library: introduce vp_modern_get_driver_features() vdpa: support packed virtqueue for set/get_vq_state() virtio-ring: store DMA metadata in desc_extra for split virtqueue virtio: use err label in __vring_new_virtqueue() virtio_ring: introduce virtqueue_desc_add_split() virtio_ring: secure handling of mapping errors virtio-ring: factor out desc_extra allocation virtio_ring: rename vring_desc_extra_packed virtio-ring: maintain next in extra state for packed virtqueue vdpa/mlx5: Clear vq ready indication upon device reset vdpa/mlx5: Add support for doorbell bypassing ...
2021-07-09Merge branch 'linus' of ↵Linus Torvalds1-6/+1
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: - Regression fix in drbg due to missing self-test for new default algorithm - Add ratelimit on user-triggerable message in qat - Fix build failure due to missing dependency in sl3516 - Remove obsolete PageSlab checks - Fix bogus hardware register writes on Kunpeng920 in hisilicon/sec * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: hisilicon/sec - fix the process of disabling sva prefetching crypto: sl3516 - Add dependency on ARCH_GEMINI crypto: sl3516 - Typo s/Stormlink/Storlink/ crypto: drbg - self test for HMAC(SHA-512) crypto: omap - Drop obsolete PageSlab check crypto: scatterwalk - Remove obsolete PageSlab check crypto: qat - ratelimit invalid ioctl message and print the invalid cmd
2021-07-09Merge tag 'for-linus-5.14-rc1' of ↵Linus Torvalds3-0/+204
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - Support for optimized routines based on the host CPU - Support for PCI via virtio - Various fixes * tag 'for-linus-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: remove unneeded semicolon in um_arch.c um: Remove the repeated declaration um: fix error return code in winch_tramp() um: fix error return code in slip_open() um: Fix stack pointer alignment um: implement flush_cache_vmap/flush_cache_vunmap um: add a UML specific futex implementation um: enable the use of optimized xor routines in UML um: Add support for host CPU flags and alignment um: allow not setting extra rpaths in the linux binary um: virtio/pci: enable suspend/resume um: add PCI over virtio emulation driver um: irqs: allow invoking time-travel handler multiple times um: time-travel/signals: fix ndelay() in interrupt um: expose time-travel mode to userspace side um: export signals_enabled directly um: remove unused smp_sigio_handler() declaration lib: add iomem emulation (logic_iomem) um: allow disabling NO_IOMEM
2021-07-09Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds1-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Ext4 regression and bug fixes" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: inline jbd2_journal_[un]register_shrinker() ext4: fix flags validity checking for EXT4_IOC_CHECKPOINT ext4: fix possible UAF when remounting r/o a mmp-protected file system ext4: use ext4_grp_locked_error in mb_find_extent ext4: fix WARN_ON_ONCE(!buffer_uptodate) after an error writing the superblock Revert "ext4: consolidate checks for resize of bigalloc into ext4_resize_begin"