summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-02-15net: bridge: remove __br_vlan_filter_toggleVladimir Oltean3-10/+4
This function is identical with br_vlan_filter_toggle. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15Merge branch 'PTP-for-DSA-tag_ocelot_8021q'David S. Miller13-495/+825
Vladimir Oltean says: ==================== PTP for DSA tag_ocelot_8021q Changes in v2: Add stub definition for ocelot_port_inject_frame when switch driver is not compiled in. This is part two of the errata workaround begun here: https://patchwork.kernel.org/project/netdevbpf/cover/20210129010009.3959398-1-olteanv@gmail.com/ Now that we have basic traffic support when we operate the Ocelot DSA switches without an NPI port, it would be nice to regain some of the features lost due to the lack of the NPI port functionality. An important one is PTP timestamping, which is intimately tied to the DSA frame header added by the NPI port: on TX, we put a "timestamp request ID" in the Injection Frame Header, while on RX, the Extraction Frame Header contains a partial 32-bit PTP timestamp. Get rid of the NPI port and replace it with a VLAN-based tagger, and you lose PTP, right? Well, not quite, this is what this patch series is about. The NPI port is basically a regular Ethernet port configured to service the packets in and out of the switch's CPU port module (which has other non-DSA I/O mechanisms too, such as register-based MMIO and DMA). If we disable the NPI port, we can in theory still access the packets delivered to the CPU port module by doing exactly what the ocelot switchdev driver does: extracting Ethernet packets through registers (yes, it is as icky as it sounds). However, there's a catch. The Felix switch was integrated into NXP LS1028A with the idea in mind that it will operate as DSA, i.e. using the CPU port module connected to the NPI port, not having I/O over register-based MMIO which is painfully slow and CPU intensive. So register-based packet I/O not supposed to work - those registers aren't even documented in the hardware reference manual for Felix. However they kinda do, with the exception of the fact that an RX interrupt was really not wired to the CPU cores - so we don't know when the CPU port module receives a new packet. But we can hack even around that, by replicating every packet that goes to the CPU port module and making it also go to a plain internal Ethernet port. Then drop the Ethernet packet and read the other copy of it from the CPU port module, this time annotated with the much-wanted RX timestamp. This is all fine and it works, but it does raise some questions about what DSA even is anymore, if we start having switches that inject some of their packets over Ethernet and some through registers, where do we draw the line. In principle I believe these concerns are founded, but at the same time, the way that the Felix driver uses register MMIO based packet I/O is fundamentally the same as any other DSA driver capable of PTP makes use of a side-channel for timestamps like a FIFO (just that this one is a lot more complicated, and comes with the entire actual packet, not just the timestamp). Nonetheless, I tried to keep the extra pressure added by this ERR workaround upon the DSA subsystem as small as possible, so some of the patches are just a revisit of some of Andrew's complaints w.r.t. the fact that tag_ocelot already violates any driver <-> tagger boundary, and as a consequence, is not able to be used on testbeds such as dsa_loop (which it now can). So now, the tag_ocelot and tag_ocelot_8021q drivers should be dsa_loop-clean, and have the ERR workarounds as self-contained as possible, using all the designated features for PTP timestamping and nothing more. Comments appreciated. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: dsa: tag_ocelot_8021q: add support for PTP timestampingVladimir Oltean5-2/+115
For TX timestamping, we use the felix_txtstamp method which is common with the regular (non-8021q) ocelot tagger. This method says that skb deferral is needed, prepares a timestamp request ID, and puts a clone of the skb in a queue waiting for the timestamp IRQ. felix_txtstamp is called by dsa_skb_tx_timestamp() just before the tagger's xmit method. In the tagger xmit, we divert the packets classified by dsa_skb_tx_timestamp() as PTP towards the MMIO-based injection registers, and we declare them as dead towards dsa_slave_xmit. If not PTP, we proceed with normal tag_8021q stuff. Then the timestamp IRQ fires, the clone queued up from felix_txtstamp is matched to the TX timestamp retrieved from the switch's FIFO based on the timestamp request ID, and the clone is delivered to the stack. On RX, thanks to the VCAP IS2 rule that redirects the frames with an EtherType for 1588 towards two destinations: - the CPU port module (for MMIO based extraction) and - if the "no XTR IRQ" workaround is in place, the dsa_8021q CPU port the relevant data path processing starts in the ptp_classify_raw BPF classifier installed by DSA in the RX data path (post tagger, which is completely unaware that it saw a PTP packet). This time we can't reuse the same implementation of .port_rxtstamp that also works with the default ocelot tagger. That is because felix_rxtstamp is given an skb with a freshly stripped DSA header, and it says "I don't need deferral for its RX timestamp, it's right in it, let me show you"; and it just points to the header right behind skb->data, from where it unpacks the timestamp and annotates the skb with it. The same thing cannot happen with tag_ocelot_8021q, because for one thing, the skb did not have an extraction frame header in the first place, but a VLAN tag with no timestamp information. So the code paths in felix_rxtstamp for the regular and 8021q tagger are completely independent. With tag_8021q, the timestamp must come from the packet's duplicate delivered to the CPU port module, but there is potentially complex logic to be handled [ and prone to reordering ] if we were to just start reading packets from the CPU port module, and try to match them to the one we received over Ethernet and which needs an RX timestamp. So we do something simple: we tell DSA "give me some time to think" (we request skb deferral by returning false from .port_rxtstamp) and we just drop the frame we got over Ethernet with no attempt to match it to anything - we just treat it as a notification that there's data to be processed from the CPU port module's queues. Then we proceed to read the packets from those, one by one, which we deliver up the stack, timestamped, using netif_rx - the same function that any driver would use anyway if it needed RX timestamp deferral. So the assumption is that we'll come across the PTP packet that triggered the CPU extraction notification eventually, but we don't know when exactly. Thanks to the VCAP IS2 trap/redirect rule and the exclusion of the CPU port module from the flooding replicators, only PTP frames should be present in the CPU port module's RX queues anyway. There is just one conflict between the VCAP IS2 trapping rule and the semantics of the BPF classifier. Namely, ptp_classify_raw() deems general messages as non-timestampable, but still, those are trapped to the CPU port module since they have an EtherType of ETH_P_1588. So, if the "no XTR IRQ" workaround is in place, we need to run another BPF classifier on the frames extracted over MMIO, to avoid duplicates being sent to the stack (once over Ethernet, once over MMIO). It doesn't look like it's possible to install VCAP IS2 rules based on keys extracted from the 1588 frame headers. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: dsa: felix: setup MMIO filtering rules for PTP when using tag_8021qVladimir Oltean4-3/+143
Since the tag_8021q tagger is software-defined, it has no means by itself for retrieving hardware timestamps of PTP event messages. Because we do want to support PTP on ocelot even with tag_8021q, we need to use the CPU port module for that. The RX timestamp is present in the Extraction Frame Header. And because we can't use NPI mode which redirects the CPU queues to an "external CPU" (meaning the ARM CPU running Linux), then we need to poll the CPU port module through the MMIO registers to retrieve TX and RX timestamps. Sadly, on NXP LS1028A, the Felix switch was integrated into the SoC without wiring the extraction IRQ line to the ARM GIC. So, if we want to be notified of any PTP packets received on the CPU port module, we have a problem. There is a possible workaround, which is to use the Ethernet CPU port as a notification channel that packets are available on the CPU port module as well. When a PTP packet is received by the DSA tagger (without timestamp, of course), we go to the CPU extraction queues, poll for it there, then we drop the original Ethernet packet and masquerade the packet retrieved over MMIO (plus the timestamp) as the original when we inject it up the stack. Create a quirk in struct felix is selected by the Felix driver (but not by Seville, since that doesn't support PTP at all). We want to do this such that the workaround is minimally invasive for future switches that don't require this workaround. The only traffic for which we need timestamps is PTP traffic, so add a redirection rule to the CPU port module for this. Currently we only have the need for PTP over L2, so redirection rules for UDP ports 319 and 320 are TBD for now. Note that for the workaround of matching of PTP-over-Ethernet-port with PTP-over-MMIO queues to work properly, both channels need to be absolutely lossless. There are two parts to achieving that: - We keep flow control enabled on the tag_8021q CPU port - We put the DSA master interface in promiscuous mode, so it will never drop a PTP frame (for the profiles we are interested in, these are sent to the multicast MAC addresses of 01-80-c2-00-00-0e and 01-1b-19-00-00-00). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: mscc: ocelot: refactor ocelot_xtr_irq_handler into ocelot_xtr_pollVladimir Oltean3-135/+163
Since the felix DSA driver will need to poll the CPU port module for extracted frames as well, let's create some common functions that read an Extraction Frame Header, and then an skb, from a CPU extraction group. We abuse the struct ocelot_ops :: port_to_netdev function a little bit, in order to retrieve the DSA port net_device or the ocelot switchdev net_device based on the source port information from the Extraction Frame Header, but it's all in the benefit of code simplification - netdev_alloc_skb needs it. Originally, the port_to_netdev method was intended for parsing act->dev from tc flower offload code. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: dsa: tag_ocelot: create separate tagger for SevilleVladimir Oltean7-83/+70
The ocelot tagger is a hot mess currently, it relies on memory initialized by the attached driver for basic frame transmission. This is against all that DSA tagging protocols stand for, which is that the transmission and reception of a DSA-tagged frame, the data path, should be independent from the switch control path, because the tag protocol is in principle hot-pluggable and reusable across switches (even if in practice it wasn't until very recently). But if another driver like dsa_loop wants to make use of tag_ocelot, it couldn't. This was done to have common code between Felix and Ocelot, which have one bit difference in the frame header format. Quoting from commit 67c2404922c2 ("net: dsa: felix: create a template for the DSA tags on xmit"): Other alternatives have been analyzed, such as: - Create a separate tag_seville.c: too much code duplication for just 1 bit field difference. - Create a separate DSA_TAG_PROTO_SEVILLE under tag_ocelot.c, just like tag_brcm.c, which would have a separate .xmit function. Again, too much code duplication for just 1 bit field difference. - Allocate the template from the init function of the tag_ocelot.c module, instead of from the driver: couldn't figure out a method of accessing the correct port template corresponding to the correct tagger in the .xmit function. The really interesting part is that Seville should have had its own tagging protocol defined - it is not compatible on the wire with Ocelot, even for that single bit. In principle, a packet generated by DSA_TAG_PROTO_OCELOT when booted on NXP LS1028A would look in a certain way, but when booted on NXP T1040 it would look differently. The reverse is also true: a packet generated by a Seville switch would be interpreted incorrectly by Wireshark if it was told it was generated by an Ocelot switch. Actually things are a bit more nuanced. If we concentrate only on the DSA tag, what I said above is true, but Ocelot/Seville also support an optional DSA tag prefix, which can be short or long, and it is possible to distinguish the two taggers based on an integer constant put in that prefix. Nonetheless, creating a separate tagger is still justified, since the tag prefix is optional, and without it, there is again no way to distinguish. Claiming backwards binary compatibility is a bit more tough, since I've already changed the format of tag_ocelot once, in commit 5124197ce58b ("net: dsa: tag_ocelot: use a short prefix on both ingress and egress"). Therefore I am not very concerned with treating this as a bugfix and backporting it to stable kernels (which would be another mess due to the fact that there would be lots of conflicts with the other DSA_TAG_PROTO* definitions). It's just simpler to say that the string values of the taggers have ABI value starting with kernel 5.12, which will be when the changing of tag protocol via /sys/class/net/<dsa-master>/dsa/tagging goes live. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: dsa: tag_ocelot: single out PTP-related transmit tag processingVladimir Oltean1-11/+21
There is one place where we cannot avoid accessing driver data, and that is 2-step PTP TX timestamping, since the switch wants us to provide a timestamp request ID through the injection header, which naturally must come from a sequence number kept by the driver (it is generated by the .port_txtstamp method prior to the tagger's xmit). However, since other drivers like dsa_loop do not claim PTP support anyway, the DSA_SKB_CB(skb)->clone will always be NULL anyway, so if we move all PTP-related dereferences of struct ocelot and struct ocelot_port into a separate function, we can effectively ensure that this is dead code when the ocelot tagger is attached to non-ocelot switches, and the stateful portion of the tagger is more self-contained. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: mscc: ocelot: use common tag parsing code with DSAVladimir Oltean9-227/+246
The Injection Frame Header and Extraction Frame Header that the switch prepends to frames over the NPI port is also prepended to frames delivered over the CPU port module's queues. Let's unify the handling of the frame headers by making the ocelot driver call some helpers exported by the DSA tagger. Among other things, this allows us to get rid of the strange cpu_to_be32 when transmitting the Injection Frame Header on ocelot, since the packing API uses network byte order natively (when "quirks" is 0). The comments above ocelot_gen_ifh talk about setting pop_cnt to 3, and the cpu extraction queue mask to something, but the code doesn't do it, so we don't do it either. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: dsa: tag_ocelot: avoid accessing ds->priv in ocelot_rcvVladimir Oltean1-4/+3
Taggers should be written to do something valid irrespective of the switch driver that they are attached to. This is even more true now, because since the introduction of the .change_tag_protocol method, a certain tagger is not necessarily strictly associated with a driver any longer, and I would like to be able to test all taggers with dsa_loop in the future. In the case of ocelot, it needs to move the classified VLAN from the DSA tag into the skb if the port is VLAN-aware. We can allow it to do that by looking at the dp->vlan_filtering property, no need to invoke structures which are specific to ocelot. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: mscc: ocelot: refactor ocelot_port_inject_frame out of ocelot_port_xmitVladimir Oltean3-74/+109
The felix DSA driver will inject some frames through register MMIO, same as ocelot switchdev currently does. So we need to be able to reuse the common code. Also create some shim definitions, since the DSA tagger can be compiled without support for the switch driver. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: mscc: ocelot: use DIV_ROUND_UP helper in ocelot_port_inject_frameVladimir Oltean1-1/+1
This looks a bit nicer than the open-coded "(x + 3) % 4" idiom. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: mscc: ocelot: better error handling in ocelot_xtr_irq_handlerVladimir Oltean1-10/+12
The ocelot_rx_frame_word() function can return a negative error code, however this isn't being checked for consistently. Errors being ignored have not been seen in practice though. Also, some constructs can be simplified by using "goto" instead of repeated "break" statements. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: mscc: ocelot: only drain extraction queue on errorVladimir Oltean1-1/+1
It appears that the intention of this snippet of code is to not exit ocelot_xtr_irq_handler() while in the middle of extracting a frame. The problem in extracting it word by word is that future extraction attempts are really easy to get desynchronized, since the IRQ handler assumes that the first 16 bytes are the IFH, which give further information about the frame, such as frame length. But during normal operation, "err" will not be 0, but 4, set from here: for (i = 0; i < OCELOT_TAG_LEN / 4; i++) { err = ocelot_rx_frame_word(ocelot, grp, true, &ifh[i]); if (err != 4) break; } if (err != 4) break; In that case, draining the extraction queue is a no-op. So explicitly make this code execute only on negative err. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15net: mscc: ocelot: stop returning IRQ_NONE in ocelot_xtr_irq_handlerVladimir Oltean1-5/+2
Since the xtr (extraction) IRQ of the ocelot switch is not shared, then if it fired, it means that some data must be present in the queues of the CPU port module. So simplify the code. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15Merge branch 'bnxt_en-next'David S. Miller3-61/+290
Michael Chan says: ==================== bnxt_en: Error recovery optimizations. This series implements some optimizations to error recovery. One patch adds an echo/reply mechanism with firmware to enhance error detection. The other patches speed up the recovery process by polling config space earlier and to selectively initialize context memory during re-initialization. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15bnxt_en: Improve logging of error recovery settings information.Michael Chan1-7/+8
We currently only log the error recovery settings if it is enabled. In some cases, firmware disables error recovery after it was initially enabled. Without logging anything, the user will not be aware of this change in setting. Log it when error recovery is disabled. Also, change the reset count value from hexadecimal to decimal. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15bnxt_en: Reply to firmware's echo request async message.Michael Chan2-0/+32
This is a new async message that the firmware can send to check if it can communicate with the driver. This is an added error detection scheme that firmware can use if it suspects errors in the PCIe interface. When the driver receives this async message, it will reply back echoing some data in the async message. If the firmware is not getting the reply with the proper data after some retries, error recovery will kick in. Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15bnxt_en: Initialize "context kind" field for context memory blocks.Michael Chan1-5/+42
If firmware provides the offset to the "context kind" field of the relevant context memory blocks, we'll initialize just that field for each block instead of initializing all of context memory. Populate the bnxt_mem_init structure with the proper offset returned by firmware. If it is older firmware and the information is not available, we set the offset to an invalid value and fall back to the old behavior of initializing every byte. Otherwise, we initialize only the "context kind" byte at the offset. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15bnxt_en: Add context memory initialization infrastructure.Michael Chan2-18/+53
Currently, the driver calls memset() to set all relevant context memory used by the chip to the initial value. This can take many milliseconds with the potentially large number of context pages allocated for the chip. To make this faster, we only need to initialize the "context kind" field of each block of context memory. This patch sets up the infrastructure to do that with the bnxt_mem_init structure. In the next patch, we'll add the logic to obtain the offset of the "context kind" from the firmware. This patch is not changing the current behavior of calling memset() to initialize all relevant context memory. Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15bnxt_en: Implement faster recovery for firmware fatal error.Michael Chan1-0/+19
During some fatal firmware error conditions, the PCI config space register 0x2e which normally contains the subsystem ID will become 0xffff. This register will revert back to the normal value after the chip has completed core reset. If we detect this condition, we can poll this config register immediately for the value to revert. Because we use config read cycles to poll this register, there is no possibility of Master Abort if we happen to read it during core reset. This speeds up recovery significantly as we don't have to wait for the conservative min_time before polling MMIO to see if the firmware has come out of reset. As soon as this register changes value we can proceed to re-initialize the device. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15bnxt_en: selectively allocate context memoriesEdwin Peer1-32/+52
Newer devices may have local context memory instead of relying on the host for backing store. In these cases, HWRM_FUNC_BACKING_STORE_QCAPS will return a zero entry size to indicate contexts for which the host should not allocate backing store. Selectively allocate context memory based on device capabilities and only enable backing store for the appropriate contexts. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15bnxt_en: Update firmware interface spec to 1.10.2.16.Michael Chan2-11/+96
The main changes are the echo request/response from firmware for error detection and the NO_FCS feature to transmit frames without FCS. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14Merge branch 'skbuff-introduce-skbuff_heads-bulking-and-reusing'David S. Miller3-206/+242
Alexander Lobakin says: ==================== skbuff: introduce skbuff_heads bulking and reusing Currently, all sorts of skb allocation always do allocate skbuff_heads one by one via kmem_cache_alloc(). On the other hand, we have percpu napi_alloc_cache to store skbuff_heads queued up for freeing and flush them by bulks. We can use this cache not only for bulk-wiping, but also to obtain heads for new skbs and avoid unconditional allocations, as well as for bulk-allocating (like XDP's cpumap code and veth driver already do). As this might affect latencies, cache pressure and lots of hardware and driver-dependent stuff, this new feature is mostly optional and can be issued via: - a new napi_build_skb() function (as a replacement for build_skb()); - existing {,__}napi_alloc_skb() and napi_get_frags() functions; - __alloc_skb() with passing SKB_ALLOC_NAPI in flags. iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing VLAN NAT on 1.2 GHz MIPS board. The boost is likely to be bigger on more powerful hosts and NICs with tens of Mpps. Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs: - kmalloc()/kmem_cache_alloc() itself allows by default allocating memory from the remote nodes to defragment their slabs. This is controlled by sysctl, but according to this, skbuff_head from a remote node is an OK case; - The easiest way to check if the slab of skbuff_head is remote or pfmemalloc'ed is: if (!dev_page_is_reusable(virt_to_head_page(skb))) /* drop it */; ...*but*, regarding that most slabs are built of compound pages, virt_to_head_page() will hit unlikely-branch every single call. This check costed at least 20 Mbps in test scenarios and seems like it'd be better to _not_ do this. Since v5 [4]: - revert flags-to-bool conversion and simplify flags testing in __alloc_skb() (Alexander Duyck). Since v4 [3]: - rebase on top of net-next and address kernel build robot issue; - reorder checks a bit in __alloc_skb() to make new condition even more harmless. Since v3 [2]: - make the feature mostly optional, so driver developers could decide whether to use it or not (Paolo Abeni). This reuses the old flag for __alloc_skb() and introduces a new napi_build_skb(); - reduce bulk-allocation size from 32 to 16 elements (also Paolo). This equals to the value of XDP's devmap and veth batch processing (which were tested a lot) and should be sane enough; - don't waste cycles on explicit in_serving_softirq() check. Since v2 [1]: - also cover {,__}alloc_skb() and {,__}build_skb() cases (became handy after the changes that pass tiny skbs requests to kmalloc layer); - cover the cache with KASAN instrumentation (suggested by Eric Dumazet, help of Dmitry Vyukov); - completely drop redundant __kfree_skb_flush() (also Eric); - lots of code cleanups; - expand the commit message with NUMA and pfmemalloc points (Jakub). Since v1 [0]: - use one unified cache instead of two separate to greatly simplify the logics and reduce hotpath overhead (Edward Cree); - new: recycle also GRO_MERGED_FREE skbs instead of immediate freeing; - correct performance numbers after optimizations and performing lots of tests for different use cases. [0] https://lore.kernel.org/netdev/20210111182655.12159-1-alobakin@pm.me [1] https://lore.kernel.org/netdev/20210113133523.39205-1-alobakin@pm.me [2] https://lore.kernel.org/netdev/20210209204533.327360-1-alobakin@pm.me [3] https://lore.kernel.org/netdev/20210210162732.80467-1-alobakin@pm.me [4] https://lore.kernel.org/netdev/20210211185220.9753-1-alobakin@pm.me ==================== Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeingAlexander Lobakin3-11/+11
napi_frags_finish() and napi_skb_finish() can only be called inside NAPI Rx context, so we can feed NAPI cache with skbuff_heads that got NAPI_MERGED_FREE verdict instead of immediate freeing. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs to NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14skbuff: allow to use NAPI cache from __napi_alloc_skb()Alexander Lobakin1-2/+3
{,__}napi_alloc_skb() is mostly used either for optional non-linear receive methods (usually controlled via Ethtool private flags and off by default) and/or for Rx copybreaks. Use __napi_build_skb() here for obtaining skbuff_heads from NAPI cache instead of inplace allocations. This includes both kmalloc and page frag paths. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14skbuff: allow to optionally use NAPI cache from __alloc_skb()Alexander Lobakin1-1/+5
Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get an skbuff_head from the NAPI cache instead of inplace allocation inside __alloc_skb(). This implies that the function is called from softirq or BH-off context, not for allocating a clone or from a distant node. Cc: Alexander Duyck <alexander.duyck@gmail.com> # Simplified flags check Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache headsAlexander Lobakin2-13/+83
Instead of just bulk-flushing skbuff_heads queued up through napi_consume_skb() or __kfree_skb_defer(), try to reuse them on allocation path. If the cache is empty on allocation, bulk-allocate the first 16 elements, which is more efficient than per-skb allocation. If the cache is full on freeing, bulk-wipe the second half of the cache (32 elements). This also includes custom KASAN poisoning/unpoisoning to be double sure there are no use-after-free cases. To not change current behaviour, introduce a new function, napi_build_skb(), to optionally use a new approach later in drivers. Note on selected bulk size, 16: - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE and especially VETH_XDP_BATCH, which is also used to bulk-allocate skbuff_heads and was tested on powerful setups; - this also showed the best performance in the actual test series (from the array of {8, 16, 32}). Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Divide on two halves Suggested-by: Eric Dumazet <edumazet@google.com> # KASAN poisoning Cc: Dmitry Vyukov <dvyukov@google.com> # Help with KASAN Cc: Paolo Abeni <pabeni@redhat.com> # Reduced batch size Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14skbuff: move NAPI cache declarations upper in the fileAlexander Lobakin1-45/+45
NAPI cache structures will be used for allocating skbuff_heads, so move their declarations a bit upper. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14skbuff: remove __kfree_skb_flush()Alexander Lobakin3-19/+1
This function isn't much needed as NAPI skb queue gets bulk-freed anyway when there's no more room, and even may reduce the efficiency of bulk operations. It will be even less needed after reusing skb cache on allocation path, so remove it and this way lighten network softirqs a bit. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14skbuff: use __build_skb_around() in __alloc_skb()Alexander Lobakin1-17/+1
Just call __build_skb_around() instead of open-coding it. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14skbuff: simplify __alloc_skb() a bitAlexander Lobakin1-6/+5
Use unlikely() annotations for skbuff_head and data similarly to the two other allocation functions and remove totally redundant goto. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14skbuff: make __build_skb_around() return voidAlexander Lobakin1-7/+6
__build_skb_around() can never fail and always returns passed skb. Make it return void to simplify and optimize the code. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14skbuff: simplify kmalloc_reserve()Alexander Lobakin1-5/+2
Eversince the introduction of __kmalloc_reserve(), "ip" argument hasn't been used. _RET_IP_ is embedded inside kmalloc_node_track_caller(). Remove the redundant macro and rename the function after it. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14skbuff: move __alloc_skb() next to the other skb allocation functionsAlexander Lobakin1-142/+142
In preparation before reusing several functions in all three skb allocation variants, move __alloc_skb() next to the __netdev_alloc_skb() and __napi_alloc_skb(). No functional changes. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13Merge branch 'Xilinx-axienet-updates'David S. Miller3-18/+83
Robert Hancock says: ==================== Xilinx axienet updates Updates to the Xilinx AXI Ethernet driver to add support for an additional ethtool operation, and to support dynamic switching between 1000BaseX and SGMII interface modes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13net: axienet: Support dynamic switching between 1000BaseX and SGMIIRobert Hancock2-18/+71
Newer versions of the Xilinx AXI Ethernet core (specifically version 7.2 or later) allow the core to be configured with a PHY interface mode of "Both", allowing either 1000BaseX or SGMII modes to be selected at runtime. Add support for this in the driver to allow better support for applications which can use both fiber and copper SFP modules. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13dt-bindings: net: xilinx_axienet: add xlnx,switch-x-sgmii attributeRobert Hancock1-0/+4
Document the new xlnx,switch-x-sgmii attribute which is used to indicate that the Ethernet core supports dynamic switching between 1000BaseX and SGMII. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13net: axienet: hook up nway_reset ethtool operationRobert Hancock1-0/+8
Hook up the nway_reset ethtool operation to the corresponding phylink function so that "ethtool -r" can be supported. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13Merge branch 'tcp-mem-pressure-vs-SO_RCVLOWAT'David S. Miller3-24/+26
Eric Dumazet says: ==================== tcp: mem pressure vs SO_RCVLOWAT First patch fixes an issue for applications using SO_RCVLOWAT to reduce context switches. Second patch is a cleanup. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13tcp: factorize logic into tcp_epollin_ready()Eric Dumazet3-22/+19
Both tcp_data_ready() and tcp_stream_is_readable() share the same logic. Add tcp_epollin_ready() helper to avoid duplication. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13tcp: fix SO_RCVLOWAT related hangs under mem pressureEric Dumazet1-2/+7
While commit 24adbc1676af ("tcp: fix SO_RCVLOWAT hangs with fat skbs") fixed an issue vs too small sk_rcvbuf for given sk_rcvlowat constraint, it missed to address issue caused by memory pressure. 1) If we are under memory pressure and socket receive queue is empty. First incoming packet is allowed to be queued, after commit 76dfa6082032 ("tcp: allow one skb to be received per socket under memory pressure") But we do not send EPOLLIN yet, in case tcp_data_ready() sees sk_rcvlowat is bigger than skb length. 2) Then, when next packet comes, it is dropped, and we directly call sk->sk_data_ready(). 3) If application is using poll(), tcp_poll() will then use tcp_stream_is_readable() and decide the socket receive queue is not yet filled, so nothing will happen. Even when sender retransmits packets, phases 2) & 3) repeat and flow is effectively frozen, until memory pressure is off. Fix is to consider tcp_under_memory_pressure() to take care of global memory pressure or memcg pressure. Fixes: 24adbc1676af ("tcp: fix SO_RCVLOWAT hangs with fat skbs") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Arjun Roy <arjunroy@google.com> Suggested-by: Wei Wang <weiwan@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13Merge branch '40GbE' of ↵David S. Miller10-136/+86
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== 40GbE Intel Wired LAN Driver Updates 2021-02-12 This series contains updates to i40e, ice, and ixgbe drivers. Maciej does cleanups on the following drivers. For i40e, removes redundant check for XDP prog, cleans up no longer relevant information, and removes an unused function argument. For ice, removes local variable use, instead returning values directly. Moves skb pointer from buffer to ring and removes an unneeded check for xdp_prog in zero copy path. Also removes a redundant MTU check when changing it. For i40e, ice, and ixgbe, stores the rx_offset in the Rx ring as the value is constant so there's no need for continual calls. Bjorn folds a decrement into a while statement. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13Merge branch 'tc-mpls-selftests'David S. Miller3-1/+348
Guillaume Nault says: ==================== selftests: tc: Test tc-flower's MPLS features A couple of patches for exercising the MPLS filters of tc-flower. Patch 1 tests basic MPLS matching features: those that only work on the first label stack entry (that is, the mpls_label, mpls_tc, mpls_bos and mpls_ttl options). Patch 2 tests the more generic "mpls" and "lse" options, which allow matching MPLS fields beyond the first stack entry. In both patches, special care is taken to skip these new tests for incompatible versions of tc. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13selftests: tc: Add generic mpls matching support for tc-flowerGuillaume Nault2-1/+162
Add tests in tc_flower.sh for generic matching on MPLS Label Stack Entries. The label, tc, bos and ttl fields are tested for the first and second labels. For each field, the minimal and maximal values are tested (the former at depth 1 and the later at depth 2). There are also tests for matching the presence of a label stack entry at a given depth. In order to reduce the amount of code, all "lse" subcommands are tested in match_mpls_lse_test(). Action "continue" is used, so that test packets are evaluated by all filters. Then, we can verify if each filter matched the expected number of packets. Some versions of tc-flower produced invalid json output when dumping MPLS filters with depth > 1. Skip the test if tc isn't recent enough. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13selftests: tc: Add basic mpls_* matching support for tc-flowerGuillaume Nault3-1/+187
Add tests in tc_flower.sh for mpls_label, mpls_tc, mpls_bos and mpls_ttl. For each keyword, test the minimal and maximal values. Selectively skip these new mpls tests for tc versions that don't support them. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13Merge branch 'brport-flags'David S. Miller29-331/+889
Vladimir Oltean says: ==================== Cleanup in brport flags switchdev offload for DSA The initial goal of this series was to have better support for standalone ports mode on the DSA drivers like ocelot/felix and sja1105. This turned out to require some API adjustments in both directions: to the information presented to and by the switchdev notifier, and to the API presented to the switch drivers by the DSA layer. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13net: dsa: sja1105: offload bridge port flags to deviceVladimir Oltean3-11/+219
The chip can configure unicast flooding, broadcast flooding and learning. Learning is per port, while flooding is per {ingress, egress} port pair and we need to configure the same value for all possible ingress ports towards the requested one. While multicast flooding is not officially supported, we can hack it by using a feature of the second generation (P/Q/R/S) devices, which is that FDB entries are maskable, and multicast addresses always have an odd first octet. So by putting a match-all for 00:01:00:00:00:00 addr and 00:01:00:00:00:00 mask at the end of the FDB, we make sure that it is always checked last, and does not take precedence in front of any other MDB. So it behaves effectively as an unknown multicast entry. For the first generation switches, this feature is not available, so unknown multicast will always be treated the same as unknown unicast. So the only thing we can do is request the user to offload the settings for these 2 flags in tandem, i.e. ip link set swp2 type bridge_slave flood off Error: sja1105: This chip cannot configure multicast flooding independently of unicast. ip link set swp2 type bridge_slave flood off mcast_flood off ip link set swp2 type bridge_slave mcast_flood on Error: sja1105: This chip cannot configure multicast flooding independently of unicast. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13net: mscc: ocelot: offload bridge port flags to deviceVladimir Oltean4-5/+158
We should not be unconditionally enabling address learning, since doing that is actively detrimential when a port is standalone and not offloading a bridge. Namely, if a port in the switch is standalone and others are offloading the bridge, then we could enter a situation where we learn an address towards the standalone port, but the bridged ports could not forward the packet there, because the CPU is the only path between the standalone and the bridged ports. The solution of course is to not enable address learning unless the bridge asks for it. We need to set up the initial port flags for no learning and flooding everything, and also when the port joins and leaves the bridge. The flood configuration was already configured ok for standalone mode in ocelot_init, we just need to disable learning in ocelot_init_port. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13net: mscc: ocelot: use separate flooding PGID for broadcastVladimir Oltean3-12/+18
In preparation of offloading the bridge port flags which have independent settings for unknown multicast and for broadcast, we should also start reserving one destination Port Group ID for the flooding of broadcast packets, to allow configuring it individually. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13net: dsa: felix: restore multicast flood to CPU when NPI tagger reinitializesVladimir Oltean1-0/+1
ocelot_init sets up PGID_MC to include the CPU port module, and that is fine, but the ocelot-8021q tagger removes the CPU port module from the unknown multicast replicator. So after a transition from the default ocelot tagger towards ocelot-8021q and then again towards ocelot, multicast flooding towards the CPU port module will be disabled. Fixes: e21268efbe26 ("net: dsa: felix: perform switch setup for tag_8021q") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>