summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/mellanox/mlx4/en_rx.c
AgeCommit message (Collapse)AuthorFilesLines
2014-12-11net/mlx4: Add A0 hybrid steeringMatan Barak1-1/+2
A0 hybrid steering is a form of high performance flow steering. By using this mode, mlx4 cards use a fast limited table based steering, in order to enable fast steering of unicast packets to a QP. In order to implement A0 hybrid steering we allocate resources from different zones: (1) General range (2) Special MAC-assigned QPs [RSS, Raw-Ethernet] each has its own region. When we create a rss QP or a raw ethernet (A0 steerable and BF ready) QP, we try hard to allocate the QP from range (2). Otherwise, we try hard not to allocate from this range. However, when the system is pushed to its limits and one needs every resource, the allocator uses every region it can. Meaning, when we run out of raw-eth qps, the allocator allocates from the general range (and the special-A0 area is no longer active). If we run out of RSS qps, the mechanism tries to allocate from the raw-eth QP zone. If that is also exhausted, the allocator will allocate from the general range (and the A0 region is no longer active). Note that if a raw-eth qp is allocated from the general range, it attempts to allocate the range such that bits 6 and 7 (blueflame bits) in the QP number are not set. When the feature is used in SRIOV, the VF has to notify the PF what kind of QP attributes it needs. In order to do that, along with the "Eth QP blueflame" bit, we reserve a new "A0 steerable QP". According to the combination of these bits, the PF tries to allocate a suitable QP. In order to maintain backward compatibility (with older PFs), the PF notifies which QP attributes it supports via QUERY_FUNC_CAP command. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4: Change QP allocation schemeEugenia Emantayev1-2/+2
When using BF (Blue-Flame), the QPN overrides the VLAN, CV, and SV fields in the WQE. Thus, BF may only be used for QPNs with bits 6,7 unset. The current Ethernet driver code reserves a Tx QP range with 256b alignment. This is wrong because if there are more than 64 Tx QPs in use, QPNs >= base + 65 will have bits 6/7 set. This problem is not specific for the Ethernet driver, any entity that tries to reserve more than 64 BF-enabled QPs should fail. Also, using ranges is not necessary here and is wasteful. The new mechanism introduced here will support reservation for "Eth QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs (when hypervisors support WC in VMs). The flow we use is: 1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation, and request "BF enabled QPs" if BF is supported for the function 2. In the ALLOC_RES FW command, change param1 to: a. param1[23:0] - number of QPs b. param1[31-24] - flags controlling QPs reservation Bit 31 refers to Eth blueflame supported QPs. Those QPs must have bits 6 and 7 unset in order to be used in Ethernet. Bits 24-30 of the flags are currently reserved. When a function tries to allocate a QP, it states the required attributes for this QP. Those attributes are considered "best-effort". If an attribute, such as Ethernet BF enabled QP, is a must-have attribute, the function has to check that attribute is supported before trying to do the allocation. In a lower layer of the code, mlx4_qp_reserve_range masks out the bits which are unsupported. If SRIOV is used, the PF validates those attributes and masks out unsupported attributes as well. In order to notify VFs which attributes are supported, the VF uses QUERY_FUNC_CAP command. This command's mailbox is filled by the PF, which notifies which QP allocation attributes it supports. Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4_en: Set csum level for encapsulated packetsOr Gerlitz1-1/+2
This was dropped by mistake for the napi_gro_frags flow, fix that. Fixes: dd65beac48a5 ('net/mlx4_en: Extend usage of napi_gro_frags') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09net/mlx4_en: Support for configurable RSS hash functionEyal Perry1-1/+13
The ConnectX HW is capable of using one of the following hash functions: Toeplitz and an XOR hash function. This patch extends the implementation of the mlx4_en driver set/get_rxfh callbacks to support getting and setting the RSS hash function used by the device. Signed-off-by: Eyal Perry <eyalpe@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-23mlx4: fix mlx4_en_set_rxfh()Eric Dumazet1-2/+1
mlx4_en_set_rxfh() can crash if no RSS indir table is provided. While we are at it, allow RSS key to be changed with ethtool -X Tested: myhost:~# cat /proc/sys/net/core/netdev_rss_key b6:89:91:f3:b2:c3:c2:90:11:e8:ce:45:e8:a9:9d:1c:f2:f6:d4:53:61:8b:26:3a:b3:9a:57:97:c3:b6:79:4d:2e:d9:66:5c:72:ed:b6:8e:c5:5d:4d:8c:22:67:30:ab:8a:6e:c3:6a myhost:~# ethtool -x eth0 RX flow hash indirection table for eth0 with 8 RX ring(s): 0: 0 1 2 3 4 5 6 7 RSS hash key: b6:89:91:f3:b2:c3:c2:90:11:e8:ce:45:e8:a9:9d:1c:f2:f6:d4:53:61:8b:26:3a:b3:9a:57:97:c3:b6:79:4d:2e:d9:66:5c:72:ed:b6:8e myhost:~# ethtool -X eth0 hkey \ 03:0e:e2:43:fa:82:0e:73:14:2d:c0:68:21:9e:82:99:b9:84:d0:22:e2:b3:64:9f:4a:af:00:fa:cc:05:b4:4a:17:05:14:73:76:58:bd:2f myhost:~# ethtool -x eth0 RX flow hash indirection table for eth0 with 8 RX ring(s): 0: 0 1 2 3 4 5 6 7 RSS hash key: 03:0e:e2:43:fa:82:0e:73:14:2d:c0:68:21:9e:82:99:b9:84:d0:22:e2:b3:64:9f:4a:af:00:fa:cc:05:b4:4a:17:05:14:73:76:58:bd:2f Reported-by: Ben Hutchings <ben@decadent.org.uk> Fixes: b9d1ab7eb42e ("mlx4: use netdev_rss_key_fill() helper") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-16mlx4: use netdev_rss_key_fill() helperEric Dumazet1-5/+1
Use of well known RSS key increases attack surface. Switch to a random one, using generic helper so that all ports share a common key. Also provide ethtool -x support to fetch RSS key Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETEShani Michaeli1-6/+120
When processing received traffic, pass CHECKSUM_COMPLETE status to the stack, with calculated checksum for non TCP/UDP packets (such as GRE or ICMP). Although the stack expects checksum which doesn't include the pseudo header, the HW adds it. To address that, we are subtracting the pseudo header checksum from the checksum value provided by the HW. In the IPv6 case, we also compute/add the IP header checksum which is not added by the HW for such packets. Cc: Jerry Chu <hkchu@google.com> Signed-off-by: Shani Michaeli <shanim@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11net/mlx4_en: Extend usage of napi_gro_fragsShani Michaeli1-54/+54
We can call napi_gro_frags for all the received traffic regardless of the checksum status. Specifically, received packets whose status is CHECKSUM_NONE (and soon to be added CHECKSUM_COMPLETE) are eligible for napi_gro_frags as well. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Shani Michaeli <shanim@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11mlx4: restore conditional call to napi_complete_done()Eric Dumazet1-7/+8
After commit 1a28817282 ("mlx4: use napi_complete_done()") we ended up calling napi_complete_done() in the case NAPI poll consumed all its budget. This added extra interrupt pressure, this patch restores proper behavior. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 1a28817282 ("mlx4: use napi_complete_done()") Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-10mlx4: use napi_complete_done()Eric Dumazet1-7/+4
To enable gro_flush_timeout, a driver has to use napi_complete_done() instead of napi_complete(). Tested: Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues) Without this feature, we send back about 305,000 ACK per second. GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet) Setting a timer of 2000 nsec is enough to increase GRO packet sizes and reduce number of ACK packets. (811/19.2 = 42) Receiver performs less calls to upper stacks, less wakes up. This also reduces cpu usage on the sender, as it receives less ACK packets. Note that reducing number of wakes up increases cpu efficiency, but can decrease QPS, as applications wont have the chance to warmup cpu caches doing a partial read of RPC requests/answers if they fit in one skb. B:~# sar -n DEV 1 10 | grep eth0 | tail -1 Average: eth0 811269.80 305732.30 1199462.57 19705.72 0.00 0.00 0.50 B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout B:~# sar -n DEV 1 10 | grep eth0 | tail -1 Average: eth0 811577.30 19230.80 1199916.51 1239.80 0.00 0.00 0.50 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pagesIdo Shamay1-3/+4
Needed in order to get cache cold pages (L3 flushed) for HW scatter. Otherwise memory may flush those entries when the packet comes from PCI, causing back pressure resulting in BW decrease. Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03net/mlx4_en: Remove RX buffers alignment to IP_ALIGNIdo Shamay1-12/+4
When IP_ALIGN has a non zero value, hardware will write to a non aligned address. The only reader from this address is when copying the header from the first frag into the linear buffer (further access to the IP address will be from the linear buffer, in which the headers are aligned). Since the penalty of non align access by the hardware is greater than the software memcpy, changing the frag_align to always be 0. Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-30mlx4: use napi_schedule_irqoff()Eric Dumazet1-2/+2
mlx4_en_rx_irq() and mlx4_en_tx_irq() run from hard interrupt context. They can use napi_schedule_irqoff() instead of napi_schedule() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-By: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29net/mlx4_en: Cleanups suggested by clang static checkerJack Morgenstein1-1/+0
clang flagged the following. All are actually cosmetic cleanups, not really bugs: drivers/net/ethernet/mellanox/mlx4/en_main.c:233:3: warning: Value stored to 'err' is never read err = -ENOMEM; ^ ~~~~~~~ drivers/net/ethernet/mellanox/mlx4/en_main.c:293:3: warning: Value stored to 'err' is never read err = -ENOMEM; drivers/net/ethernet/mellanox/mlx4/en_netdev.c:648:16: warning: Assigned value is garbage or undefined entry->reg_id = reg_id; ^ ~~~~~~ drivers/net/ethernet/mellanox/mlx4/en_netdev.c:659:2: warning: Function call argument is an uninitialized value mlx4_en_uc_steer_release(priv, priv->dev->dev_addr, *qpn, reg_id); (NOTE: reg_id is only used in the device-managed flow steering path, in which is it always initialized. This is not a bug. Cleanup here is therefore cosmetic only). drivers/net/ethernet/mellanox/mlx4/en_rx.c:122:3: warning: Value stored to 'frag_info' is never read frag_info = &priv->frag_info[i]; ^ ~~~~~~~~~~~~~~~~~~~ Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-10mlx4: fix race accessing page->_countEric Dumazet1-3/+3
This is illegal to use atomic_set(&page->_count, ...) even if we 'own' the page. Other entities in the kernel need to use get_page_unless_zero() to get a reference to the page before testing page properties, so we could loose a refcount increment. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-20net/mlx4_en: Add mlx4_en_get_cqe helperIdo Shamay1-2/+2
This function derives the base address of the CQE from the CQE size, and calculates the real CQE context segment in it from the factor (this is like before). Before this change the code used the factor to calculate the base address of the CQE as well. The factor indicates in which segment of the cqe stride the cqe information is located. For 32-byte strides, the segment is 0, and for 64 byte strides, the segment is 1 (bytes 32..63). Using the factor was ok as long as we had only 32 and 64 byte strides. However, with larger strides, the factor is zero, and so cannot be used to calculate the base of the CQE. The helper uses the same method of CQE buffer pulling made by all other components that reads the CQE buffer (mlx4_ib driver and libmlx4). Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-06mlx4: only pull headers into skb headEric Dumazet1-5/+8
Use the new fancy eth_get_headlen() to pull exactly the headers into skb->head. This speeds up GRE traffic (or more generally tunneled traffuc), as GRO can aggregate up to 17 MSS per GRO packet instead of 8. (Pulling too much data was forcing GRO to keep 2 frags per MSS) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-30mlx4: Set skb->csum_level for encapsulated checksumTom Herbert1-3/+3
Set skb->csum_level instead of skb->encapsulation when indicating CHECKSUM_UNNECESSARY for an encapsulated checksum. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-23net/mlx4_en: Reduce memory consumption on kdump kernelAmir Vadai1-2/+3
When memory is limited, reduce number of rx and tx rings. Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-3/+14
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15mlx4: mark napi id for gro_skbJason Wang1-0/+1
Napi id was not marked for gro_skb, this will lead rx busy loop won't work correctly since they stack never try to call low latency receive method because of a zero socket napi id. Fix this by marking napi id for gro_skb. The transaction rate of 1 byte netperf tcp_rr gets about 50% increased (from 20531.68 to 30610.88). Cc: Amir Vadai <amirv@mellanox.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-09net/mlx4_en: Do not count LLC/SNAP in MTU calculationYishai Hadas1-1/+1
LLC/SNAP 8 bytes should not be added as part of header calculation. If used, payload will be decreased accordingly. For MTU of 1500 we'll set 1522 instead of 1523. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Reviewed-by: Liran Liss <liranl@mellanox.com> Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-03net/mlx4_en: Don't use irq_affinity_notifier to track changes in IRQ ↵Amir Vadai1-3/+13
affinity map IRQ affinity notifier can only have a single notifier - cpu_rmap notifier. Can't use it to track changes in IRQ affinity map. Detect IRQ affinity changes by comparing CPU to current IRQ affinity map during NAPI poll thread. CC: Thomas Gleixner <tglx@linutronix.de> CC: Ben Hutchings <ben@decadent.org.uk> Fixes: 2eacc23 ("net/mlx4_core: Enforce irq affinity changes immediatly") Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-12/+16
Pull networking updates from David Miller: 1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov. 2) Multiqueue support in xen-netback and xen-netfront, from Andrew J Benniston. 3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn Mork. 4) BPF now has a "random" opcode, from Chema Gonzalez. 5) Add more BPF documentation and improve test framework, from Daniel Borkmann. 6) Support TCP fastopen over ipv6, from Daniel Lee. 7) Add software TSO helper functions and use them to support software TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia. 8) Support software TSO in fec driver too, from Nimrod Andy. 9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli. 10) Handle broadcasts more gracefully over macvlan when there are large numbers of interfaces configured, from Herbert Xu. 11) Allow more control over fwmark used for non-socket based responses, from Lorenzo Colitti. 12) Do TCP congestion window limiting based upon measurements, from Neal Cardwell. 13) Support busy polling in SCTP, from Neal Horman. 14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru. 15) Bridge promisc mode handling improvements from Vlad Yasevich. 16) Don't use inetpeer entries to implement ID generation any more, it performs poorly, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits) rtnetlink: fix userspace API breakage for iproute2 < v3.9.0 tcp: fixing TLP's FIN recovery net: fec: Add software TSO support net: fec: Add Scatter/gather support net: fec: Increase buffer descriptor entry number net: fec: Factorize feature setting net: fec: Enable IP header hardware checksum net: fec: Factorize the .xmit transmit function bridge: fix compile error when compiling without IPv6 support bridge: fix smatch warning / potential null pointer dereference via-rhine: fix full-duplex with autoneg disable bnx2x: Enlarge the dorq threshold for VFs bnx2x: Check for UNDI in uncommon branch bnx2x: Fix 1G-baseT link bnx2x: Fix link for KR with swapped polarity lane sctp: Fix sk_ack_backlog wrap-around problem net/core: Add VF link state control policy net/fsl: xgmac_mdio is dependent on OF_MDIO net/fsl: Make xgmac_mdio read error message useful net_sched: drr: warn when qdisc is not work conserving ...
2014-06-03IB/mlx4: Implement IB_QP_CREATE_USE_GFP_NOIOJiri Kosina1-3/+3
Modify the various routines used to allocate memory resources which serve QPs in mlx4 to get an input GFP directive. Have the Ethernet driver to use GFP_KERNEL in it's QP allocations as done prior to this commit, and the IB driver to use GFP_NOIO when the IB verbs IB_QP_CREATE_USE_GFP_NOIO QP creation flag is provided. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-05-14net/mlx4_core: Enforce irq affinity changes immediatlyYuval Atias1-2/+9
During heavy traffic, napi is constatntly polling the complition queue and no interrupt is fired. Because of that, changes to irq affinity are ignored until traffic is stopped and resumed. By registering to the irq notifier mechanism, and forcing interrupt when affinity is changed, irq affinity changes will be immediatly enforced. Signed-off-by: Yuval Atias <yuvala@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-09mellanox: Logging message cleanupsJoe Perches1-10/+7
Use a more current logging style. o Coalesce formats o Add missing spaces for coalesced formats o Align arguments for modified formats o Add missing newlines for some logging messages o Use DRV_NAME as part of format instead of %s, DRV_NAME to reduce overall text. o Use ..., ##__VA_ARGS__ instead of args... in macros o Correct a few format typos o Use a single line message where appropriate Signed-off-by: Joe Perches <joe@perches.com> Acked-By: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-15mlx4: Don't receive packets when the napi budget == 0Eric W. Biederman1-0/+3
Processing any incoming packets with a with a napi budget of 0 is incorrect driver behavior. This matters as netpoll will shortly call drivers with a budget of 0 to avoid receive packet processing happening in hard irq context. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-25net/mlx4: Fix limiting number of IRQ's instead of RSS queuesIdo Shamay1-1/+4
This fix a performance bug introduced by commit 90b1ebe "mlx4: set maximal number of default RSS queues", which limits the numbers of IRQs opened by core module. The limit should be on the number of queues in the indirection table - rx_rings, and not on the number of IRQ's. Also, limiting on mlx4_core initialization instead of in mlx4_en, prevented using "ethtool -L" to utilize all the CPU's, when performance mode is prefered, since limiting this number to 8 reduces overall packet rate by 15%-50% in multiple TCP streams applications. For example, after running ethtool -L <ethx> rx 16 Packet rate Before the fix 897799 After the fix 1142070 Results were obtained using netperf: S=200 ; ( for i in $(seq 1 $S) ; do ( \ netperf -H 11.7.13.55 -t TCP_RR -l 30 &) ; \ wait ; done | grep "1 1" | awk '{SUM+=$6} END {print SUM}' ) CC: Yuval Mintz <yuvalmin@broadcom.com> Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-25net/mlx4: Set number of RX rings in a utility functionIdo Shamay1-0/+22
mlx4_en_add() is too long. Moving set number of RX rings to a utiltity function to improve readability and modulization of the code. Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13net/mlx4_en: call gro handler for encapsulated framesEric Dumazet1-3/+5
In order to use the native GRO handling of encapsulated protocols on mlx4, we need to call napi_gro_receive() instead of netif_receive_skb() unless busy polling is in action. While we are at it, rename mlx4_en_cq_ll_polling() to mlx4_en_cq_busy_polling() Tested with GRE tunnel : GRO aggregation is now performed on the ethernet device instead of being done later on gre device. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Amir Vadai <amirv@mellanox.com> Cc: Jerry Chu <hkchu@google.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Acked-By: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31net/mlx4_en: Add netdev support for TCP/IP offloads of vxlan tunnelingOr Gerlitz1-0/+14
When the device tunneling offloads mode is vxlan do the following - call SET_PORT with the relevant setting - add DMFS steering vxlan rule for the device self and multicast mac addresses of the form: {<ETH, outer-mac> <VXLAN, ANY vnid> <ETH, ANY mac>} --> RSS QP - set relevant QPC fields in RSS context and RX ring QPs - in TX flow, set WQE fields to generate HW checksum, and handle gso skbs which are marked for encapsulation such that the HW will segment them properly. - in RX flow, read HW offloaded checksum for encapsulated packets from the CQE - advertize hw_enc_features and NETIF_F_GSO_UDP_TUNNEL to the networking stack Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19net: mlx4 calls skb_set_hashTom Herbert1-2/+6
Drivers should call skb_set_hash to set the hash and its type in an skbuff. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08net/mlx4_en: Datapath structures are allocated per NUMA nodeEugenia Emantayev1-7/+16
For each RX/TX ring and its CQ, allocation is done on a NUMA node that corresponds to the core that the data structure should operate on. The assumption is that the core number is reflected by the ring index. The affected allocations are the ring/CQ data structures, the TX/RX info and the shared HW/SW buffer. For TX rings, each core has rings of all UPs. Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.com> Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08net/mlx4_en: Datapath resources allocated dynamicallyEugenia Emantayev1-19/+38
Currently all TX/RX rings and completion queues are part of the netdev priv structure and are allocated statically. This patch will change the priv to hold only arrays of pointers and therefore all TX/RX rings and completetion queues will be allocated dynamically. This is in preparation for NUMA aware allocations. Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.com> Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09net/mlx4_en: Fix pages never dma unmapped on rxAmir Vadai1-2/+3
This patch fixes a bug introduced by commit 51151a16 (mlx4: allow order-0 memory allocations in RX path). dma_unmap_page never reached because condition to detect last fragment in page is wrong. offset+frag_stride can't be greater than size, need to make sure no additional frag will fit in page => compare offset + frag_stride + next_frag_size instead. next_frag_size is the same as the current one, since page is shared only with frags of the same size. CC: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09net/mlx4_en: Rename name of mlx4_en_rx_alloc membersAmir Vadai1-17/+23
Add page prefix to page related members: @size and @offset into @page_size and @page_offset CC: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-11net: rename ll methods to busy-pollEliezer Tamir1-1/+1
Rename ndo_ll_poll to ndo_busy_poll. Rename sk_mark_ll to sk_mark_napi_id. Rename skb_mark_ll to skb_mark_napi_id. Correct all useres of these functions. Update comments and defines in include/net/busy_poll.h Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-11net: rename include/net/ll_poll.h to include/net/busy_poll.hEliezer Tamir1-1/+1
Rename the file and correct all the places where it is included. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26mlx4: allow order-0 memory allocations in RX pathEric Dumazet1-80/+89
Signed-off-by: Eric Dumazet <edumazet@google.com> mlx4 exclusively uses order-2 allocations in RX path, which are likely to fail under memory pressure. We therefore drop frames more than needed. This patch tries order-3, order-2, order-1 and finally order-0 allocations to keep good performance, yet allow allocations if/when memory gets fragmented. By using larger pages, and avoiding unnecessary get_page()/put_page() on compound pages, this patch improves performance as well, lowering false sharing on struct page. Also use GFP_KERNEL allocations in initialization path, as allocating 12 MB (390 order-3 pages) can easily fail with GFP_ATOMIC. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Amir Vadai <amirv@mellanox.com> Acked-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-20net/mlx4_en: Add Low Latency Socket (LLS) supportAmir Vadai1-2/+13
Add basic support for LLS. Signed-off-by: Amir Vadai <amirv@mellanox.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-25net/mlx4_en: Add HW timestamping (TS) supportAmir Vadai1-5/+24
The patch allows to enable/disable HW timestamping for incoming and/or outgoing packets. It adds and initializes all structs and callbacks needed by kernel TS API. To enable/disable HW timestamping appropriate ioctl should be used. Currently HWTSTAMP_FILTER_ALL/NONE and HWTSAMP_TX_ON/OFF only are supported. When enabling TS on receive flow - VLAN stripping will be disabled. Also were made all relevant changes in RX/TX flows to consider TS request and plant HW timestamps into relevant structures. mlx4_ib was fixed to compile with new mlx4_cq_alloc() signature. Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19net: vlan: add protocol argument to packet tagging functionsPatrick McHardy1-2/+2
Add a protocol argument to the VLAN packet tagging functions. In case of HW tagging, we need that protocol available in the ndo_start_xmit functions, so it is stored in a new field in the skb. The new field fits into a hole (on 64 bit) and doesn't increase the sks's size. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-28hlist: drop the node parameter from iteratorsSasha Levin1-2/+2
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-09drivers: net: Remove remaining alloc/OOM messagesJoe Perches1-4/+2
alloc failures already get standardized OOM messages and a dump_stack. For the affected mallocs around these OOM messages: Converted kmallocs with multiplies to kmalloc_array. Converted a kmalloc/memcpy to kmemdup. Removed now unused stack variables. Removed unnecessary parentheses. Neatened alignment. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Arend van Spriel <arend@broadcom.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Acked-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-08net/mlx4_en: Manage hash of MAC addresses per portYan Burman1-4/+19
As a preparation step for supporting multiple unicast addresses, store MAC addresses in hash table. Remove the radix tree for MAC addresses per QP, as it's not in use. Signed-off-by: Yan Burman <yanb@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-08net/mlx4_en: Optimize Rx fast path filter checksYan Burman1-3/+2
Currently, RX path code that does RX filtering is not optimized and does an expensive conversion. In order to use ether_addr_equal_64bits which is optimized for such cases, we need the MAC address kept by the device to be in the form of unsigned char array instead of u64. Store the MAC address as unsigned char array and convert to/from u64 out of the fast path when needed. Side effect of this is that we no longer need priv->mac, since it's the same as dev->dev_addr. This optimization was suggested by Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Yan Burman <yanb@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-08net/mlx4_en: Optimize loopback related checks in data pathYan Burman1-18/+21
Currently there are relatively complex conditional checks in the fast path, for TX loopback enabling and resulting RX filter logic. Move elaborate if's out of data path, replace them with a single flag for each state and update that state from appropriate places. Also, in native (non SRIOV) mode and not in loopback or in selftest, there is no need to try and filter out packets that HW loopback-ed, as in native mode we do not loopback packets anymore. Signed-off-by: Yan Burman <yanb@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-14Merge tag 'rdma-for-linus' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband Pull infiniband upate from Roland Dreier: "First batch of InfiniBand/RDMA changes for the 3.8 merge window: - A good chunk of Bart Van Assche's SRP fixes - UAPI disintegration from David Howells - mlx4 support for "64-byte CQE" hardware feature from Or Gerlitz - Other miscellaneous fixes" Fix up trivial conflict in mellanox/mlx4 driver. * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (33 commits) RDMA/nes: Fix for crash when registering zero length MR for CQ RDMA/nes: Fix for terminate timer crash RDMA/nes: Fix for BUG_ON due to adding already-pending timer IB/srp: Allow SRP disconnect through sysfs srp_transport: Document sysfs attributes srp_transport: Simplify attribute initialization code srp_transport: Fix attribute registration IB/srp: Document sysfs attributes IB/srp: send disconnect request without waiting for CM timewait exit IB/srp: destroy and recreate QP and CQs when reconnecting IB/srp: Eliminate state SRP_TARGET_DEAD IB/srp: Introduce the helper function srp_remove_target() IB/srp: Suppress superfluous error messages IB/srp: Process all error completions IB/srp: Introduce srp_handle_qp_err() IB/srp: Simplify SCSI error handling IB/srp: Keep processing commands during host removal IB/srp: Eliminate state SRP_TARGET_CONNECTING IB/srp: Increase block layer timeout RDMA/cm: Change return value from find_gid_port() ...
2012-11-26mlx4: 64-byte CQE/EQE supportOr Gerlitz1-2/+3
ConnectX-3 devices can use either 64- or 32-byte completion queue entries (CQEs) and event queue entries (EQEs). Using 64-byte EQEs/CQEs performs better because each entry is aligned to a complete cacheline. This patch queries the HCA's capabilities, and if it supports 64-byte CQEs and EQES the driver will configure the HW to work in 64-byte mode. The 32-byte vs 64-byte mode is global per HCA and not per CQ or EQ. Since this mode is global, userspace (libmlx4) must be updated to work with the configured CQE size, and guests using SR-IOV virtual functions need to know both EQE and CQE size. In case one of the 64-byte CQE/EQE capabilities is activated, the patch makes sure that older guest drivers that use the QUERY_DEV_FUNC command (e.g as done in mlx4_core of Linux 3.3..3.6) will notice that they need an update to be able to work with the PPF. This is done by changing the returned pf_context_behaviour not to be zero any more. In case none of these capabilities is activated that value remains zero and older guest drivers can run OK. The SRIOV related flow is as follows 1. the PPF does the detection of the new capabilities using QUERY_DEV_CAP command. 2. the PPF activates the new capabilities using INIT_HCA. 3. the VF detects if the PPF activated the capabilities using QUERY_HCA, and if this is the case activates them for itself too. Note that the VF detects that it must be aware to the new PF behaviour using QUERY_FUNC_CAP. Steps 1 and 2 apply also for native mode. User space notification is done through a new field introduced in struct mlx4_ib_ucontext which holds device capabilities for which user space must take action. This changes the binary interface so the ABI towards libmlx4 exposed through uverbs is bumped from 3 to 4 but only when **needed** i.e. only when the driver does use 64-byte CQEs or future device capabilities which must be in sync by user space. This practice allows to work with unmodified libmlx4 on older devices (e.g A0, B0) which don't support 64-byte CQEs. In order to keep existing systems functional when they update to a newer kernel that contains these changes in VF and userspace ABI, a module parameter enable_64b_cqe_eqe must be set to enable 64-byte mode; the default is currently false. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>