Age | Commit message | Author | Files | Lines
2018-04-19 | net/ipv6: Rename addrconf_dst_alloc | David Ahern | 4 | -44/+44
addrconf_dst_alloc now returns a fib6_info. Update the name and its users to reflect the change. Rename only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | net/ipv6: Rename fib6_info struct elements | David Ahern | 8 | -298/+298
Change the prefix for fib6_info struct elements from rt6i_ to fib6_. rt6i_pcpu and rt6i_exception_bucket are left as is given that they point to rt6_info entries. Rename only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends | Eric Dumazet | 2 | -3/+16
After working on IP defragmentation lately, I found that some large packets defeat CHECKSUM_COMPLETE optimization because of NIC adding zero paddings on the last (small) fragment. While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed to CHECKSUM_NONE, forcing a full csum validation, even if all prior fragments had CHECKSUM_COMPLETE set. We can instead compute the checksum of the part we are trimming, usually smaller than the part we keep. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
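The idea of checksumming only the trimmed part can be illustrated with a small user-space model. This is a sketch of the ones' complement arithmetic only, not the kernel code; the helper names `fold16`, `csum`, `csum_after_trim`, and `same_csum` are invented for this illustration.

```python
def fold16(total: int) -> int:
    """Fold carries back in until the value fits in 16 bits."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum(data: bytes) -> int:
    """16-bit ones' complement sum of big-endian words (not inverted)."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd-length buffer with one zero byte
    total = sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
    return fold16(total)


def csum_after_trim(full_csum: int, data: bytes, new_len: int) -> int:
    """Derive the checksum of data[:new_len] from the checksum of all of data.

    Only the trimmed tail is summed, which is the cheap path when the tail
    (for example NIC padding) is much shorter than the part that is kept.
    """
    tail = csum(data[new_len:])
    if new_len % 2:
        # The tail starts at an odd offset, so its bytes sit in the opposite
        # halves of the 16-bit words; swap the two bytes to compensate.
        tail = ((tail & 0xFF) << 8) | (tail >> 8)
    # Ones' complement subtraction: add the bitwise complement, then fold.
    return fold16(full_csum + (~tail & 0xFFFF))


def same_csum(a: int, b: int) -> bool:
    """Compare two sums, treating 0x0000 and 0xFFFF as the same value."""
    return a % 0xFFFF == b % 0xFFFF
```

For a 61-byte fragment padded by three zero bytes, `csum_after_trim(csum(buf), buf, 61)` sums 3 bytes instead of 61 and agrees with `csum(buf[:61])` under `same_csum`.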
2018-04-19 | net-next/hinic: add arm64 support | Zhao Chen | 1 | -1/+1
This patch enables arm64 platform support for the HINIC driver. Signed-off-by: Zhao Chen <zhaochen6@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | Merge branch 'TCP-data-delivery-and-ECN-stats-tracking' | David S. Miller | 6 | -6/+45
Yuchung Cheng says:
====================
tracking TCP data delivery and ECN stats

This patch series improves tracking of the data delivery status:
1. minor improvement on SYN data
2. accounting bytes delivered with CE marks
3. exporting the delivery stats to applications, so that users can get a better sense of TCP performance at the per-host, per-connection, and even per-application-message level.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | tcp: export packets delivery info | Yuchung Cheng | 5 | -2/+20
Export data delivered and delivered with CE marks to:
1) SNMP TCPDelivered and TCPDeliveredCE
2) getsockopt(TCP_INFO)
3) Timestamping API SOF_TIMESTAMPING_OPT_STATS
Note that for SCM_TSTAMP_ACK, the delivery info in SOF_TIMESTAMPING_OPT_STATS is reported before the info was fully updated on the ACK. These stats help applications monitor TCP delivery and ECN status at the per-host, per-connection, and even per-message level. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | tcp: track total bytes delivered with ECN CE marks | Yuchung Cheng | 3 | -0/+4
Introduce a new delivered_ce stat in the TCP socket to estimate the number of packets marked with CE bits. The estimation is done via ACKs with the ECE bit set. Depending on the actual receiver behavior, the estimate could be biased. Since the TCP sender can't really see the CE bit in the data path, the sender is technically counting packets marked delivered with the "ECE / ECN-Echo" flag set. With RFC3168 ECN, because the ECE bit is sticky, this count can drastically overestimate the number of CE-marked data packets. With DCTCP-style ECN this should be reasonably precise unless there is loss in the ACK path, in which case it's not precise. With the AccECN proposal this can be made still more precise, even in the case of some degree of ACK loss. However, this is the sender's best estimate of CE information. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
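The overestimate under sticky ECE can be seen in a toy model of the sender-side accounting. This is only a sketch of the counting rule described above (packets newly delivered by an ACK are charged to delivered_ce when that ACK carries ECE); `SenderStats` and `on_ack` are hypothetical names, not kernel symbols.

```python
from dataclasses import dataclass


@dataclass
class SenderStats:
    delivered: int = 0      # packets the sender believes were delivered
    delivered_ce: int = 0   # of those, packets acknowledged by an ECE ACK


def on_ack(stats: SenderStats, newly_acked: int, ece: bool) -> int:
    """Account one incoming ACK and return the newly delivered count."""
    prior = stats.delivered
    stats.delivered += newly_acked
    newly_delivered = stats.delivered - prior
    if ece:
        # The sender cannot see the CE bit itself, only the echoed ECE flag.
        stats.delivered_ce += newly_delivered
    return newly_delivered


if __name__ == "__main__":
    stats = SenderStats()
    # Suppose one data packet was CE-marked, but an RFC3168 receiver keeps
    # ECE set on three consecutive ACKs (2 packets each) until it sees CWR.
    for ece in (True, True, True, False):
        on_ack(stats, newly_acked=2, ece=ece)
    print(stats)
```

In this scenario the model charges six packets to `delivered_ce` even though a single packet carried the CE mark, which is the bias the commit message warns about.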
2018-04-19 | tcp: new helper to calculate newly delivered | Yuchung Cheng | 1 | -2/+15
Add new helper tcp_newly_delivered() to prepare the ECN accounting change. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | tcp: better delivery accounting for SYN-ACK and SYN-data | Yuchung Cheng | 1 | -3/+7
tcp_sock:delivered has inconsistent accounting for SYN and FIN:
1. it counts pure FIN
2. it counts pure SYN
3. it counts SYN-data twice
4. it does not count SYN-ACK
From a congestion control perspective this does not matter much, as C.C. only cares about the difference, not the absolute value. But the next patch would export this field to user-space, so it's better to report the absolute value w/o these caveats. This patch counts SYN, SYN-ACK, or SYN-data delivery once always in the "delivered" field. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | ipv6: frags: fix a lockdep false positive | Eric Dumazet | 1 | -11/+12
lockdep does not know that the locks used by IPv4 defrag and IPv6 reassembly units are of different classes. It complains because of the following chains:

1) sch_direct_xmit() (lock txq->_xmit_lock)
    dev_hard_start_xmit()
     xmit_one()
      dev_queue_xmit_nit()
       packet_rcv_fanout()
        ip_check_defrag()
         ip_defrag()
          spin_lock() (lock frag queue spinlock)

2) ip6_input_finish()
    ipv6_frag_rcv() (lock frag queue spinlock)
     ip6_frag_queue()
      icmpv6_param_prob() (lock txq->_xmit_lock at some point)

We could add lockdep annotations, but we can also make sure IPv6 calls icmpv6_param_prob() only after the release of the frag queue spinlock, since this naturally makes the frag queue spinlock a leaf in the lock hierarchy. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
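The "report the error only after dropping the lock" pattern can be sketched in user space. This is a generic illustration using a Python `threading.Lock`, not the IPv6 reassembly code; `FragQueue`, `queue_fragment`, the `"malformed"` key, and the `report_error` callback are all hypothetical.

```python
import threading


class FragQueue:
    """Minimal stand-in for a fragment queue protected by one lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.fragments = []


def queue_fragment(fq, frag, report_error):
    """Queue a fragment; defer error reporting until the lock is released.

    Because report_error() may take other locks (the transmit lock in the
    commit's scenario), it is never called while fq.lock is held. That keeps
    the queue lock a leaf in the lock hierarchy.
    """
    pending_error = None
    with fq.lock:
        if frag.get("malformed"):
            pending_error = frag       # remember the problem, act on it later
        else:
            fq.fragments.append(frag)
    if pending_error is not None:
        report_error(pending_error)    # runs with fq.lock released
    return pending_error is None
```

The design choice is to record what went wrong inside the critical section and act on it afterwards, rather than annotating the lock classes.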
2018-04-19 | hv_netvsc: Add NetVSP v6 and v6.1 into version negotiation | Haiyang Zhang | 2 | -1/+166
This patch adds the NetVSP v6 and 6.1 message structures, and includes these versions into NetVSC/NetVSP version negotiation process. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | hv_netvsc: propogate Hyper-V friendly name into interface alias | Stephen Hemminger | 2 | -0/+29
This patch implements the 'Device Naming' feature of the Hyper-V network device API. In Hyper-V, it is possible on the host, through the GUI or PowerShell, to enable the device naming feature, which causes the host to make the name of the device available to the guest. This shows up in the RNDIS protocol as the friendly name. The name has no particular meaning and is limited to 256 characters. The value can only be set via PowerShell on the host, but could be scripted for mass deployments. The default value is the string 'Network Adapter', and since that is the same for all devices and useless, the driver ignores it. In Windows, the value goes into a registry key for use in SNMP ifAlias. For Linux, this patch puts the value in the network device alias property, where it is visible in ip tools and SNMP. The host-provided ifAlias is just a suggestion, and can be overridden by later ip commands. Also requires exporting dev_set_alias in netdev core. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
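The rule for the default name can be written as a small filter. This is a sketch of the decision described above only; `alias_from_friendly_name` is a hypothetical helper and does not model the RNDIS query or the call to dev_set_alias.

```python
from typing import Optional

# Default friendly name that the commit message says the driver ignores.
DEFAULT_FRIENDLY_NAME = "Network Adapter"


def alias_from_friendly_name(name: str) -> Optional[str]:
    """Return the alias to suggest for the interface, or None to leave it unset."""
    if not name or name == DEFAULT_FRIENDLY_NAME:
        return None  # empty or default name carries no information
    return name
```

A host-provided name such as "prod-db-uplink" would be passed through unchanged, while the default string results in no alias being set.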
2018-04-19 | Merge branch 'r8169-series-with-further-smaller-improvements' | David S. Miller | 1 | -211/+133
Heiner Kallweit says:
====================
r8169: series with further smaller improvements

This series includes further smaller improvements. Then I think the basic cleanup has been done and next step would be preparing the switch to phylib.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: remove jumbo_tx_csum from chip config struct | Heiner Kallweit | 1 | -79/+54
According to the chip configuration entries only RTL8169 (ver <= 06) supports tx checksumming for jumbo packets. By the way: constant JUMBO_1K is a little misleading because it refers to the standard packet size and not to a jumbo packet size. By implementing this rule we can get rid of configuring tx checksumming support per chip type. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: improve pci region handling | Heiner Kallweit | 1 | -11/+5
The region to be used is always the first of type IORESOURCE_MEM. We can implement this rule directly w/o having to specify which region is the first one per configuration entry. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: drop member txd_version from struct rtl8169_private | Heiner Kallweit | 1 | -5/+7
txd_version is used in rtl_init_one() only, so we can drop member txd_version from struct rtl8169_private. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: improve rtl8169_get_mac_version | Heiner Kallweit | 1 | -11/+1
Certain entries in array mac_info[] are redundant, so remove them:
0x7cf, 0x2c200000 (VER 33): matched by entry 0x7c8, 0x2c000000
0x7cf, 0x28300000 (VER 26): matched by entry 0x7c8, 0x28000000
0x7cf, 0x3cb00000 (VER 24): matched by entry 0x7c8, 0x3c800000
0x7cf, 0x3c400000 (VER 22): matched by entry 0x7c8, 0x3c000000
0x7cf, 0x38500000 (VER 17): matched by entry 0x7c8, 0x38000000
0x7cf, 0x44900000 (VER 39): matched by entry 0x7c8, 0x44800000
0x7cf, 0x40b00000 (VER 30): matched by entry 0x7c8, 0x40800000
0x7cf, 0x40a00000 (VER 30): matched by entry 0x7c8, 0x40800000
0x7cf, 0x34a00000 (VER 09): matched by entry 0x7c8, 0x34800000
0x7cf, 0x24a00000 (VER 09): matched by entry 0x7c8, 0x24800000
In addition don't mask out bits 30 and 29 when printing the XID. Most likely this is a relic from the times when the driver covered the RTL8169 chip version only. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
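The "matched by" claims can be checked with plain bit arithmetic. The sketch below assumes that the shorthand masks 0x7cf and 0x7c8 stand for the 32-bit masks 0x7cf00000 and 0x7c800000; that expansion is an assumption of this example. It verifies only that each removed value falls under the broader entry's mask, not the order of entries or the version each one maps to.

```python
# Assumed 32-bit expansions of the shorthand masks in the commit message.
SPECIFIC_MASK = 0x7CF00000
GENERAL_MASK = 0x7C800000

# (value of removed specific entry, value of the broader entry that matches it)
REMOVED = [
    (0x2C200000, 0x2C000000),
    (0x28300000, 0x28000000),
    (0x3CB00000, 0x3C800000),
    (0x3C400000, 0x3C000000),
    (0x38500000, 0x38000000),
    (0x44900000, 0x44800000),
    (0x40B00000, 0x40800000),
    (0x40A00000, 0x40800000),
    (0x34A00000, 0x34800000),
    (0x24A00000, 0x24800000),
]


def covered(specific_val: int, general_val: int) -> bool:
    """True if any XID matching the specific entry also matches the general one.

    GENERAL_MASK only uses bits that are also in SPECIFIC_MASK, so every XID
    with (xid & SPECIFIC_MASK) == specific_val has the same value under
    GENERAL_MASK, namely specific_val & GENERAL_MASK.
    """
    return (specific_val & GENERAL_MASK) == general_val


if __name__ == "__main__":
    for specific, general in REMOVED:
        print(f"{specific:#010x} -> {general:#010x}: {covered(specific, general)}")
```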
2018-04-19 | r8169: don't display tp->mmio_addr address | Heiner Kallweit | 1 | -2/+2
For security reasons since commit ad67b74d2469 "printk: hash addresses printed with %p" %p doesn't display the full address any longer. We could switch to %px, but I think the pointer address doesn't provide a real benefit, so remove printing the hashed address. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: drop member opts1_mask from struct rtl8169_private | Heiner Kallweit | 1 | -10/+8
We can get rid of member opts1_mask and in addition save a few cpu cycles in the hot path of rtl_rx(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: change interrupt handler argument type | Heiner Kallweit | 1 | -4/+3
Code can be a little simplified by switching the interrupt handler argument type to struct rtl8169_private *. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: change argument type of counters handling functions | Heiner Kallweit | 1 | -19/+13
The counter handling functions don't deal with the net_device, so code can be simplified by changing the argument type to struct rtl8169_private *. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: change hw_start argument type | Heiner Kallweit | 1 | -26/+15
Code can be simplified by changing the argument type of hw_start callbacks from struct net_device * to struct rtl8169_private *. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: remove rtl8169_map_to_asic | Heiner Kallweit | 1 | -7/+2
This function is very simple and used only once, so we can inline the two statements. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: replace rx_buf_sz with a constant | Heiner Kallweit | 1 | -19/+18
rx_buf_sz is constant, so we don't have to pass it as a parameter and in general can replace it with a constant. When working on this I noticed that, as before, rtl_set_rx_max_size() sets a value of 0x4000, which is not in line with the chip spec. According to the spec only bits 0..13 are used, and we therefore set an effective value of zero. However, the driver still seems to work and due to potential side effects I'm reluctant to make a change. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
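The "effective value of zero" follows directly from the bit width. A minimal worked check, assuming (as the message states) that only bits 0..13 of the register are significant; `effective_rx_max_size` is a hypothetical helper for this illustration.

```python
RX_MAX_SIZE_BITS = 14  # bits 0..13 are used, per the commit message


def effective_rx_max_size(value: int) -> int:
    """Keep only the bits the chip is said to interpret."""
    return value & ((1 << RX_MAX_SIZE_BITS) - 1)
```

0x4000 is exactly bit 14, the first bit outside the field, so masking it with 0x3fff leaves nothing.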
2018-04-19 | r8169: remove unneeded check in rtl8169_rx_fill | Heiner Kallweit | 1 | -3/+0
rtl8169_rx_fill() is called only once, and directly before the call the array tp->Rx_databuff[] is filled with zeros. Therefore we don't need this check. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: improve rtl8169_init_ring | Heiner Kallweit | 1 | -6/+4
This function doesn't use the net_device, therefore change the parameter to type struct rtl8169_private * to simplify the code. In addition we don't need the calculations in the memset statements, we can use the size of the arrays directly. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: simplify rtl8169_alloc_rx_data | Heiner Kallweit | 1 | -2/+1
dev->dev.parent has the same value as tp_to_dev(tp) (set by SET_NETDEV_DEV() in rtl_init_one()) and we know it can't be NULL. This allows us to simplify the code. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: switch to napi_schedule_irqoff | Heiner Kallweit | 1 | -1/+1
napi_schedule() is called from hard irq context, so we can switch to napi_schedule_irqoff() and avoid some overhead. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: use constant NAPI_POLL_WAIT | Heiner Kallweit | 1 | -2/+1
We can use generic constant NAPI_POLL_WAIT instead of defining an own constant for the same value. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: use skb_copy_to_linear_data in rtl8169_try_rx_copy | Heiner Kallweit | 1 | -1/+1
Not a giant leap for mankind, but let's avoid the open-coded memcpy and use standard helper skb_copy_to_linear_data instead. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: remove member align from struct rtl_cfg_info | Heiner Kallweit | 1 | -4/+0
Since commit 6f0333b8fde4 "r8169: use 50% less ram for RX ring" member align isn't used any longer, so remove it. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | r8169: remove unused member features from struct | Heiner Kallweit | 1 | -2/+0
Member features of struct rtl8169_private isn't used any longer since commit 6c6aa15fdea5 "r8169: improve interrupt handling", so remove it. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | Merge branch 'netcp-K2G-SoC-support' | David S. Miller | 9 | -53/+297
Murali Karicheri says:
====================
Add support for netcp driver on K2G SoC

K2G SoC is another variant of the Keystone family of SoCs. This patch series adds support for the NetCP driver on this SoC. The QMSS found on K2G SoC is a cut-down version of the QMSS found on other Keystone devices, with fewer queues, less internal link RAM, etc.

The patch series has 2 patch sets that go into drivers/soc; the rest have to be applied to the net subsystem. Please review and merge if this looks good.

K2G TRM is located at http://www.ti.com/lit/ug/spruhy8g/spruhy8g.pdf

Thanks

The boot logs on K2G ICE board (tftp boot over Ethernet and from mmc)
https://pastebin.ubuntu.com/p/yvZ6drFhkW/
The boot logs on K2G GP board (tftp boot over Ethernet and from mmc)
https://pastebin.ubuntu.com/p/QTr6K7s4Zp/
Also regression-tested boot on K2HK and K2L EVMs, as we have modified the GBE version detection logic (K2E uses the same version of NetCP as K2L, so regression testing on one of them is needed). Boot logs on K2L and K2HK EVMs are at
https://pastebin.ubuntu.com/p/N9DBdPjbvR/

This series applies to net-next master branch.

Change history:
v4 - ready for merge to net-next
     Folded the series "Add promiscous mode support in k2g network driver" into this.
     Fixed a typo in 5/11 (sgmii to rgmii) based on TI internal comment
     Reworked 4/11 and title changed to reflect additional changes to exclude sgmii configuration code for 2U cpsw. Use IS_SS_ID_2U() macro for customization.
     Added Reviewed-by from Rob Herring against 1/13
v3 - Addressed comments from Andrew Lunn and Grygorii Strashko against v2.
v2 - Addressed following comments on initial version
     - split patch 3/5 to multiple patches from Andrew Lunn
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | net: netcp: ethss: k2g: add promiscuous mode support | WingMan Kwok | 1 | -0/+56
This patch adds support for promiscuous mode in k2g's network driver. When upper layer instructs to transition from non-promiscuous mode to promiscuous mode or vice versa K2G network driver needs to configure ALE accordingly so that in case of non-promiscuous mode, ALE will not flood all unicast packets to host port, while in promiscuous mode, it will pass all received unicast packets to host port. Signed-off-by: WingMan Kwok <w-kwok2@ti.com> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | net: netcp: add api to support set rx mode in netcp modules | WingMan Kwok | 2 | -0/+20
This patch adds an API to support setting rx mode in netcp modules. If a netcp module needs to be notified when upper layer transitions from one rx mode to another and react accordingly, such a module will implement the new API set_rx_mode added in this patch. Currently rx modes supported are PROMISCUOUS and NON_PROMISCUOUS modes. Signed-off-by: WingMan Kwok <w-kwok2@ti.com> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
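An optional per-module hook of this kind can be sketched as follows. This is a generic illustration of the callback pattern, not the netcp module structure; the class names, the `name` attribute, and `notify_rx_mode` are hypothetical, and only the hook name set_rx_mode comes from the commit message.

```python
class NetcpModule:
    """Base module: set_rx_mode is optional and absent by default."""

    name = "base"
    set_rx_mode = None


class EthssModule(NetcpModule):
    """A module that cares about rx mode changes and implements the hook."""

    name = "ethss"

    def __init__(self):
        self.promiscuous = False

    def set_rx_mode(self, promiscuous: bool) -> None:
        # A real driver would reprogram hardware here (for example the ALE).
        self.promiscuous = promiscuous


def notify_rx_mode(modules, promiscuous: bool):
    """Call set_rx_mode on every module that implements it.

    Returns the names of the modules that were notified.
    """
    notified = []
    for module in modules:
        hook = getattr(module, "set_rx_mode", None)
        if hook is None:
            continue  # module did not opt in to rx mode notifications
        hook(promiscuous)
        notified.append(module.name)
    return notified
```

Modules that do not need the notification simply leave the hook unset, so existing modules are unaffected by the new API.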
2018-04-19 | net: netcp: support probe deferral | Murali Karicheri | 1 | -0/+4
The netcp driver shouldn't proceed until the knav qmss and dma devices are ready. So return -EPROBE_DEFER if these devices are not ready. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
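Probe deferral can be modelled as "return a special error until the dependencies are ready, and let the core retry later". The sketch below is a user-space model only; 517 is the numeric value of the kernel's EPROBE_DEFER, while `netcp_probe` and `run_probe_passes` are hypothetical stand-ins for the driver probe and the driver core's retry logic.

```python
EPROBE_DEFER = 517  # numeric value of the kernel's EPROBE_DEFER


def netcp_probe(qmss_ready: bool, dma_ready: bool) -> int:
    """Return 0 on success, or -EPROBE_DEFER if a dependency is not ready."""
    if not (qmss_ready and dma_ready):
        return -EPROBE_DEFER
    return 0


def run_probe_passes(readiness_per_pass):
    """Retry the probe once per pass until it succeeds; return every result."""
    results = []
    for qmss_ready, dma_ready in readiness_per_pass:
        rc = netcp_probe(qmss_ready, dma_ready)
        results.append(rc)
        if rc == 0:
            break  # probed successfully, no further retries needed
    return results
```

With dependencies becoming ready one at a time, `run_probe_passes([(False, False), (True, False), (True, True)])` defers twice and then succeeds.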
2018-04-19 | Revert "net: netcp: remove dead code from the driver" | Murali Karicheri | 1 | -0/+9
The probe sequence is not guaranteed, contrary to the assumption of commit 2d8e276a9030 ("net: netcp: remove dead code from the driver"), so that commit has to be reverted. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | net: netcp: ethss: use of_get_phy_mode() to support different RGMII modes | Murali Karicheri | 1 | -0/+18
The phy used for K2G allows for internal delays to be added optionally to the clock circuitry based on board design. To add this support, enhance the driver to use of_get_phy_mode() to read the phy-mode from the phy device and pass the same to phy through of_phy_connect(). Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | net: netcp: ethss: re-use stats handling code for 2u hardware | Murali Karicheri | 1 | -1/+1
The stats block in 2u cpsw hardware is similar to the one on nu and hence handle it in a similar way by using a macro that includes 2u hardware as well. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | net: netcp: ethss: map vlan priorities to zero flow | Murali Karicheri | 1 | -0/+9
The driver currently supports only vlan priority zero. So map the vlan priorities to zero flow in hardware. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | net: netcp: ethss: use rgmii link status for 2u cpsw hardware | Murali Karicheri | 1 | -7/+27
Introduce rgmii link status to handle link state events for 2u cpsw hardware on K2G. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | net: netcp: ethss: add support for handling rgmii link interface | Murali Karicheri | 2 | -4/+13
2u cpsw hardware on K2G uses rgmii link to interface with Phy. So add support for this interface in the code so that driver can be re-used for 2u hardware. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | net: netcp: ethss: make sgmii configuration conditional | Murali Karicheri | 1 | -5/+13
As a preparatory patch to add support for 2u cpsw hardware found on K2G SoC, make sgmii configuration conditional. This is required since 2u uses RGMII interface instead of SGMII and to allow for driver re-use. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | net: netcp: ethss: use macro for checking ss_version consistently | Murali Karicheri | 1 | -13/+16
The driver currently uses macros for NU and XGBE hardware, while in other places, for older hardware such as that on the K2H/K SoC (version 1.4 of the cpsw hardware), it explicitly checks the ss_version inline. Add a new macro for version 1.4 and use it to customize code in the driver. While at it, also fix a similar issue with checking the XGBE version by re-using the existing macro IS_SS_ID_XGBE(). Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | soc: ti: K2G: provide APIs to support driver probe deferral | Murali Karicheri | 4 | -0/+29
This patch provides APIs to allow client drivers to support probe deferral. On K2G SoC, devices can be probed only after the ti_sci_pm_domains driver is probed and ready. As drivers may get probed in a different order, any driver that depends on the knav dma and qmss drivers, for example the netcp network driver, needs to defer probe until the knav devices are probed and ready to service. To do this, add an API to query the device ready status from the knav dma and qmss devices. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 | soc: ti: K2G: enhancement to support QMSS in K2G NAVSS | Murali Karicheri | 3 | -23/+82
The Navigator Subsystem (NAVSS) available on K2G SoC has a cut-down version of QMSS with fewer queues, internal linking RAM with fewer buffers, etc. It doesn't have the status and explicit push register space found in the QMSS on other K2 SoCs. So define reg indices specific to QMSS on K2G. This patch introduces the "ti,66ak2g-navss-qm" compatibility to identify QMSS on K2G NAVSS and to customize the dts handling code. Per the device manual, descriptors with index less than or equal to regions0_size are in region 0 in the case of K2 QMSS, whereas for QMSS on K2G, descriptors with index less than regions0_size are in region 0. So update the size accordingly in the regions0_size bits of the linking ram size 0 register. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: WingMan Kwok <w-kwok2@ti.com> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
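The off-by-one between the two QMSS variants can be made concrete. This sketch assumes the goal is for exactly the first N descriptors (indices 0..N-1) to land in region 0; `regions0_size_field` and `in_region0` are hypothetical helpers that restate the two membership rules quoted above.

```python
def regions0_size_field(num_descriptors: int, is_k2g: bool) -> int:
    """Value to program so that indices 0..num_descriptors-1 are in region 0."""
    # K2 QMSS:  index <= field  ->  program N - 1
    # K2G QMSS: index <  field  ->  program N
    return num_descriptors if is_k2g else num_descriptors - 1


def in_region0(index: int, field: int, is_k2g: bool) -> bool:
    """Membership rule for region 0 as described for each variant."""
    return index < field if is_k2g else index <= field
```

With either variant, the last descriptor index N-1 is inside region 0 and index N is outside it, which is why the programmed value has to differ by one.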
2018-04-18 | Merge branch 'ipv6-Separate-data-structures-for-FIB-and-data-path' | David S. Miller | 19 | -1136/+1218
David Ahern says:
====================
net/ipv6: Separate data structures for FIB and data path

IPv6 uses the same data struct for both control plane (FIB entries) and data path (dst entries). This struct has elements needed for both paths, adding memory overhead and complexity (taking a dst hold in most places but an additional reference on rt6i_ref in a few). Furthermore, because of the dst_alloc tie, all FIB entries are allocated with GFP_ATOMIC.

This patch set separates FIB entries from dst entries, better aligning IPv6 code with IPv4, simplifying the reference counting and allowing FIB entries added by userspace (not autoconf) to use GFP_KERNEL. It is the first step to a number of performance and scalability changes.

The end result of this patch set:
- FIB entries (fib6_info):
    /* size: 208, cachelines: 4, members: 25 */
    /* sum members: 207, holes: 1, sum holes: 1 */
- dst entries (rt6_info):
    /* size: 240, cachelines: 4, members: 11 */

Versus the single rt6_info struct today for both paths:
    /* size: 320, cachelines: 5, members: 28 */

This amounts to a 35% reduction in memory use for FIB entries and a 25% reduction for dst entries.

With respect to locking, FIB entries use RCU and a single atomic counter with fib6_info_hold and fib6_info_release helpers to manage the reference counting. dst entries use only the traditional dst refcounts with dst_hold and dst_release.

FIB entries for host routes are referenced by inet6_ifaddr and ifacaddr6. In both cases, additional holds are taken -- similar to what is done for devices.

This set is the first of many changes to improve the scalability of the IPv6 code. Follow-on changes include:
- consolidating duplicate fib6_info references like IPv4 does with duplicate fib_info
- moving fib6_info into a slab cache to avoid allocation roundups to power of 2 (the 208 size becomes a 256 actual allocation)
- allowing FIB lookups without generating a dst (e.g., most rt6_lookup users just want to verify the egress device). This means moving dst allocation to the other side of fib6_rule_lookup, which again aligns with IPv4 behavior
- using separate standalone nexthop objects, which have performance benefits beyond fib_info consolidation

At this point I am not seeing any refcount leaks or underflows, no oops or bug_ons, or warnings from kasan, so I think it is ready for others to beat up on it, finding errors in code paths I have missed.

v2 changes
- rebased to top of tree
- improved commit message on patch 7

v1 changes
- rebased to top of tree
- fix memory leak of metrics as noted by Ido
- MTU fixes based on pmtu tests (thanks Stefano Brivio for writing)

RFC v2 changes
- improved commit messages
- move common metrics code from dst.c to net/ipv4/metrics.c (comment from DaveM)
- address comments from Wei Wang and Martin KaFai Lau (let me know if I missed something)
- fixes detected by kernel test robots
  + added fib6_metric_set to change metric on a FIB entry which could be pointing to read-only dst_default_metrics
  + 0day testing found a problem with an intermediate patch; added dst_hold_safe on rt->from. Code is removed 3 patches later
- allow cacheinfo to handle NULL dst; means only expires is pushed to userspace
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
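The percentages and the allocation roundup quoted in the cover letter can be reproduced from the struct sizes it lists. This is plain arithmetic on those numbers; `reduction` and `next_pow2` are hypothetical helper names, and the power-of-two roundup models the behaviour the cover letter describes for the current allocation.

```python
def reduction(old_size: int, new_size: int) -> float:
    """Fractional memory saving when going from old_size to new_size bytes."""
    return 1 - new_size / old_size


def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


if __name__ == "__main__":
    OLD_RT6_INFO = 320   # single struct used for both paths
    FIB6_INFO = 208      # new FIB entry
    RT6_INFO = 240       # new dst entry
    print(f"FIB entries: {reduction(OLD_RT6_INFO, FIB6_INFO):.0%} smaller")
    print(f"dst entries: {reduction(OLD_RT6_INFO, RT6_INFO):.0%} smaller")
    print(f"208-byte allocation rounds up to {next_pow2(FIB6_INFO)} bytes")
```

The roundup is the reason the cover letter lists a dedicated slab cache as follow-on work: until then, part of the 35% saving for FIB entries is lost to the 256-byte allocation.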
2018-04-18 | net/ipv6: Remove unused code and variables for rt6_info | David Ahern | 4 | -106/+5
Drop unneeded elements from rt6_info struct and rearrange layout to something more relevant for the data path. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18 | net/ipv6: Flip FIB entries to fib6_info | David Ahern | 10 | -270/+271
Convert all code paths referencing a FIB entry from rt6_info to fib6_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18 | net/ipv6: separate handling of FIB entries from dst based routes | David Ahern | 8 | -152/+115
Last step before flipping the data type for FIB entries:
- use fib6_info_alloc to create FIB entries in ip6_route_info_create and addrconf_dst_alloc
- use fib6_info_release in place of dst_release, ip6_rt_put and rt6_release
- remove the dst_hold before calling __ip6_ins_rt or ip6_del_rt
- when purging routes, drop per-cpu routes
- replace inc and dec of rt6i_ref with fib6_info_hold and fib6_info_release
- use rt->from since it points to the FIB entry
- drop references to exception bucket, fib6_metrics and per-cpu from dst entries (those are relevant for fib entries only)
Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>