summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2020-04-27bridge: mrp: Implement netlink interface to configure MRPHoratiu Vultur1-0/+91
Implement netlink interface to configure MRP. The implementation will do sanity checks over the attributes and then eventually call the MRP interface. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27bridge: mrp: Connect MRP API with the switchdev APIHoratiu Vultur3-1/+589
Implement the MRP API. In case the HW can't generate MRP Test frames then the SW will try to generate the frames. In case that also the SW will fail in generating the frames then a error is return to the userspace. The userspace is responsible to generate all the other MRP frames regardless if the test frames are generated by HW or SW. The forwarding/termination of MRP frames is happening in the kernel and is done by the MRP instance. The userspace application doesn't do the forwarding. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27bridge: switchdev: mrp: Implement MRP API for switchdevHoratiu Vultur2-0/+142
Implement the MRP api for switchdev. These functions will just eventually call the switchdev functions: switchdev_port_obj_add/del and switchdev_port_attr_set. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27bridge: mrp: Add MRP interface.Horatiu Vultur1-0/+63
Define the MRP interface. This interface is used by the netlink to update the MRP instances and by the MRP to make the calls to switchdev to offload it to HW. It defines an MRP instance 'struct br_mrp' which is a list of MRP instances. Which will be part of the 'struct net_bridge'. Each instance has 2 ring ports, a bridge and an ID. In case the HW can't generate MRP Test frames then the SW will generate those. br_mrp_add - adds a new MRP instance. br_mrp_del - deletes an existing MRP instance. Each instance has an ID(ring_id). br_mrp_set_port_state - changes the port state. The port can be in forwarding state, which means that the frames can pass through or in blocked state which means that the frames can't pass through except MRP frames. This will eventually call the switchdev API to notify the HW. This information is used also by the SW bridge to know how to forward frames in case the HW doesn't have this capability. br_mrp_set_port_role - a port role can be primary or secondary. This information is required to be pushed to HW in case the HW can generate MRP_Test frames. Because the MRP_Test frames contains a file with this information. Otherwise the HW will not be able to generate the frames correctly. br_mrp_set_ring_state - a ring can be in state open or closed. State open means that the mrp port stopped receiving MRP_Test frames, while closed means that the mrp port received MRP_Test frames. Similar with br_mrp_port_role, this information is pushed in HW because the MRP_Test frames contain this information. br_mrp_set_ring_role - a ring can have the following roles MRM or MRC. For the role MRM it is expected that the HW can terminate the MRP frames, notify the SW that it stopped receiving MRP_Test frames and trapp all the other MRP frames. While for MRC mode it is expected that the HW can forward the MRP frames only between the MRP ports and copy MRP_Topology frames to CPU. In case the HW doesn't support a role it needs to return an error code different than -EOPNOTSUPP. br_mrp_start_test - this starts/stops the generation of MRP_Test frames. To stop the generation of frames the interval needs to have a value of 0. In this case the userspace needs to know if the HW supports this or not. Not to have duplicate frames(generated by HW and SW). Because if the HW supports this then the SW will not generate anymore frames and will expect that the HW will notify when it stopped receiving MRP frames using the function br_mrp_port_open. br_mrp_port_open - this function is used by drivers to notify the userspace via a netlink callback that one of the ports stopped receiving MRP_Test frames. This function is called only when the node has the role MRM. It is not supposed to be called from userspace. br_mrp_port_switchdev_add - this corresponds to the function br_mrp_add, and will notify the HW that a MRP instance is added. The function gets as parameter the MRP instance. br_mrp_port_switchdev_del - this corresponds to the function br_mrp_del, and will notify the HW that a MRP instance is removed. The function gets as parameter the ID of the MRP instance that is removed. br_mrp_port_switchdev_set_state - this corresponds to the function br_mrp_set_port_state. It would notify the HW if it should block or not non-MRP frames. br_mrp_port_switchdev_set_port - this corresponds to the function br_mrp_set_port_role. It would set the port role, primary or secondary. br_mrp_switchdev_set_role - this corresponds to the function br_mrp_set_ring_role and would set one of the role MRM or MRC. br_mrp_switchdev_set_ring_state - this corresponds to the function br_mrp_set_ring_state and would set the ring to be open or closed. br_mrp_switchdev_send_ring_test - this corresponds to the function br_mrp_start_test. This will notify the HW to start or stop generating MRP_Test frames. Value 0 for the interval parameter means to stop generating the frames. br_mrp_port_open - this function is used to notify the userspace that the port lost the continuity of MRP Test frames. Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27net: bridge: Add port attribute IFLA_BRPORT_MRP_RING_OPENHoratiu Vultur1-0/+3
This patch adds a new port attribute, IFLA_BRPORT_MRP_RING_OPEN, which allows to notify the userspace when the port lost the continuite of MRP frames. This attribute is set by kernel whenever the SW or HW detects that the ring is being open or closed. Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27bridge: mrp: Extend bridge interfaceHoratiu Vultur1-0/+4
To integrate MRP into the bridge, first the bridge needs to be aware of ports that are part of an MRP ring and which rings are on the bridge. Therefore extend bridge interface with the following: - add new flag(BR_MPP_AWARE) to the net bridge ports, this bit will be set when the port is added to an MRP instance. In this way it knows if the frame was received on MRP ring port - add new flag(BR_MRP_LOST_CONT) to the net bridge ports, this bit will be set when the port lost the continuity of MRP Test frames. - add a list of MRP instances Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27bridge: mrp: Update KconfigHoratiu Vultur1-0/+12
Add the option BRIDGE_MRP to allow to build in or not MRP support. The default value is N. Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27net: rtnetlink: remove redundant assignment to variable errColin Ian King1-1/+1
The variable err is being initializeed with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-26net: openvswitch: use div_u64() for 64-by-32 divisionsTonghao Zhang1-1/+1
Compile the kernel for arm 32 platform, the build warning found. To fix that, should use div_u64() for divisions. | net/openvswitch/meter.c:396: undefined reference to `__udivdi3' [add more commit msg, change reported tag, and use div_u64 instead of do_div by Tonghao] Fixes: e57358873bb5d6ca ("net: openvswitch: use u64 for meter bucket") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-26net: openvswitch: suitable access to the dp_metersTonghao Zhang1-3/+3
To fix the following sparse warning: | net/openvswitch/meter.c:109:38: sparse: sparse: incorrect type | in assignment (different address spaces) ... | net/openvswitch/meter.c:720:45: sparse: sparse: incorrect type | in argument 1 (different address spaces) ... Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-26dccp: remove unused inline function dccp_set_seqnoYueHaibing1-5/+0
There's no callers in-tree since commit 792b48780e8b ("dccp: Implement both feature-local and feature-remote Sequence Window feature") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-26hsr: remove unnecessary code in hsr_dev_change_mtu()Taehee Yoo1-3/+1
In the hsr_dev_change_mtu(), the 'dev' and 'master->dev' pointer are same. So, the 'master' variable and some code are unnecessary. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-26tcp: mptcp: use mptcp receive buffer space to select rcv windowFlorian Westphal2-2/+24
In MPTCP, the receive window is shared across all subflows, because it refers to the mptcp-level sequence space. MPTCP receivers already place incoming packets on the mptcp socket receive queue and will charge it to the mptcp socket rcvbuf until userspace consumes the data. Update __tcp_select_window to use the occupancy of the parent/mptcp socket instead of the subflow socket in case the tcp socket is part of a logical mptcp connection. This commit doesn't change choice of initial window for passive or active connections. While it would be possible to change those as well, this adds complexity (especially when handling MP_JOIN requests). Furthermore, the MPTCP RFC specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field of a TCP segment at the connection level if it does not also carry a DSS option with a Data ACK field.' SYN/SYNACK packets do not carry a DSS option with a Data ACK field. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller38-134/+238
Simple overlapping changes to linux/vermagic.h Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds29-105/+191
Pull networking fixes from David Miller: 1) Fix memory leak in netfilter flowtable, from Roi Dayan. 2) Ref-count leaks in netrom and tipc, from Xiyu Yang. 3) Fix warning when mptcp socket is never accepted before close, from Florian Westphal. 4) Missed locking in ovs_ct_exit(), from Tonghao Zhang. 5) Fix large delays during PTP synchornization in cxgb4, from Rahul Lakkireddy. 6) team_mode_get() can hang, from Taehee Yoo. 7) Need to use kvzalloc() when allocating fw tracer in mlx5 driver, from Niklas Schnelle. 8) Fix handling of bpf XADD on BTF memory, from Jann Horn. 9) Fix BPF_STX/BPF_B encoding in x86 bpf jit, from Luke Nelson. 10) Missing queue memory release in iwlwifi pcie code, from Johannes Berg. 11) Fix NULL deref in macvlan device event, from Taehee Yoo. 12) Initialize lan87xx phy correctly, from Yuiko Oshino. 13) Fix looping between VRF and XFRM lookups, from David Ahern. 14) etf packet scheduler assumes all sockets are full sockets, which is not necessarily true. From Eric Dumazet. 15) Fix mptcp data_fin handling in RX path, from Paolo Abeni. 16) fib_select_default() needs to handle nexthop objects, from David Ahern. 17) Use GFP_ATOMIC under spinlock in mac80211_hwsim, from Wei Yongjun. 18) vxlan and geneve use wrong nlattr array, from Sabrina Dubroca. 19) Correct rx/tx stats in bcmgenet driver, from Doug Berger. 20) BPF_LDX zero-extension is encoded improperly in x86_32 bpf jit, fix from Luke Nelson. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (100 commits) selftests/bpf: Fix a couple of broken test_btf cases tools/runqslower: Ensure own vmlinux.h is picked up first bpf: Make bpf_link_fops static bpftool: Respect the -d option in struct_ops cmd selftests/bpf: Add test for freplace program with expected_attach_type bpf: Propagate expected_attach_type when verifying freplace programs bpf: Fix leak in LINK_UPDATE and enforce empty old_prog_fd bpf, x86_32: Fix logic error in BPF_LDX zero-extension bpf, x86_32: Fix clobbering of dst for BPF_JSET bpf, x86_32: Fix incorrect encoding in BPF_LDX zero-extension bpf: Fix reStructuredText markup net: systemport: suppress warnings on failed Rx SKB allocations net: bcmgenet: suppress warnings on failed Rx SKB allocations macsec: avoid to set wrong mtu mac80211: sta_info: Add lockdep condition for RCU list usage mac80211: populate debugfs only after cfg80211 init net: bcmgenet: correct per TX/RX ring statistics net: meth: remove spurious copyright text net: phy: bcm84881: clear settings on link down chcr: Fix CPU hard lockup ...
2020-04-25net: phylink, dsa: eliminate phylink_fixed_state_cb()Russell King1-9/+11
Move the callback into the phylink_config structure, rather than providing a callback to set this up. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25net: sched: report ndo_setup_tc failures via extackJesper Dangaard Brouer1-1/+4
Help end-users of the 'tc' command to see if the drivers ndo_setup_tc function call fails. Troubleshooting when this happens is non-trivial (see full process here[1]), and results in net_device getting assigned the 'qdisc noop', which will drop all TX packets on the interface. [1]: https://github.com/xdp-project/xdp-project/blob/master/areas/arm64/board_nxp_ls1088/nxp-board04-troubleshoot-qdisc.org Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24Merge tag 'mac80211-for-net-2020-04-24' of ↵David S. Miller5-20/+45
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Just three changes: * fix a wrong GFP_KERNEL in hwsim * fix the debugfs mess after the mac80211 registration race fix * suppress false-positive RCU list lockdep warnings ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24mac80211: sta_info: Add lockdep condition for RCU list usageMadhuparna Bhowmik1-1/+2
The function sta_info_get_by_idx() uses RCU list primitive. It is called with local->sta_mtx held from mac80211/cfg.c. Add lockdep expression to avoid any false positive RCU list warnings. Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Link: https://lore.kernel.org/r/20200409082906.27427-1-madhuparnabhowmik10@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24mac80211: populate debugfs only after cfg80211 initJohannes Berg4-19/+43
When fixing the initialization race, we neglected to account for the fact that debugfs is initialized in wiphy_register(), and some debugfs things went missing (or rather were rerooted to the global debugfs root). Fix this by adding debugfs entries only after wiphy_register(). This requires some changes in the rate control code since it currently adds debugfs at alloc time, which can no longer be done after the reordering. Reported-by: Jouni Malinen <j@w1.fi> Reported-by: kernel test robot <rong.a.chen@intel.com> Reported-by: Hauke Mehrtens <hauke@hauke-m.de> Reported-by: Felix Fietkau <nbd@nbd.name> Cc: stable@vger.kernel.org Fixes: 52e04b4ce5d0 ("mac80211: fix race in ieee80211_register_hw()") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lore.kernel.org/r/20200423111344.0e00d3346f12.Iadc76a03a55093d94391fc672e996a458702875d@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24net: openvswitch: use u64 for meter bucketTonghao Zhang2-2/+2
When setting the meter rate to 4+Gbps, there is an overflow, the meters don't work as expected. Cc: Pravin B Shelar <pshelar@ovn.org> Cc: Andy Zhou <azhou@ovn.org> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24net: openvswitch: make EINVAL return value more obviousTonghao Zhang1-3/+2
Cc: Pravin B Shelar <pshelar@ovn.org> Cc: Andy Zhou <azhou@ovn.org> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24net: openvswitch: remove the unnecessary checkTonghao Zhang1-5/+4
Before invoking the ovs_meter_cmd_reply_stats, "meter" was checked, so don't check it agin in that function. Cc: Pravin B Shelar <pshelar@ovn.org> Cc: Andy Zhou <azhou@ovn.org> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24net: openvswitch: set max limitation to metersTonghao Zhang2-10/+49
Don't allow user to create meter unlimitedly, which may cause to consume a large amount of kernel memory. The max number supported is decided by physical memory and 20K meters as default. Cc: Pravin B Shelar <pshelar@ovn.org> Cc: Andy Zhou <azhou@ovn.org> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24net: openvswitch: expand the meters supported numberTonghao Zhang3-63/+195
In kernel datapath of Open vSwitch, there are only 1024 buckets of meter in one datapath. If installing more than 1024 (e.g. 8192) meters, it may lead to the performance drop. But in some case, for example, Open vSwitch used as edge gateway, there should be 20K at least, where meters used for IP address bandwidth limitation. [Open vSwitch userspace datapath has this issue too.] For more scalable meter, this patch use meter array instead of hash tables, and expand/shrink the array when necessary. So we can install more meters than before in the datapath. Introducing the struct *dp_meter_instance, it's easy to expand meter though changing the *ti point in the struct *dp_meter_table. Cc: Pravin B Shelar <pshelar@ovn.org> Cc: Andy Zhou <azhou@ovn.org> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24net: sched : Remove unnecessary cast in kfreeXu Wang1-1/+1
Remove unnecassary casts in the argument to kfree. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24net/x25: Fix x25_neigh refcnt leak when receiving frameXiyu Yang1-1/+3
x25_lapb_receive_frame() invokes x25_get_neigh(), which returns a reference of the specified x25_neigh object to "nb" with increased refcnt. When x25_lapb_receive_frame() returns, local variable "nb" becomes invalid, so the refcount should be decreased to keep refcount balanced. The reference counting issue happens in one path of x25_lapb_receive_frame(). When pskb_may_pull() returns false, the function forgets to decrease the refcnt increased by x25_get_neigh(), causing a refcnt leak. Fix this issue by calling x25_neigh_put() when pskb_may_pull() returns false. Fixes: cb101ed2c3c7 ("x25: Handle undersized/fragmented skbs") Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24mptcp/pm_netlink.c : add check for nla_put_in/6_addrBo YU1-5/+7
Normal there should be checked for nla_put_in6_addr like other usage in net. Detected by CoverityScan, CID# 1461639 Fixes: 01cacb00b35c ("mptcp: add netlink-based PM") Signed-off-by: Bo YU <tsu.yubo@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23net: napi: use READ_ONCE()/WRITE_ONCE()Eric Dumazet2-5/+5
gro_flush_timeout and napi_defer_hard_irqs can be read from napi_complete_done() while other cpus write the value, whithout explicit synchronization. Use READ_ONCE()/WRITE_ONCE() to annotate the races. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23net: napi: add hard irqs deferral featureEric Dumazet2-11/+36
Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer") we added the ability to arm one high resolution timer, that we used to keep not-complete packets in GRO engine a bit longer, hoping that further frames might be added to them. Since then, we added the napi_complete_done() interface, and commit 364b6055738b ("net: busy-poll: return busypolling status to drivers") allowed drivers to avoid re-arming NIC interrupts if we made a promise that their NAPI poll() handler would be called in the near future. This infrastructure can be leveraged, thanks to a new device parameter, which allows to arm the napi hrtimer, instead of re-arming the device hard IRQ. We have noticed that on some servers with 32 RX queues or more, the chit-chat between the NIC and the host caused by IRQ delivery and re-arming could hurt throughput by ~20% on 100Gbit NIC. In contrast, hrtimers are using local (percpu) resources and might have lower cost. The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy than gro_flush_timeout (/sys/class/net/ethX/) By default, both gro_flush_timeout and napi_defer_hard_irqs are zero. This patch does not change the prior behavior of gro_flush_timeout if used alone : NIC hard irqs should be rearmed as before. One concrete usage can be : echo 20000 >/sys/class/net/eth1/gro_flush_timeout echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs If at least one packet is retired, then we will reset napi counter to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans of the queue. On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ avoidance was only possible if napi->poll() was exhausting its budget and not call napi_complete_done(). This feature also can be used to work around some non-optimal NIC irq coalescing strategies. Having the ability to insert XX usec delays between each napi->poll() can increase cache efficiency, since we increase batch sizes. It also keeps serving cpus not idle too long, reducing tail latencies. Co-developed-by: Luigi Rizzo <lrizzo@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23net: dsa: add GRO support via gro_cellsAlexander Lobakin4-2/+14
gro_cells lib is used by different encapsulating netdevices, such as geneve, macsec, vxlan etc. to speed up decapsulated traffic processing. CPU tag is a sort of "encapsulation", and we can use the same mechs to greatly improve overall DSA performance. skbs are passed to the GRO layer after removing CPU tags, so we don't need any new packet offload types as it was firstly proposed by me in the first GRO-over-DSA variant [1]. The size of struct gro_cells is sizeof(void *), so hot struct dsa_slave_priv becomes only 4/8 bytes bigger, and all critical fields remain in one 32-byte cacheline. The other positive side effect is that drivers for network devices that can be shipped as CPU ports of DSA-driven switches can now use napi_gro_frags() to pass skbs to kernel. Packets built that way are completely non-linear and are likely being dropped without GRO. This was tested on to-be-mainlined-soon Ethernet driver that uses napi_gro_frags(), and the overall performance was on par with the variant from [1], sometimes even better due to minimal overhead. net.core.gro_normal_batch tuning may help to push it to the limit on particular setups and platforms. iperf3 IPoE VLAN NAT TCP forwarding (port1.218 -> port0) setup on 1.2 GHz MIPS board: 5.7-rc2 baseline: [ID] Interval Transfer Bitrate Retr [ 5] 0.00-120.01 sec 9.00 GBytes 644 Mbits/sec 413 sender [ 5] 0.00-120.00 sec 8.99 GBytes 644 Mbits/sec receiver Iface RX packets TX packets eth0 7097731 7097702 port0 426050 6671829 port1 6671681 425862 port1.218 6671677 425851 With this patch: [ID] Interval Transfer Bitrate Retr [ 5] 0.00-120.01 sec 12.2 GBytes 870 Mbits/sec 122 sender [ 5] 0.00-120.00 sec 12.2 GBytes 870 Mbits/sec receiver Iface RX packets TX packets eth0 9474792 9474777 port0 455200 353288 port1 9019592 455035 port1.218 353144 455024 v2: - Add some performance examples in the commit message; - No functional changes. [1] https://lore.kernel.org/netdev/20191230143028.27313-1-alobakin@dlink.ru/ Signed-off-by: Alexander Lobakin <bloodyreaper@yandex.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23ipv6: Honor all IPv6 PIO Valid Lifetime valuesFernando Gont1-20/+7
RFC4862 5.5.3 e) prevents received Router Advertisements from reducing the Valid Lifetime of configured addresses to less than two hours, thus preventing hosts from reacting to the information provided by a router that has positive knowledge that a prefix has become invalid. This patch makes hosts honor all Valid Lifetime values, as per draft-gont-6man-slaac-renum-06, Section 4.2. This is meant to help mitigate the problem discussed in draft-ietf-v6ops-slaac-renum. Note: Attacks aiming at disabling an advertised prefix via a Valid Lifetime of 0 are not really more harmful than other attacks that can be performed via forged RA messages, such as those aiming at completely disabling a next-hop router via an RA that advertises a Router Lifetime of 0, or performing a Denial of Service (DoS) attack by advertising illegitimate prefixes via forged PIOs. In scenarios where RA-based attacks are of concern, proper mitigations such as RA-Guard [RFC6105] [RFC7113] should be implemented. Signed-off-by: Fernando Gont <fgont@si6networks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23Merge tag 'nfsd-5.7-rc-1' of git://git.linux-nfs.org/projects/cel/cel-2.6Linus Torvalds9-29/+47
Pull nfsd fixes from Chuck Lever: "The first set of 5.7-rc fixes for NFS server issues. These were all unresolved at the time the 5.7 window opened, and needed some additional time to ensure they were correctly addressed. They are ready now. At the moment I know of one more urgent issue regarding the NFS server. A fix has been tested and is under review. I expect to send one more pull request, containing this fix (which now consists of 3 patches). Fixes: - Address several use-after-free and memory leak bugs - Prevent a backchannel livelock" * tag 'nfsd-5.7-rc-1' of git://git.linux-nfs.org/projects/cel/cel-2.6: svcrdma: Fix leak of svc_rdma_recv_ctxt objects svcrdma: Fix trace point use-after-free race SUNRPC: Fix backchannel RPC soft lockups SUNRPC/cache: Fix unsafe traverse caused double-free in cache_purge nfsd: memory corruption in nfsd4_lock()
2020-04-23ipv4: Update fib_select_default to handle nexthop objectsDavid Ahern1-3/+3
A user reported [0] hitting the WARN_ON in fib_info_nh: [ 8633.839816] ------------[ cut here ]------------ [ 8633.839819] WARNING: CPU: 0 PID: 1719 at include/net/nexthop.h:251 fib_select_path+0x303/0x381 ... [ 8633.839846] RIP: 0010:fib_select_path+0x303/0x381 ... [ 8633.839848] RSP: 0018:ffffb04d407f7d00 EFLAGS: 00010286 [ 8633.839850] RAX: 0000000000000000 RBX: ffff9460b9897ee8 RCX: 00000000000000fe [ 8633.839851] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000000 [ 8633.839852] RBP: ffff946076049850 R08: 0000000059263a83 R09: ffff9460840e4000 [ 8633.839853] R10: 0000000000000014 R11: 0000000000000000 R12: ffffb04d407f7dc0 [ 8633.839854] R13: ffffffffa4ce3240 R14: 0000000000000000 R15: ffff9460b7681f60 [ 8633.839857] FS: 00007fcac2e02700(0000) GS:ffff9460bdc00000(0000) knlGS:0000000000000000 [ 8633.839858] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8633.839859] CR2: 00007f27beb77e28 CR3: 0000000077734000 CR4: 00000000000006f0 [ 8633.839867] Call Trace: [ 8633.839871] ip_route_output_key_hash_rcu+0x421/0x890 [ 8633.839873] ip_route_output_key_hash+0x5e/0x80 [ 8633.839876] ip_route_output_flow+0x1a/0x50 [ 8633.839878] __ip4_datagram_connect+0x154/0x310 [ 8633.839880] ip4_datagram_connect+0x28/0x40 [ 8633.839882] __sys_connect+0xd6/0x100 ... The WARN_ON is triggered in fib_select_default which is invoked when there are multiple default routes. Update the function to use fib_info_nhc and convert the nexthop checks to use fib_nh_common. Add test case that covers the affected code path. [0] https://github.com/FRRouting/frr/issues/6089 Fixes: 493ced1ac47c ("ipv4: Allow routes to use nexthop objects") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23netlabel: Kconfig: Update reference for NetLabel Tools projectSalvatore Bonaccorso1-1/+1
The NetLabel Tools project has moved from http://netlabel.sf.net to a GitHub project. Update to directly refer to the new home for the tools. Signed-off-by: Salvatore Bonaccorso <carnil@debian.org> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23mptcp: fix data_fin handing in RX pathPaolo Abeni1-2/+1
The data fin flag is set only via a DSS option, but mptcp_incoming_options() copies it unconditionally from the provided RX options. Since we do not clear all the mptcp sock RX options in a socket free/alloc cycle, we can end-up with a stray data_fin value while parsing e.g. MPC packets. That would lead to mapping data corruption and will trigger a few WARN_ON() in the RX path. Instead of adding a costly memset(), fetch the data_fin flag only for DSS packets - when we always explicitly initialize such bit at option parsing time. Fixes: 648ef4b88673 ("mptcp: Implement MPTCP receive path") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23net: caif: use true,false for bool variablesJason Yan1-4/+4
Fix the following coccicheck warning: net/caif/caif_dev.c:410:2-13: WARNING: Assignment of 0/1 to bool variable net/caif/caif_dev.c:445:2-13: WARNING: Assignment of 0/1 to bool variable net/caif/caif_dev.c:145:1-12: WARNING: Assignment of 0/1 to bool variable net/caif/caif_dev.c:223:1-12: WARNING: Assignment of 0/1 to bool variable Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23sctp: Fix SHUTDOWN CTSN Ack in the peer restart caseJere Leppänen1-1/+5
When starting shutdown in sctp_sf_do_dupcook_a(), get the value for SHUTDOWN Cumulative TSN Ack from the new association, which is reconstructed from the cookie, instead of the old association, which the peer doesn't have anymore. Otherwise the SHUTDOWN is either ignored or replied to with an ABORT by the peer because CTSN Ack doesn't match the peer's Initial TSN. Fixes: bdf6fa52f01b ("sctp: handle association restarts when the socket is closed.") Signed-off-by: Jere Leppänen <jere.leppanen@nokia.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23sctp: Fix bundling of SHUTDOWN with COOKIE-ACKJere Leppänen1-3/+3
When we start shutdown in sctp_sf_do_dupcook_a(), we want to bundle the SHUTDOWN with the COOKIE-ACK to ensure that the peer receives them at the same time and in the correct order. This bundling was broken by commit 4ff40b86262b ("sctp: set chunk transport correctly when it's a new asoc"), which assigns a transport for the COOKIE-ACK, but not for the SHUTDOWN. Fix this by passing a reference to the COOKIE-ACK chunk as an argument to sctp_sf_do_9_2_start_shutdown() and onward to sctp_make_shutdown(). This way the SHUTDOWN chunk is assigned the same transport as the COOKIE-ACK chunk, which allows them to be bundled. In sctp_sf_do_9_2_start_shutdown(), the void *arg parameter was previously unused. Now that we're taking it into use, it must be a valid pointer to a chunk, or NULL. There is only one call site where it's not, in sctp_sf_autoclose_timer_expire(). Fix that too. Fixes: 4ff40b86262b ("sctp: set chunk transport correctly when it's a new asoc") Signed-off-by: Jere Leppänen <jere.leppanen@nokia.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23net: dsa: don't fail to probe if we couldn't set the MTUVladimir Oltean1-5/+3
There is no reason to fail the probing of the switch if the MTU couldn't be configured correctly (either the switch port itself, or the host port) for whatever reason. MTU-sized traffic probably won't work, sure, but we can still probably limp on and support some form of communication anyway, which the users would probably appreciate more. Fixes: bfcb813203e6 ("net: dsa: configure the MTU for switch ports") Reported-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23sched: etf: do not assume all sockets are full blownEric Dumazet1-3/+4
skb->sk does not always point to a full blown socket, we need to use sk_fullsock() before accessing fields which only make sense on full socket. BUG: KASAN: use-after-free in report_sock_error+0x286/0x300 net/sched/sch_etf.c:141 Read of size 1 at addr ffff88805eb9b245 by task syz-executor.5/9630 CPU: 1 PID: 9630 Comm: syz-executor.5 Not tainted 5.7.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:382 __kasan_report.cold+0x35/0x4d mm/kasan/report.c:511 kasan_report+0x33/0x50 mm/kasan/common.c:625 report_sock_error+0x286/0x300 net/sched/sch_etf.c:141 etf_enqueue_timesortedlist+0x389/0x740 net/sched/sch_etf.c:170 __dev_xmit_skb net/core/dev.c:3710 [inline] __dev_queue_xmit+0x154a/0x30a0 net/core/dev.c:4021 neigh_hh_output include/net/neighbour.h:499 [inline] neigh_output include/net/neighbour.h:508 [inline] ip6_finish_output2+0xfb5/0x25b0 net/ipv6/ip6_output.c:117 __ip6_finish_output+0x442/0xab0 net/ipv6/ip6_output.c:143 ip6_finish_output+0x34/0x1f0 net/ipv6/ip6_output.c:153 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x239/0x810 net/ipv6/ip6_output.c:176 dst_output include/net/dst.h:435 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] ip6_xmit+0xe1a/0x2090 net/ipv6/ip6_output.c:280 tcp_v6_send_synack+0x4e7/0x960 net/ipv6/tcp_ipv6.c:521 tcp_rtx_synack+0x10d/0x1a0 net/ipv4/tcp_output.c:3916 inet_rtx_syn_ack net/ipv4/inet_connection_sock.c:669 [inline] reqsk_timer_handler+0x4c2/0xb40 net/ipv4/inet_connection_sock.c:763 call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1405 expire_timers kernel/time/timer.c:1450 [inline] __run_timers kernel/time/timer.c:1774 [inline] __run_timers kernel/time/timer.c:1741 [inline] run_timer_softirq+0x623/0x1600 kernel/time/timer.c:1787 __do_softirq+0x26c/0x9f7 kernel/softirq.c:292 invoke_softirq kernel/softirq.c:373 [inline] irq_exit+0x192/0x1d0 kernel/softirq.c:413 exiting_irq arch/x86/include/asm/apic.h:546 [inline] smp_apic_timer_interrupt+0x19e/0x600 arch/x86/kernel/apic/apic.c:1140 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829 </IRQ> RIP: 0010:des_encrypt+0x157/0x9c0 lib/crypto/des.c:792 Code: 85 22 06 00 00 41 31 dc 41 8b 4d 04 44 89 e2 41 83 e4 3f 4a 8d 3c a5 60 72 72 88 81 e2 3f 3f 3f 3f 48 89 f8 48 c1 e8 03 31 d9 <0f> b6 34 28 48 89 f8 c1 c9 04 83 e0 07 83 c0 03 40 38 f0 7c 09 40 RSP: 0018:ffffc90003b5f6c0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13 RAX: 1ffffffff10e4e55 RBX: 00000000d2f846d0 RCX: 00000000d2f846d0 RDX: 0000000012380612 RSI: ffffffff839863ca RDI: ffffffff887272a8 RBP: dffffc0000000000 R08: ffff888091d0a380 R09: 0000000000800081 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000012 R13: ffff8880a8ae8078 R14: 00000000c545c93e R15: 0000000000000006 cipher_crypt_one crypto/cipher.c:75 [inline] crypto_cipher_encrypt_one+0x124/0x210 crypto/cipher.c:82 crypto_cbcmac_digest_update+0x1b5/0x250 crypto/ccm.c:830 crypto_shash_update+0xc4/0x120 crypto/shash.c:119 shash_ahash_update+0xa3/0x110 crypto/shash.c:246 crypto_ahash_update include/crypto/hash.h:547 [inline] hash_sendmsg+0x518/0xad0 crypto/algif_hash.c:102 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x308/0x7e0 net/socket.c:2362 ___sys_sendmsg+0x100/0x170 net/socket.c:2416 __sys_sendmmsg+0x195/0x480 net/socket.c:2506 __do_sys_sendmmsg net/socket.c:2535 [inline] __se_sys_sendmmsg net/socket.c:2532 [inline] __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2532 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x45c829 Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f6d9528ec78 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00000000004fc080 RCX: 000000000045c829 RDX: 0000000000000001 RSI: 0000000020002640 RDI: 0000000000000004 RBP: 000000000078bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 00000000000008d7 R14: 00000000004cb7aa R15: 00007f6d9528f6d4 Fixes: 4b15c7075352 ("net/sched: Make etf report drops on error_queue") Fixes: 25db26a91364 ("net/sched: Introduce the ETF Qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22net: qrtr: Add tracepoint supportManivannan Sadhasivam1-9/+11
Add tracepoint support for QRTR with NS as the first candidate. Later on this can be extended to core QRTR and transport drivers. The trace_printk() used in NS has been replaced by tracepoints. Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22ila: remove unused macro 'ILA_HASH_TABLE_SIZE'YueHaibing1-2/+0
net/ipv6/ila/ila_xlat.c:604:0: warning: macro "ILA_HASH_TABLE_SIZE" is not used [-Wunused-macros] Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22net/sched: act_ct: update nf_conn_acct for act_ct SW offload in flowtablewenxu1-0/+2
When the act_ct SW offload in flowtable, The counter of the conntrack entry will never update. So update the nf_conn_acct conuter in act_ct flowtable software offload. Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22xfrm: Always set XFRM_TRANSFORMED in xfrm{4,6}_output_finishDavid Ahern2-4/+0
IPSKB_XFRM_TRANSFORMED and IP6SKB_XFRM_TRANSFORMED are skb flags set by xfrm code to tell other skb handlers that the packet has been passed through the xfrm output functions. Simplify the code and just always set them rather than conditionally based on netfilter enabled thus making the flag available for other users. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22cgroup, netclassid: remove double cond_reschedJiri Slaby1-3/+1
Commit 018d26fcd12a ("cgroup, netclassid: periodically release file_lock on classid") added a second cond_resched to write_classid indirectly by update_classid_task. Remove the one in write_classid. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-21drivers: Remove inclusion of vermagic headerLeon Romanovsky1-2/+1
Get rid of linux/vermagic.h includes, so that MODULE_ARCH_VERMAGIC from the arch header arch/x86/include/asm/module.h won't be redefined. In file included from ./include/linux/module.h:30, from drivers/net/ethernet/3com/3c515.c:56: ./arch/x86/include/asm/module.h:73: warning: "MODULE_ARCH_VERMAGIC" redefined 73 | # define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY | In file included from drivers/net/ethernet/3com/3c515.c:25: ./include/linux/vermagic.h:28: note: this is the location of the previous definition 28 | #define MODULE_ARCH_VERMAGIC "" | Fixes: 6bba2e89a88c ("net/3com: Delete driver and module versions from 3com drivers") Co-developed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Shannon Nelson <snelson@pensando.io> # ionic Acked-by: Sebastian Reichel <sre@kernel.org> # power Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2-4/+6
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) flow_block_cb memleak in nf_flow_table_offload_del_cb(), from Roi Dayan. 2) Fix error path handling in nf_nat_inet_register_fn(), from Hillf Danton. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-21net: ipv4: remove redundant assignment to variable rcColin Ian King1-1/+1
The variable rc is being assigned with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20mptcp: drop req socket remote_key* fieldsPaolo Abeni3-17/+19
We don't need them, as we can use the current ingress opt data instead. Setting them in syn_recv_sock() may causes inconsistent mptcp socket status, as per previous commit. Fixes: cc7972ea1932 ("mptcp: parse and emit MP_CAPABLE option according to v1 spec") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>