summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)AuthorFilesLines
2020-04-30tcp: add hrtimer slack to sack compressionEric Dumazet3-2/+11
Add a sysctl to control hrtimer slack, default of 100 usec. This gives the opportunity to reduce system overhead, and help very short RTT flows. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30tcp: tcp_sack_new_ofo_skb() should be more conservativeEric Dumazet1-1/+11
Currently, tcp_sack_new_ofo_skb() sends an ack if prior acks were 'compressed', if room has to be made in tp->selective_acks[] But there is no guarantee all four sack ranges can be included in SACK option. As a matter of fact, when TCP timestamps option is used, only three SACK ranges can be included. Lets assume only two ranges can be included, and force the ack: - When we touch more than 2 ranges in the reordering done if tcp_sack_extend() could be done. - If we have at least 2 ranges when adding a new one. This enforces that before a range is in third or fourth position, at least one ACK packet included it in first/second position. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30tcp: add tp->dup_ack_counterEric Dumazet3-13/+37
In commit 86de5921a3d5 ("tcp: defer SACK compression after DupThresh") I added a TCP_FASTRETRANS_THRESH bias to tp->compressed_ack in order to enable sack compression only after 3 dupacks. Since we plan to relax this rule for flows that involve stacks not requiring this old rule, this patch adds a distinct tp->dup_ack_counter. This means the TCP_FASTRETRANS_THRESH value is now used in a single location that a future patch can adjust: if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) { tp->dup_ack_counter++; goto send_now; } This patch also introduces tcp_sack_compress_send_ack() helper to ease following patch comprehension. This patch refines LINUX_MIB_TCPACKCOMPRESSED to not count the acks that we had to send if the timer expires or tcp_sack_compress_send_ack() is sending an ack. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30inet_diag: add support for cgroup filterDmitry Yakunin1-0/+31
This patch adds ability to filter sockets based on cgroup v2 ID. Such filter is helpful in ss utility for filtering sockets by cgroup pathname. Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30inet_diag: add cgroup id attributeDmitry Yakunin1-0/+7
This patch adds cgroup v2 ID to common inet diag message attributes. Cgroup v2 ID is kernfs ID (ino or ino+gen). This attribute allows filter inet diag output by cgroup ID obtained by name_to_handle_at() syscall. When net_cls or net_prio cgroup is activated this ID is equal to 1 (root cgroup ID) for newly created sockets. Some notes about this ID: 1) gets initialized in socket() syscall 2) incoming socket gets ID from listening socket (not during accept() syscall) 3) not changed when process get moved to another cgroup 4) can point to deleted cgroup (refcounting) v2: - use CONFIG_SOCK_CGROUP_DATA instead if CONFIG_CGROUPS v3: - fix attr size by using nla_total_size_64bit() (Eric Dumazet) - more detailed commit message (Konstantin Khlebnikov) Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-By: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29docs: networking: convert ip-sysctl.txt to ReSTMauro Carvalho Chehab2-2/+2
- add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - mark lists as such; - mark tables as such; - use footnote markup; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28net: ipv4: add sysctl for nexthop api compatibility modeRoopa Prabhu4-2/+16
Current route nexthop API maintains user space compatibility with old route API by default. Dumps and netlink notifications support both new and old API format. In systems which have moved to the new API, this compatibility mode cancels some of the performance benefits provided by the new nexthop API. This patch adds new sysctl nexthop_compat_mode which is on by default but provides the ability to turn off compatibility mode allowing systems to run entirely with the new routing API. Old route API behaviour and support is not modified by this sysctl. Uses a single sysctl to cover both ipv4 and ipv6 following other sysctls. Covers dumps and delete notifications as suggested by David Ahern. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28net: ipv6: new arg skip_notify to ip6_rt_delRoopa Prabhu1-1/+1
Used in subsequent work to skip route delete notifications on nexthop deletes. Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-26tcp: mptcp: use mptcp receive buffer space to select rcv windowFlorian Westphal1-2/+6
In MPTCP, the receive window is shared across all subflows, because it refers to the mptcp-level sequence space. MPTCP receivers already place incoming packets on the mptcp socket receive queue and will charge it to the mptcp socket rcvbuf until userspace consumes the data. Update __tcp_select_window to use the occupancy of the parent/mptcp socket instead of the subflow socket in case the tcp socket is part of a logical mptcp connection. This commit doesn't change choice of initial window for passive or active connections. While it would be possible to change those as well, this adds complexity (especially when handling MP_JOIN requests). Furthermore, the MPTCP RFC specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field of a TCP segment at the connection level if it does not also carry a DSS option with a Data ACK field.' SYN/SYNACK packets do not carry a DSS option with a Data ACK field. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2-5/+3
Simple overlapping changes to linux/vermagic.h Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23ipv4: Update fib_select_default to handle nexthop objectsDavid Ahern1-3/+3
A user reported [0] hitting the WARN_ON in fib_info_nh: [ 8633.839816] ------------[ cut here ]------------ [ 8633.839819] WARNING: CPU: 0 PID: 1719 at include/net/nexthop.h:251 fib_select_path+0x303/0x381 ... [ 8633.839846] RIP: 0010:fib_select_path+0x303/0x381 ... [ 8633.839848] RSP: 0018:ffffb04d407f7d00 EFLAGS: 00010286 [ 8633.839850] RAX: 0000000000000000 RBX: ffff9460b9897ee8 RCX: 00000000000000fe [ 8633.839851] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000000 [ 8633.839852] RBP: ffff946076049850 R08: 0000000059263a83 R09: ffff9460840e4000 [ 8633.839853] R10: 0000000000000014 R11: 0000000000000000 R12: ffffb04d407f7dc0 [ 8633.839854] R13: ffffffffa4ce3240 R14: 0000000000000000 R15: ffff9460b7681f60 [ 8633.839857] FS: 00007fcac2e02700(0000) GS:ffff9460bdc00000(0000) knlGS:0000000000000000 [ 8633.839858] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8633.839859] CR2: 00007f27beb77e28 CR3: 0000000077734000 CR4: 00000000000006f0 [ 8633.839867] Call Trace: [ 8633.839871] ip_route_output_key_hash_rcu+0x421/0x890 [ 8633.839873] ip_route_output_key_hash+0x5e/0x80 [ 8633.839876] ip_route_output_flow+0x1a/0x50 [ 8633.839878] __ip4_datagram_connect+0x154/0x310 [ 8633.839880] ip4_datagram_connect+0x28/0x40 [ 8633.839882] __sys_connect+0xd6/0x100 ... The WARN_ON is triggered in fib_select_default which is invoked when there are multiple default routes. Update the function to use fib_info_nhc and convert the nexthop checks to use fib_nh_common. Add test case that covers the affected code path. [0] https://github.com/FRRouting/frr/issues/6089 Fixes: 493ced1ac47c ("ipv4: Allow routes to use nexthop objects") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22xfrm: Always set XFRM_TRANSFORMED in xfrm{4,6}_output_finishDavid Ahern1-2/+0
IPSKB_XFRM_TRANSFORMED and IP6SKB_XFRM_TRANSFORMED are skb flags set by xfrm code to tell other skb handlers that the packet has been passed through the xfrm output functions. Simplify the code and just always set them rather than conditionally based on netfilter enabled thus making the flag available for other users. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-21net: ipv4: remove redundant assignment to variable rcColin Ian King1-1/+1
The variable rc is being assigned with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09net: ipv4: devinet: Fix crash when add/del multicast IP with autojoinTaras Chornyi1-4/+9
When CONFIG_IP_MULTICAST is not set and multicast ip is added to the device with autojoin flag or when multicast ip is deleted kernel will crash. steps to reproduce: ip addr add 224.0.0.0/32 dev eth0 ip addr del 224.0.0.0/32 dev eth0 or ip addr add 224.0.0.0/32 dev eth0 autojoin Unable to handle kernel NULL pointer dereference at virtual address 0000000000000088 pc : _raw_write_lock_irqsave+0x1e0/0x2ac lr : lock_sock_nested+0x1c/0x60 Call trace: _raw_write_lock_irqsave+0x1e0/0x2ac lock_sock_nested+0x1c/0x60 ip_mc_config.isra.28+0x50/0xe0 inet_rtm_deladdr+0x1a8/0x1f0 rtnetlink_rcv_msg+0x120/0x350 netlink_rcv_skb+0x58/0x120 rtnetlink_rcv+0x14/0x20 netlink_unicast+0x1b8/0x270 netlink_sendmsg+0x1a0/0x3b0 ____sys_sendmsg+0x248/0x290 ___sys_sendmsg+0x80/0xc0 __sys_sendmsg+0x68/0xc0 __arm64_sys_sendmsg+0x20/0x30 el0_svc_common.constprop.2+0x88/0x150 do_el0_svc+0x20/0x80 el0_sync_handler+0x118/0x190 el0_sync+0x140/0x180 Fixes: 93a714d6b53d ("multicast: Extend ip address command to enable multicast group join/leave on") Signed-off-by: Taras Chornyi <taras.chornyi@plvision.eu> Signed-off-by: Vadym Kochan <vadym.kochan@plvision.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds45-515/+705
Pull networking updates from David Miller: "Highlights: 1) Fix the iwlwifi regression, from Johannes Berg. 2) Support BSS coloring and 802.11 encapsulation offloading in hardware, from John Crispin. 3) Fix some potential Spectre issues in qtnfmac, from Sergey Matyukevich. 4) Add TTL decrement action to openvswitch, from Matteo Croce. 5) Allow paralleization through flow_action setup by not taking the RTNL mutex, from Vlad Buslov. 6) A lot of zero-length array to flexible-array conversions, from Gustavo A. R. Silva. 7) Align XDP statistics names across several drivers for consistency, from Lorenzo Bianconi. 8) Add various pieces of infrastructure for offloading conntrack, and make use of it in mlx5 driver, from Paul Blakey. 9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki. 10) Lots of parallelization improvements during configuration changes in mlxsw driver, from Ido Schimmel. 11) Add support to devlink for generic packet traps, which report packets dropped during ACL processing. And use them in mlxsw driver. From Jiri Pirko. 12) Support bcmgenet on ACPI, from Jeremy Linton. 13) Make BPF compatible with RT, from Thomas Gleixnet, Alexei Starovoitov, and your's truly. 14) Support XDP meta-data in virtio_net, from Yuya Kusakabe. 15) Fix sysfs permissions when network devices change namespaces, from Christian Brauner. 16) Add a flags element to ethtool_ops so that drivers can more simply indicate which coalescing parameters they actually support, and therefore the generic layer can validate the user's ethtool request. Use this in all drivers, from Jakub Kicinski. 17) Offload FIFO qdisc in mlxsw, from Petr Machata. 18) Support UDP sockets in sockmap, from Lorenz Bauer. 19) Fix stretch ACK bugs in several TCP congestion control modules, from Pengcheng Yang. 20) Support virtual functiosn in octeontx2 driver, from Tomasz Duszynski. 21) Add region operations for devlink and use it in ice driver to dump NVM contents, from Jacob Keller. 22) Add support for hw offload of MACSEC, from Antoine Tenart. 23) Add support for BPF programs that can be attached to LSM hooks, from KP Singh. 24) Support for multiple paths, path managers, and counters in MPTCP. From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti, and others. 25) More progress on adding the netlink interface to ethtool, from Michal Kubecek" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits) net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline cxgb4/chcr: nic-tls stats in ethtool net: dsa: fix oops while probing Marvell DSA switches net/bpfilter: remove superfluous testing message net: macb: Fix handling of fixed-link node net: dsa: ksz: Select KSZ protocol tag netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write net: stmmac: add EHL 2.5Gbps PCI info and PCI ID net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID net: stmmac: create dwmac-intel.c to contain all Intel platform net: dsa: bcm_sf2: Support specifying VLAN tag egress rule net: dsa: bcm_sf2: Add support for matching VLAN TCI net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT net: dsa: bcm_sf2: Disable learning for ASP port net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge net: dsa: b53: Prevent tagged VLAN on port 7 for 7278 net: dsa: b53: Restore VLAN entries upon (re)configuration net: dsa: bcm_sf2: Fix overflow checks hv_netvsc: Remove unnecessary round_up for recv_completion_cnt ...
2020-03-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller3-5/+5
2020-03-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller4-79/+115
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net: Track socket refcounts in skb_steal_sock()Joe Stringer1-2/+4
Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make the determination of whether the socket is reference counted in the case where it is prefetched by earlier logic such as early_demux. Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-3-joe@wand.net.nz
2020-03-30bpf: Add socket assign supportJoe Stringer1-1/+2
Add support for TPROXY via a new bpf helper, bpf_sk_assign(). This helper requires the BPF program to discover the socket via a call to bpf_sk*_lookup_*(), then pass this socket to the new helper. The helper takes its own reference to the socket in addition to any existing reference that may or may not currently be obtained for the duration of BPF processing. For the destination socket to receive the traffic, the traffic must be routed towards that socket via local route. The simplest example route is below, but in practice you may want to route traffic more narrowly (eg by CIDR): $ ip route add local default dev lo This patch avoids trying to introduce an extra bit into the skb->sk, as that would require more invasive changes to all code interacting with the socket to ensure that the bit is handled correctly, such as all error-handling cases along the path from the helper in BPF through to the orphan path in the input. Instead, we opt to use the destructor variable to switch on the prefetch of the socket. Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
2020-03-30Merge tag 'docs-5.7' of git://git.lwn.net/linuxLinus Torvalds2-4/+4
Pull documentation updates from Jonathan Corbet: "This has been a busy cycle for documentation work. Highlights include: - Lots of RST conversion work by Mauro, Daniel ALmeida, and others. Maybe someday we'll get to the end of this stuff...maybe... - Some organizational work to bring some order to the core-api manual. - Various new docs and additions to the existing documentation. - Typo fixes, warning fixes, ..." * tag 'docs-5.7' of git://git.lwn.net/linux: (123 commits) Documentation: x86: exception-tables: document CONFIG_BUILDTIME_TABLE_SORT MAINTAINERS: adjust to filesystem doc ReST conversion docs: deprecated.rst: Add BUG()-family doc: zh_CN: add translation for virtiofs doc: zh_CN: index files in filesystems subdirectory docs: locking: Drop :c:func: throughout docs: locking: Add 'need' to hardirq section docs: conf.py: avoid thousands of duplicate label warning on Sphinx docs: prevent warnings due to autosectionlabel docs: fix reference to core-api/namespaces.rst docs: fix pointers to io-mapping.rst and io_ordering.rst files Documentation: Better document the softlockup_panic sysctl docs: hw-vuln: tsx_async_abort.rst: get rid of an unused ref docs: perf: imx-ddr.rst: get rid of a warning docs: filesystems: fuse.rst: supress a Sphinx warning docs: translations: it: avoid duplicate refs at programming-language.rst docs: driver.rst: supress two ReSt warnings docs: trace: events.rst: convert some new stuff to ReST format Documentation: Add io_ordering.rst to driver-api manual Documentation: Add io-mapping.rst to driver-api manual ...
2020-03-30Merge branch 'master' of ↵David S. Miller2-16/+32
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2020-03-28 1) Use kmem_cache_zalloc() instead of kmem_cache_alloc() in xfrm_state_alloc(). From Huang Zijiang. 2) esp_output_fill_trailer() is the same in IPv4 and IPv6, so share this function to avoide code duplcation. From Raed Salem. 3) Add offload support for esp beet mode. From Xin Long. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30udp: initialize is_flist with 0 in udp_gro_receiveXin Long1-0/+1
Without NAPI_GRO_CB(skb)->is_flist initialized, when the dev doesn't support NETIF_F_GRO_FRAGLIST, is_flist can still be set and fraglist will be used in udp_gro_receive(). So fix it by initializing is_flist with 0 in udp_gro_receive. Fixes: 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net: add net available in build_stateAlexander Aring4-14/+16
The build_state callback of lwtunnel doesn't contain the net namespace structure yet. This patch will add it so we can check on specific address configuration at creation time of rpl source routes. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net, ip_tunnel: fix interface lookup with no keyWilliam Dauchy1-5/+1
when creating a new ipip interface with no local/remote configuration, the lookup is done with TUNNEL_NO_KEY flag, making it impossible to match the new interface (only possible match being fallback or metada case interface); e.g: `ip link add tunl1 type ipip dev eth0` To fix this case, adding a flag check before the key comparison so we permit to match an interface with no local/remote config; it also avoids breaking possible userland tools relying on TUNNEL_NO_KEY flag and uninitialised key. context being on my side, I'm creating an extra ipip interface attached to the physical one, and moving it to a dedicated namespace. Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.") Signed-off-by: William Dauchy <w.dauchy@criteo.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30mptcp: add and use MIB counter infrastructureFlorian Westphal2-0/+6
Exported via same /proc file as the Linux TCP MIB counters, so "netstat -s" or "nstat" will show them automatically. The MPTCP MIB counters are allocated in a distinct pcpu area in order to avoid bloating/wasting TCP pcpu memory. Counters are allocated once the first MPTCP socket is created in a network namespace and free'd on exit. If no sockets have been allocated, all-zero mptcp counters are shown. The MIB counter list is taken from the multipath-tcp.org kernel, but only a few counters have been picked up so far. The counter list can be increased at any time later on. v2 -> v3: - remove 'inline' in foo.c files (David S. Miller) Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30mptcp: Add handling of incoming MP_JOIN requestsPeter Krystad1-0/+6
Process the MP_JOIN option in a SYN packet with the same flow as MP_CAPABLE but when the third ACK is received add the subflow to the MPTCP socket subflow list instead of adding it to the TCP socket accept queue. The subflow is added at the end of the subflow list so it will not interfere with the existing subflows operation and no data is expected to be transmitted on it. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net: Fix typo of SKB_SGO_CB_OFFSETCambda Zhu2-2/+2
The SKB_SGO_CB_OFFSET should be SKB_GSO_CB_OFFSET which means the offset of the GSO in skb cb. This patch fixes the typo. Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation") Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30ipv4: fix a RCU-list lock in fib_triestat_seq_showQian Cai1-0/+3
fib_triestat_seq_show() calls hlist_for_each_entry_rcu(tb, head, tb_hlist) without rcu_read_lock() will trigger a warning, net/ipv4/fib_trie.c:2579 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by proc01/115277: #0: c0000014507acf00 (&p->lock){+.+.}-{3:3}, at: seq_read+0x58/0x670 Call Trace: dump_stack+0xf4/0x164 (unreliable) lockdep_rcu_suspicious+0x140/0x164 fib_triestat_seq_show+0x750/0x880 seq_read+0x1a0/0x670 proc_reg_read+0x10c/0x1b0 __vfs_read+0x3c/0x70 vfs_read+0xac/0x170 ksys_read+0x7c/0x140 system_call+0x5c/0x68 Fix it by adding a pair of rcu_read_lock/unlock() and use cond_resched_rcu() to avoid the situation where walking of a large number of items may prevent scheduling for a long time. Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-13/+33
Minor comment conflict in mac80211. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-5/+2
Daniel Borkmann says: ==================== pull-request: bpf 2020-03-27 The following pull-request contains BPF updates for your *net* tree. We've added 3 non-merge commits during the last 4 day(s) which contain a total of 4 files changed, 25 insertions(+), 20 deletions(-). The main changes are: 1) Explicitly memset the bpf_attr structure on bpf() syscall to avoid having to rely on compiler to do so. Issues have been noticed on some compilers with padding and other oddities where the request was then unexpectedly rejected, from Greg Kroah-Hartman. 2) Sanitize the bpf_struct_ops TCP congestion control name in order to avoid problematic characters such as whitespaces, from Martin KaFai Lau. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-28Merge branch 'master' of ↵David S. Miller2-8/+31
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2020-03-27 1) Handle NETDEV_UNREGISTER for xfrm device to handle asynchronous unregister events cleanly. From Raed Salem. 2) Fix vti6 tunnel inter address family TX through bpf_redirect(). From Nicolas Dichtel. 3) Fix lenght check in verify_sec_ctx_len() to avoid a slab-out-of-bounds. From Xin Long. 4) Add a missing verify_sec_ctx_len check in xfrm_add_acquire to avoid a possible out-of-bounds to access. From Xin Long. 5) Use built-in RCU list checking of hlist_for_each_entry_rcu to silence false lockdep warning in __xfrm6_tunnel_spi_lookup when CONFIG_PROVE_RCU_LIST is enabled. From Madhuparna Bhowmik. 6) Fix a panic on esp offload when crypto is done asynchronously. From Xin Long. 7) Fix a skb memory leak in an error path of vti6_rcv. From Torsten Hilbrich. 8) Fix a race that can lead to a doulbe free in xfrm_policy_timer. From Xin Long. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26esp4: add gso_segment for esp4 beet modeXin Long1-0/+32
Similar to xfrm4_tunnel/transport_gso_segment(), _gso_segment() is added to do gso_segment for esp4 beet mode. Before calling inet_offloads[proto]->callbacks.gso_segment, it needs to do: - Get the upper proto from ph header to get its gso_segment when xo->proto is IPPROTO_BEETPH. - Add SKB_GSO_TCPV4 to gso_type if x->sel.family == AF_INET6 and the proto == IPPROTO_TCP, so that the current tcp ipv4 packet can be segmented. - Calculate a right value for skb->transport_header and move skb->data to the transport header position. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-03-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller4-21/+102
Overlapping header include additions in macsec.c A bug fix in 'net' overlapping with the removal of 'version' string in ena_netdev.c Overlapping test additions in selftests Makefile Overlapping PCI ID table adjustments in iwlwifi driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24Remove DST_HOSTDavid Laight1-8/+5
Previous changes to the IP routing code have removed all the tests for the DS_HOST route flag. Remove the flags and all the code that sets it. Signed-off-by: David Laight <david.laight@aculab.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24ipv4: fix a RCU-list lock in inet_dump_fib()Qian Cai1-0/+2
There is a place, inet_dump_fib() fib_table_dump fn_trie_dump_leaf() hlist_for_each_entry_rcu() without rcu_read_lock() will trigger a warning, WARNING: suspicious RCU usage ----------------------------- net/ipv4/fib_trie.c:2216 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by ip/1923: #0: ffffffff8ce76e40 (rtnl_mutex){+.+.}, at: netlink_dump+0xd6/0x840 Call Trace: dump_stack+0xa1/0xea lockdep_rcu_suspicious+0x103/0x10d fn_trie_dump_leaf+0x581/0x590 fib_table_dump+0x15f/0x220 inet_dump_fib+0x4ad/0x5d0 netlink_dump+0x350/0x840 __netlink_dump_start+0x315/0x3e0 rtnetlink_rcv_msg+0x4d1/0x720 netlink_rcv_skb+0xf0/0x220 rtnetlink_rcv+0x15/0x20 netlink_unicast+0x306/0x460 netlink_sendmsg+0x44b/0x770 __sys_sendto+0x259/0x270 __x64_sys_sendto+0x80/0xa0 do_syscall_64+0x69/0xf4 entry_SYSCALL_64_after_hwframe+0x49/0xb3 Fixes: 18a8021a7be3 ("net/ipv4: Plumb support for filtering route dumps") Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23tcp: repair: fix TCP_QUEUE_SEQ implementationEric Dumazet1-1/+3
When application uses TCP_QUEUE_SEQ socket option to change tp->rcv_next, we must also update tp->copied_seq. Otherwise, stuff relying on tcp_inq() being precise can eventually be confused. For example, tcp_zerocopy_receive() might crash because it does not expect tcp_recv_skb() to return NULL. We could add tests in various places to fix the issue, or simply make sure tcp_inq() wont return a random value, and leave fast path as it is. Note that this fixes ioctl(fd, SIOCINQ, &val) at the same time. Fixes: ee9952831cfd ("tcp: Initial repair mode") Fixes: 05255b823a61 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23bpf: Add bpf_sk_storage support to bpf_tcp_caMartin KaFai Lau1-0/+33
This patch adds bpf_sk_storage_get() and bpf_sk_storage_delete() helper to the bpf_tcp_ca's struct_ops. That would allow bpf-tcp-cc to: 1) share sk private data with other bpf progs. 2) use bpf_sk_storage as a private storage for a bpf-tcp-cc if the existing icsk_ca_priv is not big enough. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320152101.2169498-1-kafai@fb.com
2020-03-21tcp: also NULL skb->dev when copy was neededFlorian Westphal1-2/+6
In rare cases retransmit logic will make a full skb copy, which will not trigger the zeroing added in recent change b738a185beaa ("tcp: ensure skb->dev is NULL before leaving TCP stack"). Cc: Eric Dumazet <edumazet@google.com> Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue") Fixes: 28f8bfd1ac94 ("netfilter: Support iif matches in POSTROUTING") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-20bpf, tcp: Make tcp_bpf_recvmsg staticYueHaibing1-62/+62
After commit f747632b608f ("bpf: sockmap: Move generic sockmap hooks from BPF TCP"), tcp_bpf_recvmsg() is not used out of tcp_bpf.c, so make it static and remove it from tcp.h. Also move it to BPF_STREAM_PARSER #ifdef to fix unused function warnings. Fixes: f747632b608f ("bpf: sockmap: Move generic sockmap hooks from BPF TCP") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320023426.60684-3-yuehaibing@huawei.com
2020-03-20bpf, tcp: Fix unused function warningsYueHaibing1-14/+14
If BPF_STREAM_PARSER is not set, gcc warns: net/ipv4/tcp_bpf.c:483:12: warning: 'tcp_bpf_sendpage' defined but not used [-Wunused-function] net/ipv4/tcp_bpf.c:395:12: warning: 'tcp_bpf_sendmsg' defined but not used [-Wunused-function] net/ipv4/tcp_bpf.c:13:13: warning: 'tcp_bpf_stream_read' defined but not used [-Wunused-function] Moves the unused functions into the #ifdef CONFIG_BPF_STREAM_PARSER. Fixes: f747632b608f ("bpf: sockmap: Move generic sockmap hooks from BPF TCP") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320023426.60684-2-yuehaibing@huawei.com
2020-03-20tcp: ensure skb->dev is NULL before leaving TCP stackEric Dumazet1-0/+4
skb->rbnode is sharing three skb fields : next, prev, dev When a packet is sent, TCP keeps the original skb (master) in a rtx queue, which was converted to rbtree a while back. __tcp_transmit_skb() is responsible to clone the master skb, and add the TCP header to the clone before sending it to network layer. skb_clone() already clears skb->next and skb->prev, but copies the master oskb->dev into the clone. We need to clear skb->dev, otherwise lower layers could interpret the value as a pointer to a netdev. This old bug surfaced recently when commit 28f8bfd1ac94 ("netfilter: Support iif matches in POSTROUTING") was merged. Before this netfilter commit, skb->dev value was ignored and changed before reaching dev_queue_xmit() Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue") Fixes: 28f8bfd1ac94 ("netfilter: Support iif matches in POSTROUTING") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Martin Zaharinov <micron10@gmail.com> Cc: Florian Westphal <fw@strlen.de> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2-4/+4
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Use nf_flow_offload_tuple() to fetch flow stats, from Paul Blakey. 2) Add new xt_IDLETIMER hard mode, from Manoj Basapathi. Follow up patch to clean up this new mode, from Dan Carpenter. 3) Add support for geneve tunnel options, from Xin Long. 4) Make sets built-in and remove modular infrastructure for sets, from Florian Westphal. 5) Remove unused TEMPLATE_NULLS_VAL, from Li RongQing. 6) Statify nft_pipapo_get, from Chen Wandun. 7) Use C99 flexible-array member, from Gustavo A. R. Silva. 8) More descriptive variable names for bitwise, from Jeremy Sowden. 9) Four patches to add tunnel device hardware offload to the flowtable infrastructure, from wenxu. 10) pipapo set supports for 8-bit grouping, from Stefano Brivio. 11) pipapo can switch between nibble and byte grouping, also from Stefano. 12) Add AVX2 vectorized version of pipapo, from Stefano Brivio. 13) Update pipapo to be use it for single ranges, from Stefano. 14) Add stateful expression support to elements via control plane, eg. counter per element. 15) Re-visit sysctls in unprivileged namespaces, from Florian Westphal. 15) Add new egress hook, from Lukas Wunner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17bpf: Sanitize the bpf_struct_ops tcp-cc nameMartin KaFai Lau1-5/+2
The bpf_struct_ops tcp-cc name should be sanitized in order to avoid problematic chars (e.g. whitespaces). This patch reuses the bpf_obj_name_cpy() for accepting the same set of characters in order to keep a consistent bpf programming experience. A "size" param is added. Also, the strlen is returned on success so that the caller (like the bpf_tcp_ca here) can error out on empty name. The existing callers of the bpf_obj_name_cpy() only need to change the testing statement to "if (err < 0)". For all these existing callers, the err will be overwritten later, so no extra change is needed for the new strlen return value. v3: - reverse xmas tree style v2: - Save the orig_src to avoid "end - size" (Andrii) Fixes: 0baf26b0fcd7 ("bpf: tcp: Support tcp_congestion_ops in bpf") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200314010209.1131542-1-kafai@fb.com
2020-03-17tcp: fix stretch ACK bugs in YeahPengcheng Yang1-30/+11
Change Yeah to properly handle stretch ACKs in additive increase mode by passing in the count of ACKed packets to tcp_cong_avoid_ai(). In addition, we re-implemented the scalable path using tcp_cong_avoid_ai() and removed the pkts_acked variable. Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17tcp: fix stretch ACK bugs in VenoPengcheng Yang1-4/+5
Change Veno to properly handle stretch ACKs in additive increase mode by passing in the count of ACKed packets to tcp_cong_avoid_ai(). Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17tcp: stretch ACK fixes in Veno prepPengcheng Yang1-21/+23
No code logic has been changed in this patch. Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17tcp: fix stretch ACK bugs in ScalablePengcheng Yang1-8/+9
Change Scalable to properly handle stretch ACKs in additive increase mode by passing in the count of ACKed packets to tcp_cong_avoid_ai(). In addition, because we are now precisely accounting for stretch ACKs, including delayed ACKs, we can now change TCP_SCALABLE_AI_CNT to 100. Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17tcp: fix stretch ACK bugs in BICPengcheng Yang1-5/+6
Changes BIC to properly handle stretch ACKs in additive increase mode by passing in the count of ACKed packets to tcp_cong_avoid_ai(). Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: ip_gre: Accept IFLA_INFO_DATA-less configurationPetr Machata1-0/+2
The fix referenced below causes a crash when an ERSPAN tunnel is created without passing IFLA_INFO_DATA. Fix by validating passed-in data in the same way as ipgre does. Fixes: e1f8f78ffe98 ("net: ip_gre: Separate ERSPAN newlink / changelink callbacks") Reported-by: syzbot+1b4ebf4dae4e510dd219@syzkaller.appspotmail.com Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15netfilter: Replace zero-length array with flexible-array memberGustavo A. R. Silva2-4/+4
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] Lastly, fix checkpatch.pl warning WARNING: __aligned(size) is preferred over __attribute__((aligned(size))) in net/bridge/netfilter/ebtables.c This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>