summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)AuthorFilesLines
2022-04-22ipv4: Don't reset ->flowi4_scope in ip_rt_fix_tos().Guillaume Nault1-2/+2
All callers already initialise ->flowi4_scope with RT_SCOPE_UNIVERSE, either by manual field assignment, memset(0) of the whole structure or implicit structure initialisation of on-stack variables (RT_SCOPE_UNIVERSE actually equals 0). Therefore, we don't need to always initialise ->flowi4_scope in ip_rt_fix_tos(). We only need to reduce the scope to RT_SCOPE_LINK when the special RTO_ONLINK flag is present in the tos. This will allow some code simplification, like removing ip_rt_fix_tos(). Also, the long term idea is to remove RTO_ONLINK entirely by properly initialising ->flowi4_scope, instead of overloading ->flowi4_tos with a special flag. Eventually, this will allow to convert ->flowi4_tos to dscp_t. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni3-9/+9
drivers/net/ethernet/microchip/lan966x/lan966x_main.c d08ed852560e ("net: lan966x: Make sure to release ptp interrupt") c8349639324a ("net: lan966x: Add FDMA functionality") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-18tcp: fix signed/unsigned comparisonEric Dumazet1-2/+3
Kernel test robot reported: smatch warnings: net/ipv4/tcp_input.c:5966 tcp_rcv_established() warn: unsigned 'reason' is never less than zero. I actually had one packetdrill failing because of this bug, and was about to send the fix :) v2: Andreas Schwab also pointed out that @reason needs to be negated before we reach tcp_drop_reason() Fixes: 4b506af9c5b8 ("tcp: add two drop reasons for tcp_ack()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reason support to tcp_ofo_queue()Eric Dumazet1-7/+2
packets in OFO queue might be redundant, and dropped. tcp_drop() is no longer needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reasons to tcp_rcv_synsent_state_process()Eric Dumazet1-6/+9
Re-use existing reasons. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: make tcp_rcv_synsent_state_process() drop monitor friendEric Dumazet1-9/+8
1) A valid RST packet should be consumed, to not confuse drop monitor. 2) Same remark for packet validating cross syn setup, even if we might ignore part of it. 3) When third packet of 3WHS is delayed, do not pretend the SYNACK was dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reason support to tcp_prune_ofo_queue()Eric Dumazet1-1/+2
Add one reason for packets dropped from OFO queue because of memory pressure. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add two drop reasons for tcp_ack()Eric Dumazet1-3/+4
Add TCP_TOO_OLD_ACK and TCP_ACK_UNSENT_DATA drop reasons so that tcp_rcv_established() can report them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reasons to tcp_rcv_state_process()Eric Dumazet1-7/+17
Add basic support for drop reasons in tcp_rcv_state_process() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: make tcp_rcv_state_process() drop monitor friendlyEric Dumazet1-3/+7
tcp_rcv_state_process() incorrectly drops packets instead of consuming it, making drop monitor very noisy, if not unusable. Calling tcp_time_wait() or tcp_done() is part of standard behavior, packets triggering these actions were not dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reason support to tcp_validate_incoming()Eric Dumazet1-1/+6
Creates four new drop reasons for the following cases: 1) packet being rejected by RFC 7323 PAWS check 2) packet being rejected by SEQUENCE check 3) Invalid RST packet 4) Invalid SYN packet Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: get rid of rst_seq_matchEric Dumazet1-8/+5
Small cleanup in tcp_validate_incoming(), no need for rst_seq_match setting and testing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: consume incoming skb leading to a resetEric Dumazet1-12/+16
Whenever tcp_validate_incoming() handles a valid RST packet, we should not pretend the packet was dropped. Create a special section at the end of tcp_validate_incoming() to handle this case. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-16net: Handle l3mdev in ip_tunnel_init_flowDavid Ahern2-6/+7
Ido reported that the commit referenced in the Fixes tag broke a gre use case with dummy devices. Add a check to ip_tunnel_init_flow to see if the oif is an l3mdev port and if so set the oif to 0 to avoid the oif comparison in fib_lookup_good_nhc. Fixes: 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices") Reported-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-13net: ip: add skb drop reasons to ip forwardingMenglong Dong1-3/+10
Replace kfree_skb() which is used in ip6_forward() and ip_forward() with kfree_skb_reason(). The new drop reason 'SKB_DROP_REASON_PKT_TOO_BIG' is introduced for the case that the length of the packet exceeds MTU and can't fragment. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-13net: ipv4: add skb drop reasons to ip_error()Menglong Dong1-1/+5
Eventually, I find out the handler function for inputting route lookup fail: ip_error(). The drop reasons we used in ip_error() are almost corresponding to IPSTATS_MIB_*, and following new reasons are introduced: SKB_DROP_REASON_IP_INADDRERRORS SKB_DROP_REASON_IP_INNOROUTES Isn't the name SKB_DROP_REASON_IP_HOSTUNREACH and SKB_DROP_REASON_IP_NETUNREACH more accurate? To make them corresponding to IPSTATS_MIB_*, we keep their name still. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-13esp: limit skb_page_frag_refill use to a single pageSabrina Dubroca1-3/+2
Commit ebe48d368e97 ("esp: Fix possible buffer overflow in ESP transformation") tried to fix skb_page_frag_refill usage in ESP by capping allocsize to 32k, but that doesn't completely solve the issue, as skb_page_frag_refill may return a single page. If that happens, we will write out of bounds, despite the check introduced in the previous patch. This patch forces COW in cases where we would end up calling skb_page_frag_refill with a size larger than a page (first in esp_output_head with tailen, then in esp_output_tail with skb->data_len). Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-04-13fou: Remove XRFM from NET_FOU KconfigCoco Li2-2/+0
XRFM is no longer needed for configuring FOU tunnels (CONFIG_NET_FOU_IP_TUNNELS), remove from Kconfig. Also remove the xrfm.h dependency in fou.c. It was added in '23461551c006 ("fou: Support for foo-over-udp RX path")' for depencies of udp_del_offload and udp_offloads, which were removed in 'd92283e338f6 ("fou: change to use UDP socket GRO")'. Built and installed kernel and setup GUE/FOU tunnels. Signed-off-by: Coco Li <lixiaoyan@google.com> Link: https://lore.kernel.org/r/20220411213717.3688789-1-lixiaoyan@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-12net: remove noblock parameter from recvmsg() entitiesOliver Hartkopp8-44/+35
The internal recvmsg() functions have two parameters 'flags' and 'noblock' that were merged inside skb_recv_datagram(). As a follow up patch to commit f4b41f062c42 ("net: remove noblock parameter from skb_recv_datagram()") this patch removes the separate 'noblock' parameter for recvmsg(). Analogue to the referenced patch for skb_recv_datagram() the 'flags' and 'noblock' parameters are unnecessarily split up with e.g. err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); or in err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg, sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); instead of simply using only flags all the time and check for MSG_DONTWAIT where needed (to preserve for the formerly separated no(n)block condition). Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-12ipv4: Use dscp_t in struct fib_entry_notifier_infoGuillaume Nault1-2/+2
Use the new dscp_t type to replace the tos field of struct fib_entry_notifier_info. This ensures ECN bits are ignored and makes it compatible with the dscp field of struct fib_rt_info. This also allows sparse to flag potential incorrect uses of DSCP and ECN bits. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-12ipv4: Use dscp_t in struct fib_rt_infoGuillaume Nault3-7/+7
Use the new dscp_t type to replace the tos field of struct fib_rt_info. This ensures ECN bits are ignored and makes it compatible with the fa_dscp field of struct fib_alias. This also allows sparse to flag potential incorrect uses of DSCP and ECN bits. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextDavid S. Miller1-0/+4
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Replace unnecessary list_for_each_entry_continue() in nf_tables, from Jakob Koschel. 2) Add struct nf_conntrack_net_ecache to conntrack event cache and use it, from Florian Westphal. 3) Refactor ctnetlink_dump_list(), also from Florian. 4) Bump module reference counter on cttimeout object addition/removal, from Florian. 5) Consolidate nf_log MAC printer, from Phil Sutter. 6) Add basic logging support for unknown ethertype, from Phil Sutter. 7) Consolidate check for sysctl nf_log_all_netns toggle, also from Phil. 8) Replace hardcode value in nft_bitwise, from Jeremy Sowden. 9) Rename BASIC-like goto tags in nft_bitwise to more meaningful names, also from Jeremy. 10) nft_fib support for reverse path filtering with policy-based routing on iif. Extend selftests to cover for this new usecase, from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-11netfilter: nft_fib: reverse path filter for policy-based routing on iifPablo Neira Ayuso1-0/+4
If policy-based routing using the iif selector is used, then the fib expression fails to look up for the reverse path from the prerouting hook because the input interface cannot be inferred. In order to support this scenario, extend the fib expression to allow to use after the route lookup, from the forward hook. This patch also adds support for the input hook for usability reasons. Since the prerouting hook cannot be used for the scenario described above, users need two rules: one for the forward chain and another rule for the input chain to check for the reverse path check for locally targeted traffic. Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-11net: icmp: add skb drop reasons to icmp protocolMenglong Dong2-37/+52
Replace kfree_skb() used in icmp_rcv() and icmpv6_rcv() with kfree_skb_reason(). In order to get the reasons of the skb drops after icmp message handle, we change the return type of 'handler()' in 'struct icmp_control' from 'bool' to 'enum skb_drop_reason'. This may change its original intention, as 'false' means failure, but 'SKB_NOT_DROPPED_YET' means success now. Therefore, all 'handler' and the call of them need to be handled. Following 'handler' functions are involved: icmp_unreach() icmp_redirect() icmp_echo() icmp_timestamp() icmp_discard() And following new drop reasons are added: SKB_DROP_REASON_ICMP_CSUM SKB_DROP_REASON_INVALID_PROTO The reason 'INVALID_PROTO' is introduced for the case that the packet doesn't follow rfc 1122 and is dropped. This is not a common case, and I believe we can locate the problem from the data in the packet. For now, this 'INVALID_PROTO' is used for the icmp broadcasts with wrong types. Maybe there should be a document file for these reasons. For example, list all the case that causes the 'UNHANDLED_PROTO' and 'INVALID_PROTO' drop reason. Therefore, users can locate their problems according to the document. Reviewed-by: Hao Peng <flyingpeng@tencent.com> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-11net: icmp: introduce __ping_queue_rcv_skb() to report drop reasonsMenglong Dong1-5/+13
In order to avoid to change the return value of ping_queue_rcv_skb(), introduce the function __ping_queue_rcv_skb(), which is able to report the reasons of skb drop as its return value, as Paolo suggested. Meanwhile, make ping_queue_rcv_skb() a simple call to __ping_queue_rcv_skb(). The kfree_skb() and sock_queue_rcv_skb() used in ping_queue_rcv_skb() are replaced with kfree_skb_reason() and sock_queue_rcv_skb_reason() now. Reviewed-by: Hao Peng <flyingpeng@tencent.com> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+6
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-08tcp: Add tracepoint for tcp_set_ca_statePing Gan1-0/+12
The congestion status of a tcp flow may be updated since there is congestion between tcp sender and receiver. It makes sense to add tracepoint for congestion status set function to summate cc status duration and evaluate the performance of network and congestion algorithm. the backgound of this patch is below. Link: https://github.com/iovisor/bcc/pull/3899 Signed-off-by: Ping Gan <jacky_gam_2001@163.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220406010956.19656-1-jacky_gam_2001@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-08net-core: rx_otherhost_dropped to core_statsJeffrey Ji1-0/+1
Increment rx_otherhost_dropped counter when packet dropped due to mismatched dest MAC addr. An example when this drop can occur is when manually crafting raw packets that will be consumed by a user space application via a tap device. For testing purposes local traffic was generated using trafgen for the client and netcat to start a server Tested: Created 2 netns, sent 1 packet using trafgen from 1 to the other with "{eth(daddr=$INCORRECT_MAC...}", verified that iproute2 showed the counter was incremented. (Also had to modify iproute2 to show the stat, additional patch for that coming next.) Signed-off-by: Jeffrey Ji <jeffreyji@google.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220406172600.1141083-1-jeffreyjilinux@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-06tcp: add accessors to read/set tp->snd_cwndEric Dumazet23-185/+190
We had various bugs over the years with code breaking the assumption that tp->snd_cwnd is greater than zero. Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added in commit 8b8a321ff72c ("tcp: fix zero cwnd in tcp_cwnd_reduction") can trigger, and without a repro we would have to spend considerable time finding the bug. Instead of complaining too late, we want to catch where and when tp->snd_cwnd is set to an illegal value. Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-06net: remove noblock parameter from skb_recv_datagram()Oliver Hartkopp2-2/+4
skb_recv_datagram() has two parameters 'flags' and 'noblock' that are merged inside skb_recv_datagram() by 'flags | (noblock ? MSG_DONTWAIT : 0)' As 'flags' may contain MSG_DONTWAIT as value most callers split the 'flags' into 'flags' and 'noblock' with finally obsolete bit operations like this: skb_recv_datagram(sk, flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT, &rc); And this is not even done consistently with the 'flags' parameter. This patch removes the obsolete and costly splitting into two parameters and only performs bit operations when really needed on the caller side. One missing conversion thankfully reported by kernel test robot. I missed to enable kunit tests to build the mctp code. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-01net: ipv4: fix route with nexthop object delete warningNikolay Aleksandrov1-1/+6
FRR folks have hit a kernel warning[1] while deleting routes[2] which is caused by trying to delete a route pointing to a nexthop id without specifying nhid but matching on an interface. That is, a route is found but we hit a warning while matching it. The warning is from fib_info_nh() in include/net/nexthop.h because we run it on a fib_info with nexthop object. The call chain is: inet_rtm_delroute -> fib_table_delete -> fib_nh_match (called with a nexthop fib_info and also with fc_oif set thus calling fib_info_nh on the fib_info and triggering the warning). The fix is to not do any matching in that branch if the fi has a nexthop object because those are managed separately. I.e. we should match when deleting without nh spec and should fail when deleting a nexthop route with old-style nh spec because nexthop objects are managed separately, e.g.: $ ip r show 1.2.3.4/32 1.2.3.4 nhid 12 via 192.168.11.2 dev dummy0 $ ip r del 1.2.3.4/32 $ ip r del 1.2.3.4/32 nhid 12 <both should work> $ ip r del 1.2.3.4/32 dev dummy0 <should fail with ESRCH> [1] [ 523.462226] ------------[ cut here ]------------ [ 523.462230] WARNING: CPU: 14 PID: 22893 at include/net/nexthop.h:468 fib_nh_match+0x210/0x460 [ 523.462236] Modules linked in: dummy rpcsec_gss_krb5 xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_raw iptable_raw bpf_preload xt_statistic ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_mark nf_tables xt_nat veth nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay dm_crypt nfsv3 nfs fscache netfs vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack 8021q garp mrp ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bridge stp llc rfcomm snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core ip6table_filter xt_comment ip6_tables vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) qrtr bnep binfmt_misc xfs vfat fat squashfs loop nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(POE) nvidia(POE) intel_rapl_msr intel_rapl_common snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi btusb btrtl iwlmvm uvcvideo btbcm snd_hda_intel edac_mce_amd [ 523.462274] videobuf2_vmalloc videobuf2_memops btintel snd_intel_dspcfg videobuf2_v4l2 snd_intel_sdw_acpi bluetooth snd_usb_audio snd_hda_codec mac80211 snd_usbmidi_lib joydev snd_hda_core videobuf2_common kvm_amd snd_rawmidi snd_hwdep snd_seq videodev ccp snd_seq_device libarc4 ecdh_generic mc snd_pcm kvm iwlwifi snd_timer drm_kms_helper snd cfg80211 cec soundcore irqbypass rapl wmi_bmof i2c_piix4 rfkill k10temp pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc drm zram ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nvme sp5100_tco r8169 nvme_core wmi ipmi_devintf ipmi_msghandler fuse [ 523.462300] CPU: 14 PID: 22893 Comm: ip Tainted: P OE 5.16.18-200.fc35.x86_64 #1 [ 523.462302] Hardware name: Micro-Star International Co., Ltd. MS-7C37/MPG X570 GAMING EDGE WIFI (MS-7C37), BIOS 1.C0 10/29/2020 [ 523.462303] RIP: 0010:fib_nh_match+0x210/0x460 [ 523.462304] Code: 7c 24 20 48 8b b5 90 00 00 00 e8 bb ee f4 ff 48 8b 7c 24 20 41 89 c4 e8 ee eb f4 ff 45 85 e4 0f 85 2e fe ff ff e9 4c ff ff ff <0f> 0b e9 17 ff ff ff 3c 0a 0f 85 61 fe ff ff 48 8b b5 98 00 00 00 [ 523.462306] RSP: 0018:ffffaa53d4d87928 EFLAGS: 00010286 [ 523.462307] RAX: 0000000000000000 RBX: ffffaa53d4d87a90 RCX: ffffaa53d4d87bb0 [ 523.462308] RDX: ffff9e3d2ee6be80 RSI: ffffaa53d4d87a90 RDI: ffffffff920ed380 [ 523.462309] RBP: ffff9e3d2ee6be80 R08: 0000000000000064 R09: 0000000000000000 [ 523.462310] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000031 [ 523.462310] R13: 0000000000000020 R14: 0000000000000000 R15: ffff9e3d331054e0 [ 523.462311] FS: 00007f245517c1c0(0000) GS:ffff9e492ed80000(0000) knlGS:0000000000000000 [ 523.462313] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 523.462313] CR2: 000055e5dfdd8268 CR3: 00000003ef488000 CR4: 0000000000350ee0 [ 523.462315] Call Trace: [ 523.462316] <TASK> [ 523.462320] fib_table_delete+0x1a9/0x310 [ 523.462323] inet_rtm_delroute+0x93/0x110 [ 523.462325] rtnetlink_rcv_msg+0x133/0x370 [ 523.462327] ? _copy_to_iter+0xb5/0x6f0 [ 523.462330] ? rtnl_calcit.isra.0+0x110/0x110 [ 523.462331] netlink_rcv_skb+0x50/0xf0 [ 523.462334] netlink_unicast+0x211/0x330 [ 523.462336] netlink_sendmsg+0x23f/0x480 [ 523.462338] sock_sendmsg+0x5e/0x60 [ 523.462340] ____sys_sendmsg+0x22c/0x270 [ 523.462341] ? import_iovec+0x17/0x20 [ 523.462343] ? sendmsg_copy_msghdr+0x59/0x90 [ 523.462344] ? __mod_lruvec_page_state+0x85/0x110 [ 523.462348] ___sys_sendmsg+0x81/0xc0 [ 523.462350] ? netlink_seq_start+0x70/0x70 [ 523.462352] ? __dentry_kill+0x13a/0x180 [ 523.462354] ? __fput+0xff/0x250 [ 523.462356] __sys_sendmsg+0x49/0x80 [ 523.462358] do_syscall_64+0x3b/0x90 [ 523.462361] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 523.462364] RIP: 0033:0x7f24552aa337 [ 523.462365] Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10 [ 523.462366] RSP: 002b:00007fff7f05a838 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 523.462368] RAX: ffffffffffffffda RBX: 000000006245bf91 RCX: 00007f24552aa337 [ 523.462368] RDX: 0000000000000000 RSI: 00007fff7f05a8a0 RDI: 0000000000000003 [ 523.462369] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 [ 523.462370] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000001 [ 523.462370] R13: 00007fff7f05ce08 R14: 0000000000000000 R15: 000055e5dfdd1040 [ 523.462373] </TASK> [ 523.462374] ---[ end trace ba537bc16f6bf4ed ]--- [2] https://github.com/FRRouting/frr/issues/6412 Fixes: 4c7e8084fd46 ("ipv4: Plumb support for nexthop object in a fib_info") Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-5/+18
Merge in overtime fixes, no conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-22Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski1-6/+8
Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-03-21 v2 We've added 137 non-merge commits during the last 17 day(s) which contain a total of 143 files changed, 7123 insertions(+), 1092 deletions(-). The main changes are: 1) Custom SEC() handling in libbpf, from Andrii. 2) subskeleton support, from Delyan. 3) Use btf_tag to recognize __percpu pointers in the verifier, from Hao. 4) Fix net.core.bpf_jit_harden race, from Hou. 5) Fix bpf_sk_lookup remote_port on big-endian, from Jakub. 6) Introduce fprobe (multi kprobe) _without_ arch bits, from Masami. The arch specific bits will come later. 7) Introduce multi_kprobe bpf programs on top of fprobe, from Jiri. 8) Enable non-atomic allocations in local storage, from Joanne. 9) Various var_off ptr_to_btf_id fixed, from Kumar. 10) bpf_ima_file_hash helper, from Roberto. 11) Add "live packet" mode for XDP in BPF_PROG_RUN, from Toke. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (137 commits) selftests/bpf: Fix kprobe_multi test. Revert "rethook: x86: Add rethook x86 implementation" Revert "arm64: rethook: Add arm64 rethook implementation" Revert "powerpc: Add rethook support" Revert "ARM: rethook: Add rethook arm implementation" bpftool: Fix a bug in subskeleton code generation bpf: Fix bpf_prog_pack when PMU_SIZE is not defined bpf: Fix bpf_prog_pack for multi-node setup bpf: Fix warning for cast from restricted gfp_t in verifier bpf, arm: Fix various typos in comments libbpf: Close fd in bpf_object__reuse_map bpftool: Fix print error when show bpf map bpf: Fix kprobe_multi return probe backtrace Revert "bpf: Add support to inline bpf_get_func_ip helper on x86" bpf: Simplify check in btf_parse_hdr() selftests/bpf/test_lirc_mode2.sh: Exit with proper code bpf: Check for NULL return from bpf_get_btf_vmlinux selftests/bpf: Test skipping stacktrace bpf: Adjust BPF stack helper functions to accommodate skip > 0 bpf: Select proper size for bpf_prog_pack ... ==================== Link: https://lore.kernel.org/r/20220322050159.5507-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-22tcp: ensure PMTU updates are processed during fastopenJakub Kicinski1-1/+4
tp->rx_opt.mss_clamp is not populated, yet, during TFO send so we rise it to the local MSS. tp->mss_cache is not updated, however: tcp_v6_connect(): tp->rx_opt.mss_clamp = IPV6_MIN_MTU - headers; tcp_connect(): tcp_connect_init(): tp->mss_cache = min(mtu, tp->rx_opt.mss_clamp) tcp_send_syn_data(): tp->rx_opt.mss_clamp = tp->advmss After recent fixes to ICMPv6 PTB handling we started dropping PMTU updates higher than tp->mss_cache. Because of the stale tp->mss_cache value PMTU updates during TFO are always dropped. Thanks to Wei for helping zero in on the problem and the fix! Fixes: c7bb4b89033b ("ipv6: tcp: drop silly ICMPv6 packet too big messages") Reported-by: Andre Nash <alnash@fb.com> Reported-by: Neil Spring <ntspring@fb.com> Reviewed-by: Wei Wang <weiwan@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220321165957.1769954-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-20netfilter: nf_nat_h323: eliminate anonymous module_init & module_exitRandy Dunlap1-4/+4
Eliminate anonymous module_init() and module_exit(), which can lead to confusion or ambiguity when reading System.map, crashes/oops/bugs, or an initcall_debug log. Give each of these init and exit functions unique driver-specific names to eliminate the anonymous names. Example 1: (System.map) ffffffff832fc78c t init ffffffff832fc79e t init ffffffff832fc8f8 t init Example 2: (initcall_debug log) calling init+0x0/0x12 @ 1 initcall init+0x0/0x12 returned 0 after 15 usecs calling init+0x0/0x60 @ 1 initcall init+0x0/0x60 returned 0 after 2 usecs calling init+0x0/0x9a @ 1 initcall init+0x0/0x9a returned 0 after 74 usecs Fixes: f587de0e2feb ("[NETFILTER]: nf_conntrack/nf_nat: add H.323 helper port") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-20netfilter: nft_fib: add reduce supportFlorian Westphal1-0/+2
The fib expression stores to a register, so we can't add empty stub. Check that the register that is being written is in fact redundant. In most cases, this is expected to cancel tracking as re-use is unlikely. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-20netfilter: nf_tables: do not reduce read-only expressionsPablo Neira Ayuso2-0/+2
Skip register tracking for expressions that perform read-only operations on the registers. Define and use a cookie pointer NFT_REDUCE_READONLY to avoid defining stubs for these expressions. This patch re-enables register tracking which was disabled in ed5f85d42290 ("netfilter: nf_tables: disable register tracking"). Follow up patches add remaining register tracking for existing expressions. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-19ipv4: Fix route lookups when handling ICMP redirects and PMTU updatesGuillaume Nault1-4/+14
The PMTU update and ICMP redirect helper functions initialise their fl4 variable with either __build_flow_key() or build_sk_flow_key(). These initialisation functions always set ->flowi4_scope with RT_SCOPE_UNIVERSE and might set the ECN bits of ->flowi4_tos. This is not a problem when the route lookup is later done via ip_route_output_key_hash(), which properly clears the ECN bits from ->flowi4_tos and initialises ->flowi4_scope based on the RTO_ONLINK flag. However, some helpers call fib_lookup() directly, without sanitising the tos and scope fields, so the route lookup can fail and, as a result, the ICMP redirect or PMTU update aren't taken into account. Fix this by extracting the ->flowi4_tos and ->flowi4_scope sanitisation code into ip_rt_fix_tos(), then use this function in handlers that call fib_lookup() directly. Note 1: We can't sanitise ->flowi4_tos and ->flowi4_scope in a central place (like __build_flow_key() or flowi4_init_output()), because ip_route_output_key_hash() expects non-sanitised values. When called with sanitised values, it can erroneously overwrite RT_SCOPE_LINK with RT_SCOPE_UNIVERSE in ->flowi4_scope. Therefore we have to be careful to sanitise the values only for those paths that don't call ip_route_output_key_hash(). Note 2: The problem is mostly about sanitising ->flowi4_tos. Having ->flowi4_scope initialised with RT_SCOPE_UNIVERSE instead of RT_SCOPE_LINK probably wasn't really a problem: sockets with the SOCK_LOCALROUTE flag set (those that'd result in RTO_ONLINK being set) normally shouldn't receive ICMP redirects or PMTU updates. Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-16net: Add l3mdev index to flow struct and avoid oif reset for port devicesDavid Ahern5-15/+9
The fundamental premise of VRF and l3mdev core code is binding a socket to a device (l3mdev or netdev with an L3 domain) to indicate L3 scope. Legacy code resets flowi_oif to the l3mdev losing any original port device binding. Ben (among others) has demonstrated use cases where the original port device binding is important and needs to be retained. This patch handles that by adding a new entry to the common flow struct that can indicate the l3mdev index for later rule and table matching avoiding the need to reset flowi_oif. In addition to allowing more use cases that require port device binds, this patch brings a few datapath simplications: 1. l3mdev_fib_rule_match is only called when walking fib rules and always after l3mdev_update_flow. That allows an optimization to bail early for non-VRF type uses cases when flowi_l3mdev is not set. Also, only that index needs to be checked for the FIB table id. 2. l3mdev_update_flow can be called with flowi_oif set to a l3mdev (e.g., VRF) device. By resetting flowi_oif only for this case the FLOWI_FLAG_SKIP_NH_OIF flag is not longer needed and can be removed, removing several checks in the datapath. The flowi_iif path can be simplified to only be called if the it is not loopback (loopback can not be assigned to an L3 domain) and the l3mdev index is not already set. 3. Avoid another device lookup in the output path when the fib lookup returns a reject failure. Note: 2 functional tests for local traffic with reject fib rules are updated to reflect the new direct failure at FIB lookup time for ping rather than the failure on packet path. The current code fails like this: HINT: Fails since address on vrf device is out of device scope COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1 ping: Warning: source address might be selected on device other than: eth1 PING 172.16.3.1 (172.16.3.1) from 172.16.3.1 eth1: 56(84) bytes of data. --- 172.16.3.1 ping statistics --- 1 packets transmitted, 0 received, 100% packet loss, time 0ms where the test now directly fails: HINT: Fails since address on vrf device is out of device scope COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1 ping: connect: No route to host Signed-off-by: David Ahern <dsahern@kernel.org> Tested-by: Ben Greear <greearb@candelatech.com> Link: https://lore.kernel.org/r/20220314204551.16369-1-dsahern@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-15bpf, sockmap: Fix double uncharge the mem of sk_msgWang Yufen1-4/+3
If tcp_bpf_sendmsg is running during a tear down operation, psock may be freed. tcp_bpf_sendmsg() tcp_bpf_send_verdict() sk_msg_return() tcp_bpf_sendmsg_redir() unlikely(!psock)) sk_msg_free() The mem of msg has been uncharged in tcp_bpf_send_verdict() by sk_msg_return(), and would be uncharged by sk_msg_free() again. When psock is null, we can simply returning an error code, this would then trigger the sk_msg_free_nocharge in the error path of __SK_REDIRECT and would have the side effect of throwing an error up to user space. This would be a slight change in behavior from user side but would look the same as an error if the redirect on the socket threw an error. This issue can cause the following info: WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260 Call Trace: <TASK> __sk_destruct+0x24/0x1f0 sk_psock_destroy+0x19b/0x1c0 process_one_work+0x1b3/0x3c0 worker_thread+0x30/0x350 ? process_one_work+0x3c0/0x3c0 kthread+0xe6/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x22/0x30 </TASK> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220304081145.2037182-5-wangyufen@huawei.com
2022-03-15bpf, sockmap: Fix more uncharged while msg has more_dataWang Yufen1-2/+5
In tcp_bpf_send_verdict(), if msg has more data after tcp_bpf_sendmsg_redir(): tcp_bpf_send_verdict() tosend = msg->sg.size //msg->sg.size = 22220 case __SK_REDIRECT: sk_msg_return() //uncharged msg->sg.size(22220) sk->sk_forward_alloc tcp_bpf_sendmsg_redir() //after tcp_bpf_sendmsg_redir, msg->sg.size=11000 goto more_data; tosend = msg->sg.size //msg->sg.size = 11000 case __SK_REDIRECT: sk_msg_return() //uncharged msg->sg.size(11000) to sk->sk_forward_alloc The msg->sg.size(11000) has been uncharged twice, to fix we can charge the remaining msg->sg.size before goto more data. This issue can cause the following info: WARNING: CPU: 0 PID: 9860 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0 Call Trace: <TASK> inet_csk_destroy_sock+0x55/0x110 __tcp_close+0x279/0x470 tcp_close+0x1f/0x60 inet_release+0x3f/0x80 __sock_release+0x3d/0xb0 sock_close+0x11/0x20 __fput+0x92/0x250 task_work_run+0x6a/0xa0 do_exit+0x33b/0xb60 do_group_exit+0x2f/0xa0 get_signal+0xb6/0x950 arch_do_signal_or_restart+0xac/0x2a0 ? vfs_write+0x237/0x290 exit_to_user_mode_prepare+0xa9/0x200 syscall_exit_to_user_mode+0x12/0x30 do_syscall_64+0x46/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260 Call Trace: <TASK> __sk_destruct+0x24/0x1f0 sk_psock_destroy+0x19b/0x1c0 process_one_work+0x1b3/0x3c0 worker_thread+0x30/0x350 ? process_one_work+0x3c0/0x3c0 kthread+0xe6/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x22/0x30 </TASK> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220304081145.2037182-4-wangyufen@huawei.com
2022-03-12tcp: unexport tcp_ca_get_key_by_name and tcp_ca_get_name_by_keyChristoph Hellwig1-2/+0
Both functions are only used by core networking code. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220310143229.895319-1-hch@lst.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-2/+9
net/dsa/dsa2.c commit afb3cc1a397d ("net: dsa: unlock the rtnl_mutex when dsa_master_setup() fails") commit e83d56537859 ("net: dsa: replay master state events in dsa_tree_{setup,teardown}_master") https://lore.kernel.org/all/20220307101436.7ae87da0@canb.auug.org.au/ drivers/net/ethernet/intel/ice/ice.h commit 97b0129146b1 ("ice: Fix error with handling of bonding MTU") commit 43113ff73453 ("ice: add TTY for GNSS module for E810T device") https://lore.kernel.org/all/20220310112843.3233bcf1@canb.auug.org.au/ drivers/staging/gdm724x/gdm_lte.c commit fc7f750dc9d1 ("staging: gdm724x: fix use after free in gdm_lte_rx()") commit 4bcc4249b4cf ("staging: Use netif_rx().") https://lore.kernel.org/all/20220308111043.1018a59d@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-10tcp: adjust TSO packet sizes based on min_rttEric Dumazet3-12/+29
Back when tcp_tso_autosize() and TCP pacing were introduced, our focus was really to reduce burst sizes for long distance flows. The simple heuristic of using sk_pacing_rate/1024 has worked well, but can lead to too small packets for hosts in the same rack/cluster, when thousands of flows compete for the bottleneck. Neal Cardwell had the idea of making the TSO burst size a function of both sk_pacing_rate and tcp_min_rtt() Indeed, for local flows, sending bigger bursts is better to reduce cpu costs, as occasional losses can be repaired quite fast. This patch is based on Neal Cardwell implementation done more than two years ago. bbr is adjusting max_pacing_rate based on measured bandwidth, while cubic would over estimate max_pacing_rate. /proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable this new feature, in logarithmic steps. Tested: 100Gbit NIC, two hosts in the same rack, 4K MTU. 600 flows rate-limited to 20000000 bytes per second. Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO) ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered" 96005 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000': 65,945.29 msec task-clock # 2.845 CPUs utilized 1,314,632 context-switches # 19935.279 M/sec 5,292 cpu-migrations # 80.249 M/sec 940,641 page-faults # 14264.023 M/sec 201,117,030,926 cycles # 3049769.216 GHz (83.45%) 17,699,435,405 stalled-cycles-frontend # 8.80% frontend cycles idle (83.48%) 136,584,015,071 stalled-cycles-backend # 67.91% backend cycles idle (83.44%) 53,809,530,436 instructions # 0.27 insn per cycle # 2.54 stalled cycles per insn (83.36%) 9,062,315,523 branches # 137422329.563 M/sec (83.22%) 153,008,621 branch-misses # 1.69% of all branches (83.32%) 23.182970846 seconds time elapsed TcpInSegs 15648792 0.0 TcpOutSegs 58659110 0.0 # Average of 3.7 4K segments per TSO packet TcpExtTCPDelivered 58654791 0.0 TcpExtTCPDeliveredCE 19 0.0 After patch: ~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered" 96046 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000': 48,982.58 msec task-clock # 2.104 CPUs utilized 186,014 context-switches # 3797.599 M/sec 3,109 cpu-migrations # 63.472 M/sec 941,180 page-faults # 19214.814 M/sec 153,459,763,868 cycles # 3132982.807 GHz (83.56%) 12,069,861,356 stalled-cycles-frontend # 7.87% frontend cycles idle (83.32%) 120,485,917,953 stalled-cycles-backend # 78.51% backend cycles idle (83.24%) 36,803,672,106 instructions # 0.24 insn per cycle # 3.27 stalled cycles per insn (83.18%) 5,947,266,275 branches # 121417383.427 M/sec (83.64%) 87,984,616 branch-misses # 1.48% of all branches (83.43%) 23.281200256 seconds time elapsed TcpInSegs 1434706 0.0 TcpOutSegs 58883378 0.0 # Average of 41 4K segments per TSO packet TcpExtTCPDelivered 58878971 0.0 TcpExtTCPDeliveredCE 9664 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-10tcp: autocork: take MSG_EOR hint into considerationEric Dumazet1-1/+2
tcp_should_autocork() is evaluating if it makes senses to not immediately send current skb, hoping that user space will add more payload on it by the time TCP stack reacts to upcoming TX completions. If current skb got MSG_EOR mark, then we know that no further data will be added, it is therefore futile to wait. SOF_TIMESTAMPING_TX_ACK will become a bit more accurate, if prior packets are still in qdisc/device queues. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20220309054706.2857266-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-09Merge branch 'master' of ↵David S. Miller2-2/+9
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2022-03-09 1) Fix IPv6 PMTU discovery for xfrm interfaces. From Lina Wang. 2) Revert failing for policies and states that are configured with XFRMA_IF_ID 0. It broke a user configuration. From Kai Lueke. 3) Fix a possible buffer overflow in the ESP output path. 4) Fix ESP GSO for tunnel and BEET mode on inter address family tunnels. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-09skb: make drop reason booleanableJakub Kicinski2-17/+16
We have a number of cases where function returns drop/no drop decision as a boolean. Now that we want to report the reason code as well we have to pass extra output arguments. We can make the reason code evaluate correctly as bool. I believe we're good to reorder the reasons as they are reported to user space as strings. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-07net: Fix esp GSO on inter address family tunnels.Steffen Klassert1-2/+1
The esp tunnel GSO handlers use skb_mac_gso_segment to push the inner packet to the segmentation handlers. However, skb_mac_gso_segment takes the Ethernet Protocol ID from 'skb->protocol' which is wrong for inter address family tunnels. We fix this by introducing a new skb_eth_gso_segment function. This function can be used if it is necessary to pass the Ethernet Protocol ID directly to the segmentation handler. First users of this function will be the esp4 and esp6 tunnel segmentation handlers. Fixes: c35fe4106b92 ("xfrm: Add mode handlers for IPsec on layer 2") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-03-07esp: Fix BEET mode inter address family tunneling on GSOSteffen Klassert1-0/+3
The xfrm{4,6}_beet_gso_segment() functions did not correctly set the SKB_GSO_IPXIP4 and SKB_GSO_IPXIP6 gso types for the address family tunneling case. Fix this by setting these gso types. Fixes: 384a46ea7bdc7 ("esp4: add gso_segment for esp4 beet mode") Fixes: 7f9e40eb18a99 ("esp6: add gso_segment for esp6 beet mode") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-03-07esp: Fix possible buffer overflow in ESP transformationSteffen Klassert1-0/+5
The maximum message size that can be send is bigger than the maximum site that skb_page_frag_refill can allocate. So it is possible to write beyond the allocated buffer. Fix this by doing a fallback to COW in that case. v2: Avoid get get_order() costs as suggested by Linus Torvalds. Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Reported-by: valis <sec@valis.email> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>