summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)AuthorFilesLines
2022-02-24ipv6: prevent a possible race condition with lifetimesNiels Dossche1-0/+2
valid_lft, prefered_lft and tstamp are always accessed under the lock "lock" in other places. Reading these without taking the lock may result in inconsistencies regarding the calculation of the valid and preferred variables since decisions are taken on these fields for those variables. Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Niels Dossche <niels.dossche@ugent.be> Link: https://lore.kernel.org/r/20220223131954.6570-1-niels.dossche@ugent.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-21gso: do not skip outer ip header in case of ipip and net_failoverTao Liu1-0/+2
We encounter a tcp drop issue in our cloud environment. Packet GROed in host forwards to a VM virtio_net nic with net_failover enabled. VM acts as a IPVS LB with ipip encapsulation. The full path like: host gro -> vm virtio_net rx -> net_failover rx -> ipvs fullnat -> ipip encap -> net_failover tx -> virtio_net tx When net_failover transmits a ipip pkt (gso_type = 0x0103, which means SKB_GSO_TCPV4, SKB_GSO_DODGY and SKB_GSO_IPXIP4), there is no gso did because it supports TSO and GSO_IPXIP4. But network_header points to inner ip header. Call Trace: tcp4_gso_segment ------> return NULL inet_gso_segment ------> inner iph, network_header points to ipip_gso_segment inet_gso_segment ------> outer iph skb_mac_gso_segment Afterwards virtio_net transmits the pkt, only inner ip header is modified. And the outer one just keeps unchanged. The pkt will be dropped in remote host. Call Trace: inet_gso_segment ------> inner iph, outer iph is skipped skb_mac_gso_segment __skb_gso_segment validate_xmit_skb validate_xmit_skb_list sch_direct_xmit __qdisc_run __dev_queue_xmit ------> virtio_net dev_hard_start_xmit __dev_queue_xmit ------> net_failover ip_finish_output2 ip_output iptunnel_xmit ip_tunnel_xmit ipip_tunnel_xmit ------> ipip dev_hard_start_xmit __dev_queue_xmit ip_finish_output2 ip_output ip_forward ip_rcv __netif_receive_skb_one_core netif_receive_skb_internal napi_gro_receive receive_buf virtnet_poll net_rx_action The root cause of this issue is specific with the rare combination of SKB_GSO_DODGY and a tunnel device that adds an SKB_GSO_ tunnel option. SKB_GSO_DODGY is set from external virtio_net. We need to reset network header when callbacks.gso_segment() returns NULL. This patch also includes ipv6_gso_segment(), considering SIT, etc. Fixes: cb32f511a70b ("ipip: add GSO/TSO support") Signed-off-by: Tao Liu <thomas.liu@ucloud.cn> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-18net-timestamp: convert sk->sk_tskey to atomic_tEric Dumazet1-1/+1
UDP sendmsg() can be lockless, this is causing all kinds of data races. This patch converts sk->sk_tskey to remove one of these races. BUG: KCSAN: data-race in __ip_append_data / __ip_append_data read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1: __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0: __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000054d -> 0x0000054e Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85fa6f-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 09c2d251b707 ("net-timestamp: add key to disambiguate concurrent datagrams") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-17ipv6: fix data-race in fib6_info_hw_flags_set / fib6_purge_rtEric Dumazet1-9/+10
Because fib6_info_hw_flags_set() is called without any synchronization, all accesses to gi6->offload, fi->trap and fi->offload_failed need some basic protection like READ_ONCE()/WRITE_ONCE(). BUG: KCSAN: data-race in fib6_info_hw_flags_set / fib6_purge_rt read to 0xffff8881087d5886 of 1 bytes by task 13953 on cpu 0: fib6_drop_pcpu_from net/ipv6/ip6_fib.c:1007 [inline] fib6_purge_rt+0x4f/0x580 net/ipv6/ip6_fib.c:1033 fib6_del_route net/ipv6/ip6_fib.c:1983 [inline] fib6_del+0x696/0x890 net/ipv6/ip6_fib.c:2028 __ip6_del_rt net/ipv6/route.c:3876 [inline] ip6_del_rt+0x83/0x140 net/ipv6/route.c:3891 __ipv6_dev_ac_dec+0x2b5/0x370 net/ipv6/anycast.c:374 ipv6_dev_ac_dec net/ipv6/anycast.c:387 [inline] __ipv6_sock_ac_close+0x141/0x200 net/ipv6/anycast.c:207 ipv6_sock_ac_close+0x79/0x90 net/ipv6/anycast.c:220 inet6_release+0x32/0x50 net/ipv6/af_inet6.c:476 __sock_release net/socket.c:650 [inline] sock_close+0x6c/0x150 net/socket.c:1318 __fput+0x295/0x520 fs/file_table.c:280 ____fput+0x11/0x20 fs/file_table.c:313 task_work_run+0x8e/0x110 kernel/task_work.c:164 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:175 [inline] exit_to_user_mode_prepare+0x160/0x190 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300 do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae write to 0xffff8881087d5886 of 1 bytes by task 1912 on cpu 1: fib6_info_hw_flags_set+0x155/0x3b0 net/ipv6/route.c:6230 nsim_fib6_rt_hw_flags_set drivers/net/netdevsim/fib.c:668 [inline] nsim_fib6_rt_add drivers/net/netdevsim/fib.c:691 [inline] nsim_fib6_rt_insert drivers/net/netdevsim/fib.c:756 [inline] nsim_fib6_event drivers/net/netdevsim/fib.c:853 [inline] nsim_fib_event drivers/net/netdevsim/fib.c:886 [inline] nsim_fib_event_work+0x284f/0x2cf0 drivers/net/netdevsim/fib.c:1477 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307 worker_thread+0x616/0xa70 kernel/workqueue.c:2454 kthread+0x2c7/0x2e0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 value changed: 0x22 -> 0x2a Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 1912 Comm: kworker/1:3 Not tainted 5.16.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events nsim_fib_event_work Fixes: 0c5fcf9e249e ("IPv6: Add "offload failed" indication to routes") Fixes: bb3c4ab93e44 ("ipv6: Add "offload" and "trap" indications to routes") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Amit Cohen <amcohen@nvidia.com> Cc: Ido Schimmel <idosch@nvidia.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220216173217.3792411-2-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17ipv6: per-netns exclusive flowlabel checksWillem de Bruijn1-1/+3
Ipv6 flowlabels historically require a reservation before use. Optionally in exclusive mode (e.g., user-private). Commit 59c820b2317f ("ipv6: elide flowlabel check if no exclusive leases exist") introduced a fastpath that avoids this check when no exclusive leases exist in the system, and thus any flowlabel use will be granted. That allows skipping the control operation to reserve a flowlabel entirely. Though with a warning if the fast path fails: This is an optimization. Robust applications still have to revert to requesting leases if the fast path fails due to an exclusive lease. Still, this is subtle. Better isolate network namespaces from each other. Flowlabels are per-netns. Also record per-netns whether exclusive leases are in use. Then behavior does not change based on activity in other netns. Changes v2 - wrap in IS_ENABLED(CONFIG_IPV6) to avoid breakage if disabled Fixes: 59c820b2317f ("ipv6: elide flowlabel check if no exclusive leases exist") Link: https://lore.kernel.org/netdev/MWHPR2201MB1072BCCCFCE779E4094837ACD0329@MWHPR2201MB1072.namprd22.prod.outlook.com/ Reported-by: Congyu Liu <liu3101@purdue.edu> Signed-off-by: Willem de Bruijn <willemb@google.com> Tested-by: Congyu Liu <liu3101@purdue.edu> Link: https://lore.kernel.org/r/20220215160037.1976072-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-14ipv6: mcast: use rcu-safe version of ipv6_get_lladdr()Ignat Korchagin2-3/+3
Some time ago 8965779d2c0e ("ipv6,mcast: always hold idev->lock before mca_lock") switched ipv6_get_lladdr() to __ipv6_get_lladdr(), which is rcu-unsafe version. That was OK, because idev->lock was held for these codepaths. In 88e2ca308094 ("mld: convert ifmcaddr6 to RCU") these external locks were removed, so we probably need to restore the original rcu-safe call. Otherwise, we occasionally get a machine crashed/stalled with the following in dmesg: [ 3405.966610][T230589] general protection fault, probably for non-canonical address 0xdead00000000008c: 0000 [#1] SMP NOPTI [ 3405.982083][T230589] CPU: 44 PID: 230589 Comm: kworker/44:3 Tainted: G O 5.15.19-cloudflare-2022.2.1 #1 [ 3405.998061][T230589] Hardware name: SUPA-COOL-SERV [ 3406.009552][T230589] Workqueue: mld mld_ifc_work [ 3406.017224][T230589] RIP: 0010:__ipv6_get_lladdr+0x34/0x60 [ 3406.025780][T230589] Code: 57 10 48 83 c7 08 48 89 e5 48 39 d7 74 3e 48 8d 82 38 ff ff ff eb 13 48 8b 90 d0 00 00 00 48 8d 82 38 ff ff ff 48 39 d7 74 22 <66> 83 78 32 20 77 1b 75 e4 89 ca 23 50 2c 75 dd 48 8b 50 08 48 8b [ 3406.055748][T230589] RSP: 0018:ffff94e4b3fc3d10 EFLAGS: 00010202 [ 3406.065617][T230589] RAX: dead00000000005a RBX: ffff94e4b3fc3d30 RCX: 0000000000000040 [ 3406.077477][T230589] RDX: dead000000000122 RSI: ffff94e4b3fc3d30 RDI: ffff8c3a31431008 [ 3406.089389][T230589] RBP: ffff94e4b3fc3d10 R08: 0000000000000000 R09: 0000000000000000 [ 3406.101445][T230589] R10: ffff8c3a31430000 R11: 000000000000000b R12: ffff8c2c37887100 [ 3406.113553][T230589] R13: ffff8c3a39537000 R14: 00000000000005dc R15: ffff8c3a31431000 [ 3406.125730][T230589] FS: 0000000000000000(0000) GS:ffff8c3b9fc80000(0000) knlGS:0000000000000000 [ 3406.138992][T230589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3406.149895][T230589] CR2: 00007f0dfea1db60 CR3: 000000387b5f2000 CR4: 0000000000350ee0 [ 3406.162421][T230589] Call Trace: [ 3406.170235][T230589] <TASK> [ 3406.177736][T230589] mld_newpack+0xfe/0x1a0 [ 3406.186686][T230589] add_grhead+0x87/0xa0 [ 3406.195498][T230589] add_grec+0x485/0x4e0 [ 3406.204310][T230589] ? newidle_balance+0x126/0x3f0 [ 3406.214024][T230589] mld_ifc_work+0x15d/0x450 [ 3406.223279][T230589] process_one_work+0x1e6/0x380 [ 3406.232982][T230589] worker_thread+0x50/0x3a0 [ 3406.242371][T230589] ? rescuer_thread+0x360/0x360 [ 3406.252175][T230589] kthread+0x127/0x150 [ 3406.261197][T230589] ? set_kthread_struct+0x40/0x40 [ 3406.271287][T230589] ret_from_fork+0x22/0x30 [ 3406.280812][T230589] </TASK> [ 3406.288937][T230589] Modules linked in: ... [last unloaded: kheaders] [ 3406.476714][T230589] ---[ end trace 3525a7655f2f3b9e ]--- Fixes: 88e2ca308094 ("mld: convert ifmcaddr6 to RCU") Reported-by: David Pinilla Caparros <dpini@cloudflare.com> Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-09ipmr,ip6mr: acquire RTNL before calling ip[6]mr_free_table() on failure pathEric Dumazet1-0/+2
ip[6]mr_free_table() can only be called under RTNL lock. RTNL: assertion failed at net/core/dev.c (10367) WARNING: CPU: 1 PID: 5890 at net/core/dev.c:10367 unregister_netdevice_many+0x1246/0x1850 net/core/dev.c:10367 Modules linked in: CPU: 1 PID: 5890 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller-11627-g422ee58dc0ef #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:unregister_netdevice_many+0x1246/0x1850 net/core/dev.c:10367 Code: 0f 85 9b ee ff ff e8 69 07 4b fa ba 7f 28 00 00 48 c7 c6 00 90 ae 8a 48 c7 c7 40 90 ae 8a c6 05 6d b1 51 06 01 e8 8c 90 d8 01 <0f> 0b e9 70 ee ff ff e8 3e 07 4b fa 4c 89 e7 e8 86 2a 59 fa e9 ee RSP: 0018:ffffc900046ff6e0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff888050f51d00 RSI: ffffffff815fa008 RDI: fffff520008dfece RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff815f3d6e R11: 0000000000000000 R12: 00000000fffffff4 R13: dffffc0000000000 R14: ffffc900046ff750 R15: ffff88807b7dc000 FS: 00007f4ab736e700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fee0b4f8990 CR3: 000000001e7d2000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> mroute_clean_tables+0x244/0xb40 net/ipv6/ip6mr.c:1509 ip6mr_free_table net/ipv6/ip6mr.c:389 [inline] ip6mr_rules_init net/ipv6/ip6mr.c:246 [inline] ip6mr_net_init net/ipv6/ip6mr.c:1306 [inline] ip6mr_net_init+0x3f0/0x4e0 net/ipv6/ip6mr.c:1298 ops_init+0xaf/0x470 net/core/net_namespace.c:140 setup_net+0x54f/0xbb0 net/core/net_namespace.c:331 copy_net_ns+0x318/0x760 net/core/net_namespace.c:475 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110 copy_namespaces+0x391/0x450 kernel/nsproxy.c:178 copy_process+0x2e0c/0x7300 kernel/fork.c:2167 kernel_clone+0xe7/0xab0 kernel/fork.c:2555 __do_sys_clone+0xc8/0x110 kernel/fork.c:2672 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f4ab89f9059 Code: Unable to access opcode bytes at RIP 0x7f4ab89f902f. RSP: 002b:00007f4ab736e118 EFLAGS: 00000206 ORIG_RAX: 0000000000000038 RAX: ffffffffffffffda RBX: 00007f4ab8b0bf60 RCX: 00007f4ab89f9059 RDX: 0000000020000280 RSI: 0000000020000270 RDI: 0000000040200000 RBP: 00007f4ab8a5308d R08: 0000000020000300 R09: 0000000020000300 R10: 00000000200002c0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffc3977cc1f R14: 00007f4ab736e300 R15: 0000000000022000 </TASK> Fixes: f243e5a7859a ("ipmr,ip6mr: call ip6mr_free_table() on failure path") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <cong.wang@bytedance.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220208053451.2885398-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski3-7/+0
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Remove leftovers from flowtable modules, from Geert Uytterhoeven. 2) Missing refcount increment of conntrack template in nft_ct, from Florian Westphal. 3) Reduce nft_zone selftest time, also from Florian. 4) Add selftest to cover stateless NAT on fragments, from Florian Westphal. 5) Do not set net_device when for reject packets from the bridge path, from Phil Sutter. 6) Cancel register tracking info on nft_byteorder operations. 7) Extend nft_concat_range selftest to cover set reload with no elements, from Florian Westphal. 8) Remove useless update of pointer in chain blob builder, reported by kbuild test robot. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: nf_tables: remove assignment with no effect in chain blob builder selftests: nft_concat_range: add test for reload with no element add/del netfilter: nft_byteorder: track register operations netfilter: nft_reject_bridge: Fix for missing reply from prerouting selftests: netfilter: check stateless nat udp checksum fixup selftests: netfilter: reduce zone stress test running time netfilter: nft_ct: fix use after free when attaching zone template netfilter: Remove flowtable relics ==================== Link: https://lore.kernel.org/r/20220127235235.656931-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27Revert "ipv6: Honor all IPv6 PIO Valid Lifetime values"Guillaume Nault1-7/+20
This reverts commit b75326c201242de9495ff98e5d5cff41d7fc0d9d. This commit breaks Linux compatibility with USGv6 tests. The RFC this commit was based on is actually an expired draft: no published RFC currently allows the new behaviour it introduced. Without full IETF endorsement, the flash renumbering scenario this patch was supposed to enable is never going to work, as other IPv6 equipements on the same LAN will keep the 2 hours limit. Fixes: b75326c20124 ("ipv6: Honor all IPv6 PIO Valid Lifetime values") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-27netfilter: Remove flowtable relicsGeert Uytterhoeven3-7/+0
NF_FLOW_TABLE_IPV4 and NF_FLOW_TABLE_IPV6 are invisble, selected by nothing (so they can no longer be enabled), and their last real users have been removed (nf_flow_table_ipv6.c is empty). Clean up the leftovers. Fixes: c42ba4290b2147aa ("netfilter: flowtable: remove ipv4/ipv6 modules") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-21ipv6: annotate accesses to fn->fn_sernumEric Dumazet2-11/+14
struct fib6_node's fn_sernum field can be read while other threads change it. Add READ_ONCE()/WRITE_ONCE() annotations. Do not change existing smp barriers in fib6_get_cookie_safe() and __fib6_update_sernum_upto_root() syzbot reported: BUG: KCSAN: data-race in fib6_clean_node / inet6_csk_route_socket write to 0xffff88813df62e2c of 4 bytes by task 1920 on cpu 1: fib6_clean_node+0xc2/0x260 net/ipv6/ip6_fib.c:2178 fib6_walk_continue+0x38e/0x430 net/ipv6/ip6_fib.c:2112 fib6_walk net/ipv6/ip6_fib.c:2160 [inline] fib6_clean_tree net/ipv6/ip6_fib.c:2240 [inline] __fib6_clean_all+0x1a9/0x2e0 net/ipv6/ip6_fib.c:2256 fib6_flush_trees+0x6c/0x80 net/ipv6/ip6_fib.c:2281 rt_genid_bump_ipv6 include/net/net_namespace.h:488 [inline] addrconf_dad_completed+0x57f/0x870 net/ipv6/addrconf.c:4230 addrconf_dad_work+0x908/0x1170 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307 worker_thread+0x616/0xa70 kernel/workqueue.c:2454 kthread+0x1bf/0x1e0 kernel/kthread.c:359 ret_from_fork+0x1f/0x30 read to 0xffff88813df62e2c of 4 bytes by task 15701 on cpu 0: fib6_get_cookie_safe include/net/ip6_fib.h:285 [inline] rt6_get_cookie include/net/ip6_fib.h:306 [inline] ip6_dst_store include/net/ip6_route.h:234 [inline] inet6_csk_route_socket+0x352/0x3c0 net/ipv6/inet6_connection_sock.c:109 inet6_csk_xmit+0x91/0x1e0 net/ipv6/inet6_connection_sock.c:121 __tcp_transmit_skb+0x1323/0x1840 net/ipv4/tcp_output.c:1402 tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline] tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680 __tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864 tcp_push+0x2d9/0x2f0 net/ipv4/tcp.c:725 mptcp_push_release net/mptcp/protocol.c:1491 [inline] __mptcp_push_pending+0x46c/0x490 net/mptcp/protocol.c:1578 mptcp_sendmsg+0x9ec/0xa50 net/mptcp/protocol.c:1764 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:643 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] kernel_sendmsg+0x97/0xd0 net/socket.c:745 sock_no_sendpage+0x84/0xb0 net/core/sock.c:3086 inet_sendpage+0x9d/0xc0 net/ipv4/af_inet.c:834 kernel_sendpage+0x187/0x200 net/socket.c:3492 sock_sendpage+0x5a/0x70 net/socket.c:1007 pipe_to_sendpage+0x128/0x160 fs/splice.c:364 splice_from_pipe_feed fs/splice.c:418 [inline] __splice_from_pipe+0x207/0x500 fs/splice.c:562 splice_from_pipe fs/splice.c:597 [inline] generic_splice_sendpage+0x94/0xd0 fs/splice.c:746 do_splice_from fs/splice.c:767 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:936 splice_direct_to_actor+0x345/0x650 fs/splice.c:891 do_splice_direct+0x106/0x190 fs/splice.c:979 do_sendfile+0x675/0xc40 fs/read_write.c:1245 __do_sys_sendfile64 fs/read_write.c:1310 [inline] __se_sys_sendfile64 fs/read_write.c:1296 [inline] __x64_sys_sendfile64+0x102/0x140 fs/read_write.c:1296 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000026f -> 0x00000271 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 15701 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 The Fixes tag I chose is probably arbitrary, I do not think we need to backport this patch to older kernels. Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220120174112.1126644-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-20ipv6_tunnel: Rate limit warning messagesIdo Schimmel1-4/+4
The warning messages can be invoked from the data path for every packet transmitted through an ip6gre netdev, leading to high CPU utilization. Fix that by rate limiting the messages. Fixes: 09c6bbf090ec ("[IPV6]: Do mandatory IPv6 tunnel endpoint checks in realtime") Reported-by: Maksym Yaremchuk <maksymy@nvidia.com> Tested-by: Maksym Yaremchuk <maksymy@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-13sit: allow encapsulated IPv6 traffic to be delivered locallyIgnat Korchagin1-1/+1
While experimenting with FOU encapsulation Amir noticed that encapsulated IPv6 traffic fails to be delivered, if the peer IP address is configured locally. It can be easily verified by creating a sit interface like below: $ sudo ip link add name fou_test type sit remote 127.0.0.1 encap fou encap-sport auto encap-dport 1111 $ sudo ip link set fou_test up and sending some IPv4 and IPv6 traffic to it $ ping -I fou_test -c 1 1.1.1.1 $ ping6 -I fou_test -c 1 fe80::d0b0:dfff:fe4c:fcbc "tcpdump -i any udp dst port 1111" will confirm that only the first IPv4 ping was encapsulated and attempted to be delivered. This seems like a limitation: for example, in a cloud environment the "peer" service may be arbitrarily scheduled on any server within the cluster, where all nodes are trying to send encapsulated traffic. And the unlucky node will not be able to. Moreover, delivering encapsulated IPv4 traffic locally is allowed. But I may not have all the context about this restriction and this code predates the observable git history. Reported-by: Amir Razmjou <arazmjou@cloudflare.com> Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220107123842.211335-1-ignat@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+4
Merge in fixes directly in prep for the 5.17 merge window. No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextJakub Kicinski2-44/+2
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next. This includes one patch to update ovs and act_ct to use nf_ct_put() instead of nf_conntrack_put(). 1) Add netns_tracker to nfnetlink_log and masquerade, from Eric Dumazet. 2) Remove redundant rcu read-size lock in nf_tables packet path. 3) Replace BUG() by WARN_ON_ONCE() in nft_payload. 4) Consolidate rule verdict tracing. 5) Replace WARN_ON() by WARN_ON_ONCE() in nf_tables core. 6) Make counter support built-in in nf_tables. 7) Add new field to conntrack object to identify locally generated traffic, from Florian Westphal. 8) Prevent NAT from shadowing well-known ports, from Florian Westphal. 9) Merge nf_flow_table_{ipv4,ipv6} into nf_flow_table_inet, also from Florian. 10) Remove redundant pointer in nft_pipapo AVX2 support, from Colin Ian King. 11) Replace opencoded max() in conntrack, from Jiapeng Chong. 12) Update conntrack to use refcount_t API, from Florian Westphal. 13) Move ip_ct_attach indirection into the nf_ct_hook structure. 14) Constify several pointer object in the netfilter codebase, from Florian Westphal. 15) Tree-wide replacement of nf_conntrack_put() by nf_ct_put(), also from Florian. 16) Fix egress splat due to incorrect rcu notation, from Florian. 17) Move stateful fields of connlimit, last, quota, numgen and limit out of the expression data area. 18) Build a blob to represent the ruleset in nf_tables, this is a requirement of the new register tracking infrastructure. 19) Add NFT_REG32_NUM to define the maximum number of 32-bit registers. 20) Add register tracking infrastructure to skip redundant store-to-register operations, this includes support for payload, meta and bitwise expresssions. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: (32 commits) netfilter: nft_meta: cancel register tracking after meta update netfilter: nft_payload: cancel register tracking after payload update netfilter: nft_bitwise: track register operations netfilter: nft_meta: track register operations netfilter: nft_payload: track register operations netfilter: nf_tables: add register tracking infrastructure netfilter: nf_tables: add NFT_REG32_NUM netfilter: nf_tables: add rule blob layout netfilter: nft_limit: move stateful fields out of expression data netfilter: nft_limit: rename stateful structure netfilter: nft_numgen: move stateful fields out of expression data netfilter: nft_quota: move stateful fields out of expression data netfilter: nft_last: move stateful fields out of expression data netfilter: nft_connlimit: move stateful fields out of expression data netfilter: egress: avoid a lockdep splat net: prefer nf_ct_put instead of nf_conntrack_put netfilter: conntrack: avoid useless indirection during conntrack destruction netfilter: make function op structures const netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook netfilter: conntrack: convert to refcount_t api ... ==================== Link: https://lore.kernel.org/r/20220109231640.104123-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-07Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski4-0/+5
Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-01-06 We've added 41 non-merge commits during the last 2 day(s) which contain a total of 36 files changed, 1214 insertions(+), 368 deletions(-). The main changes are: 1) Various fixes in the verifier, from Kris and Daniel. 2) Fixes in sockmap, from John. 3) bpf_getsockopt fix, from Kuniyuki. 4) INET_POST_BIND fix, from Menglong. 5) arm64 JIT fix for bpf pseudo funcs, from Hou. 6) BPF ISA doc improvements, from Christoph. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits) bpf: selftests: Add bind retry for post_bind{4, 6} bpf: selftests: Use C99 initializers in test_sock.c net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() bpf/selftests: Test bpf_d_path on rdonly_mem. libbpf: Add documentation for bpf_map batch operations selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3 xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames xdp: Move conversion to xdp_frame out of map functions page_pool: Store the XDP mem id page_pool: Add callback to init pages when they are allocated xdp: Allow registering memory model without rxq reference samples/bpf: xdpsock: Add timestamp for Tx-only operation samples/bpf: xdpsock: Add time-out for cleaning Tx samples/bpf: xdpsock: Add sched policy and priority support samples/bpf: xdpsock: Add cyclic TX operation capability samples/bpf: xdpsock: Add clockid selection support samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operation samples/bpf: xdpsock: Add VLAN support for Tx-only operation libbpf 1.0: Deprecate bpf_object__find_map_by_offset() API libbpf 1.0: Deprecate bpf_map__is_offload_neutral() ... ==================== Link: https://lore.kernel.org/r/20220107013626.53943-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-07net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()Menglong Dong4-0/+5
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in __inet_bind() is not handled properly. While the return value is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and exit: err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } Let's take UDP for example and see what will happen. For UDP socket, it will be added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port() called success. If 'inet->inet_rcv_saddr' is specified here, then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong to (because inet_saddr is changed to 0), and UDP packet received will not be passed to this sock. If 'inet->inet_rcv_saddr' is not specified here, the sock will work fine, as it can receive packet properly, which is wired, as the 'bind()' is already failed. To undo the get_port() operation, introduce the 'put_port' field for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP proto, it is udp_lib_unhash(); For icmp proto, it is ping_unhash(). Therefore, after sys_bind() fail caused by BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which means that it can try to be binded to another port. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
2022-01-06Merge branch 'master' of ↵David S. Miller1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2022-01-06 1) Fix xfrm policy lookups for ipv6 gre packets by initializing fl6_gre_key properly. From Ghalem Boudour. 2) Fix the dflt policy check on forwarding when there is no policy configured. The check was done for the wrong direction. From Nicolas Dichtel. 3) Use the correct 'struct xfrm_user_offload' when calculating netlink message lenghts in xfrm_sa_len(). From Eric Dumazet. 4) Tread inserting xfrm interface id 0 as an error. From Antony Antony. 5) Fail if xfrm state or policy is inserted with XFRMA_IF_ID 0, xfrm interfaces with id 0 are not allowed. From Antony Antony. 6) Fix inner_ipproto setting in the sec_path for tunnel mode. From Raed Salem. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge branch 'master' of ↵David S. Miller1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2022-01-06 1) Fix some clang_analyzer warnings about never read variables. From luo penghao. 2) Check for pols[0] only once in xfrm_expand_policies(). From Jean Sacren. 3) The SA curlft.use_time was updated only on SA cration time. Update whenever the SA is used. From Antony Antony 4) Add support for SM3 secure hash. From Xu Jia. 5) Add support for SM4 symmetric cipher algorithm. From Xu Jia. 6) Add a rate limit for SA mapping change messages. From Antony Antony. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski5-35/+98
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-04udp6: Use Segment Routing Header for dest address if presentAndrew Lunn1-1/+2
When finding the socket to report an error on, if the invoking packet is using Segment Routing, the IPv6 destination address is that of an intermediate router, not the end destination. Extract the ultimate destination address from the segment address. This change allows traceroute to function in the presence of Segment Routing. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04icmp: ICMPV6: Examine invoking packet for Segment Route Headers.Andrew Lunn2-1/+35
RFC8754 says: ICMP error packets generated within the SR domain are sent to source nodes within the SR domain. The invoking packet in the ICMP error message may contain an SRH. Since the destination address of a packet with an SRH changes as each segment is processed, it may not be the destination used by the socket or application that generated the invoking packet. For the source of an invoking packet to process the ICMP error message, the ultimate destination address of the IPv6 header may be required. The following logic is used to determine the destination address for use by protocol-error handlers. * Walk all extension headers of the invoking IPv6 packet to the routing extension header preceding the upper-layer header. - If routing header is type 4 Segment Routing Header (SRH) o The SID at Segment List[0] may be used as the destination address of the invoking packet. Mangle the skb so the network header points to the invoking packet inside the ICMP packet. The seg6 helpers can then be used on the skb to find any segment routing headers. If found, mark this fact in the IPv6 control block of the skb, and store the offset into the packet of the SRH. Then restore the skb back to its old state. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04seg6: export get_srh() for ICMP handlingAndrew Lunn2-31/+31
An ICMP error message can contain in its message body part of an IPv6 packet which invoked the error. Such a packet might contain a segment router header. Export get_srh() so the ICMP code can make use of it. Since his changes the scope of the function from local to global, add the seg6_ prefix to keep the namespace clean. And move it into seg6.c so it is always available, not just when IPV6_SEG6_LWTUNNEL is enabled. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-03ipv6: Do cleanup if attribute validation fails in multipath routeDavid Ahern1-5/+3
As Nicolas noted, if gateway validation fails walking the multipath attribute the code should jump to the cleanup to free previously allocated memory. Fixes: 1ff15a710a86 ("ipv6: Check attribute length for RTA_GATEWAY when deleting multipath route") Signed-off-by: David Ahern <dsahern@kernel.org> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20220103170555.94638-1-dsahern@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-03ipv6: Continue processing multipath route even if gateway attribute is invalidDavid Ahern1-2/+5
ip6_route_multipath_del loop continues processing the multipath attribute even if delete of a nexthop path fails. For consistency, do the same if the gateway attribute is invalid. Fixes: 1ff15a710a86 ("ipv6: Check attribute length for RTA_GATEWAY when deleting multipath route") Signed-off-by: David Ahern <dsahern@kernel.org> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20220103171911.94739-1-dsahern@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-02ipv6: ioam: Support for Queue depth data fieldJustin Iurman1-1/+15
v3: - Report 'backlog' (bytes) instead of 'qlen' (number of packets) v2: - Fix sparse warning (use rcu_dereference) This patch adds support for the queue depth in IOAM trace data fields. The draft [1] says the following: The "queue depth" field is a 4-octet unsigned integer field. This field indicates the current length of the egress interface queue of the interface from where the packet is forwarded out. The queue depth is expressed as the current amount of memory buffers used by the queue (a packet could consume one or more memory buffers, depending on its size). An existing function (i.e., qdisc_qstats_qlen_backlog) is used to retrieve the current queue length without reinventing the wheel. Note: it was tested and qlen is increasing when an artificial delay is added on the egress with tc. [1] https://datatracker.ietf.org/doc/html/draft-ietf-ippm-ioam-data#section-5.4.2.7 Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller3-0/+3
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-12-30 The following pull-request contains BPF updates for your *net-next* tree. We've added 72 non-merge commits during the last 20 day(s) which contain a total of 223 files changed, 3510 insertions(+), 1591 deletions(-). The main changes are: 1) Automatic setrlimit in libbpf when bpf is memcg's in the kernel, from Andrii. 2) Beautify and de-verbose verifier logs, from Christy. 3) Composable verifier types, from Hao. 4) bpf_strncmp helper, from Hou. 5) bpf.h header dependency cleanup, from Jakub. 6) get_func_[arg|ret|arg_cnt] helpers, from Jiri. 7) Sleepable local storage, from KP. 8) Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support, from Kumar. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-31lwtunnel: Validate RTA_ENCAP_TYPE attribute lengthDavid Ahern1-0/+4
lwtunnel_valid_encap_type_attr is used to validate encap attributes within a multipath route. Add length validation checking to the type. lwtunnel_valid_encap_type_attr is called converting attributes to fib{6,}_config struct which means it is used before fib_get_nhs, ip6_route_multipath_add, and ip6_route_multipath_del - other locations that use rtnh_ok and then nla_get_u16 on RTA_ENCAP_TYPE attribute. Fixes: 9ed59592e3e3 ("lwtunnel: fix autoload of lwt modules") Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-31ipv6: Check attribute length for RTA_GATEWAY when deleting multipath routeDavid Ahern1-1/+5
Make sure RTA_GATEWAY for IPv6 multipath route has enough bytes to hold an IPv6 address. Fixes: 6b9ea5a64ed5 ("ipv6: fix multipath route replace error recovery") Signed-off-by: David Ahern <dsahern@kernel.org> Cc: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-31ipv6: Check attribute length for RTA_GATEWAY in multipath routeDavid Ahern1-1/+20
Commit referenced in the Fixes tag used nla_memcpy for RTA_GATEWAY as does the current nla_get_in6_addr. nla_memcpy protects against accessing memory greater than what is in the attribute, but there is no check requiring the attribute to have an IPv6 address. Add it. Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)") Signed-off-by: David Ahern <dsahern@kernel.org> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-1/+6
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c commit 077cdda764c7 ("net/mlx5e: TC, Fix memory leak with rules with internal port") commit 31108d142f36 ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'") commit 4390c6edc0fb ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'") https://lore.kernel.org/all/20211229065352.30178-1-saeed@kernel.org/ net/smc/smc_wr.c commit 49dc9013e34b ("net/smc: Use the bitmap API when applicable") commit 349d43127dac ("net/smc: fix kernel panic caused by race of smc_sock") bitmap_zero()/memset() is removed by the fix Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-29ipv6: raw: check passed optlen before readingTamir Duberstein1-0/+3
Add a check that the user-provided option is at least as long as the number of bytes we intend to read. Before this patch we would blindly read sizeof(int) bytes even in cases where the user passed optlen<sizeof(int), which would potentially read garbage or fault. Discovered by new tests in https://github.com/google/gvisor/pull/6957 . The original get_user call predates history in the git repo. Signed-off-by: Tamir Duberstein <tamird@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20211229200947.2862255-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-29net: Don't include filter.h from net/sock.hJakub Kicinski2-0/+2
sock.h is pretty heavily used (5k objects rebuilt on x86 after it's touched). We can drop the include of filter.h from it and add a forward declaration of struct sk_filter instead. This decreases the number of rebuilt objects when bpf.h is touched from ~5k to ~1k. There's a lot of missing includes this was masking. Primarily in networking tho, this time. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org
2021-12-27ip6_vti: initialize __ip6_tnl_parm struct in vti6_siocdevprivateWilliam Zhao1-0/+2
The "__ip6_tnl_parm" struct was left uninitialized causing an invalid load of random data when the "__ip6_tnl_parm" struct was used elsewhere. As an example, in the function "ip6_tnl_xmit_ctl()", it tries to access the "collect_md" member. With "__ip6_tnl_parm" being uninitialized and containing random data, the UBSAN detected that "collect_md" held a non-boolean value. The UBSAN issue is as follows: =============================================================== UBSAN: invalid-load in net/ipv6/ip6_tunnel.c:1025:14 load of value 30 is not a valid value for type '_Bool' CPU: 1 PID: 228 Comm: kworker/1:3 Not tainted 5.16.0-rc4+ #8 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Workqueue: ipv6_addrconf addrconf_dad_work Call Trace: <TASK> dump_stack_lvl+0x44/0x57 ubsan_epilogue+0x5/0x40 __ubsan_handle_load_invalid_value+0x66/0x70 ? __cpuhp_setup_state+0x1d3/0x210 ip6_tnl_xmit_ctl.cold.52+0x2c/0x6f [ip6_tunnel] vti6_tnl_xmit+0x79c/0x1e96 [ip6_vti] ? lock_is_held_type+0xd9/0x130 ? vti6_rcv+0x100/0x100 [ip6_vti] ? lock_is_held_type+0xd9/0x130 ? rcu_read_lock_bh_held+0xc0/0xc0 ? lock_acquired+0x262/0xb10 dev_hard_start_xmit+0x1e6/0x820 __dev_queue_xmit+0x2079/0x3340 ? mark_lock.part.52+0xf7/0x1050 ? netdev_core_pick_tx+0x290/0x290 ? kvm_clock_read+0x14/0x30 ? kvm_sched_clock_read+0x5/0x10 ? sched_clock_cpu+0x15/0x200 ? find_held_lock+0x3a/0x1c0 ? lock_release+0x42f/0xc90 ? lock_downgrade+0x6b0/0x6b0 ? mark_held_locks+0xb7/0x120 ? neigh_connected_output+0x31f/0x470 ? lockdep_hardirqs_on+0x79/0x100 ? neigh_connected_output+0x31f/0x470 ? ip6_finish_output2+0x9b0/0x1d90 ? rcu_read_lock_bh_held+0x62/0xc0 ? ip6_finish_output2+0x9b0/0x1d90 ip6_finish_output2+0x9b0/0x1d90 ? ip6_append_data+0x330/0x330 ? ip6_mtu+0x166/0x370 ? __ip6_finish_output+0x1ad/0xfb0 ? nf_hook_slow+0xa6/0x170 ip6_output+0x1fb/0x710 ? nf_hook.constprop.32+0x317/0x430 ? ip6_finish_output+0x180/0x180 ? __ip6_finish_output+0xfb0/0xfb0 ? lock_is_held_type+0xd9/0x130 ndisc_send_skb+0xb33/0x1590 ? __sk_mem_raise_allocated+0x11cf/0x1560 ? dst_output+0x4a0/0x4a0 ? ndisc_send_rs+0x432/0x610 addrconf_dad_completed+0x30c/0xbb0 ? addrconf_rs_timer+0x650/0x650 ? addrconf_dad_work+0x73c/0x10e0 addrconf_dad_work+0x73c/0x10e0 ? addrconf_dad_completed+0xbb0/0xbb0 ? rcu_read_lock_sched_held+0xaf/0xe0 ? rcu_read_lock_bh_held+0xc0/0xc0 process_one_work+0x97b/0x1740 ? pwq_dec_nr_in_flight+0x270/0x270 worker_thread+0x87/0xbf0 ? process_one_work+0x1740/0x1740 kthread+0x3ac/0x490 ? set_kthread_struct+0x100/0x100 ret_from_fork+0x22/0x30 </TASK> =============================================================== The solution is to initialize "__ip6_tnl_parm" struct to zeros in the "vti6_siocdevprivate()" function. Signed-off-by: William Zhao <wizhao@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-24udp: using datalen to cap ipv6 udp max gso segmentsCoco Li1-1/+1
The max number of UDP gso segments is intended to cap to UDP_MAX_SEGMENTS, this is checked in udp_send_skb(). skb->len contains network and transport header len here, we should use only data len instead. This is the ipv6 counterpart to the below referenced commit, which missed the ipv6 change Fixes: 158390e45612 ("udp: using datalen to cap max gso segments") Signed-off-by: Coco Li <lixiaoyan@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20211223222441.2975883-1-lixiaoyan@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-6/+9
include/net/sock.h commit 8f905c0e7354 ("inet: fully convert sk->sk_rx_dst to RCU rules") commit 43f51df41729 ("net: move early demux fields close to sk_refcnt") https://lore.kernel.org/all/20211222141641.0caa0ab3@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-23netfilter: flowtable: remove ipv4/ipv6 modulesFlorian Westphal2-44/+2
Just place the structs and registration in the inet module. nf_flow_table_ipv6, nf_flow_table_ipv4 and nf_flow_table_inet share same module dependencies: nf_flow_table, nf_tables. before: text data bss dec hex filename 2278 1480 0 3758 eae nf_flow_table_inet.ko 1159 1352 0 2511 9cf nf_flow_table_ipv6.ko 1154 1352 0 2506 9ca nf_flow_table_ipv4.ko after: 2369 1672 0 4041 fc9 nf_flow_table_inet.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-21inet: fully convert sk->sk_rx_dst to RCU rulesEric Dumazet2-6/+9
syzbot reported various issues around early demux, one being included in this changelog [1] sk->sk_rx_dst is using RCU protection without clearly documenting it. And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv() are not following standard RCU rules. [a] dst_release(dst); [b] sk->sk_rx_dst = NULL; They look wrong because a delete operation of RCU protected pointer is supposed to clear the pointer before the call_rcu()/synchronize_rcu() guarding actual memory freeing. In some cases indeed, dst could be freed before [b] is done. We could cheat by clearing sk_rx_dst before calling dst_release(), but this seems the right time to stick to standard RCU annotations and debugging facilities. [1] BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline] BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792 Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204 CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247 __kasan_report mm/kasan/report.c:433 [inline] kasan_report.cold+0x83/0xdf mm/kasan/report.c:450 dst_check include/net/dst.h:470 [inline] tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792 ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340 ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583 ip_sublist_rcv net/ipv4/ip_input.c:609 [inline] ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644 __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline] __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556 __netif_receive_skb_list net/core/dev.c:5608 [inline] netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699 gro_normal_list net/core/dev.c:5853 [inline] gro_normal_list net/core/dev.c:5849 [inline] napi_complete_done+0x1f1/0x880 net/core/dev.c:6590 virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline] virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557 __napi_poll+0xaf/0x440 net/core/dev.c:7023 napi_poll net/core/dev.c:7090 [inline] net_rx_action+0x801/0xb40 net/core/dev.c:7177 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637 irq_exit_rcu+0x5/0x20 kernel/softirq.c:649 common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240 asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629 RIP: 0033:0x7f5e972bfd57 Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73 RSP: 002b:00007fff8a413210 EFLAGS: 00000283 RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45 RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45 RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9 R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0 R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019 </TASK> Allocated by task 13: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:434 [inline] __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467 kasan_slab_alloc include/linux/kasan.h:259 [inline] slab_post_alloc_hook mm/slab.h:519 [inline] slab_alloc_node mm/slub.c:3234 [inline] slab_alloc mm/slub.c:3242 [inline] kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247 dst_alloc+0x146/0x1f0 net/core/dst.c:92 rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613 ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340 ip_route_input_rcu net/ipv4/route.c:2470 [inline] ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415 ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354 ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583 ip_sublist_rcv net/ipv4/ip_input.c:609 [inline] ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644 __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline] __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556 __netif_receive_skb_list net/core/dev.c:5608 [inline] netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699 gro_normal_list net/core/dev.c:5853 [inline] gro_normal_list net/core/dev.c:5849 [inline] napi_complete_done+0x1f1/0x880 net/core/dev.c:6590 virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline] virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557 __napi_poll+0xaf/0x440 net/core/dev.c:7023 napi_poll net/core/dev.c:7090 [inline] net_rx_action+0x801/0xb40 net/core/dev.c:7177 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 Freed by task 13: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 kasan_set_track+0x21/0x30 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370 ____kasan_slab_free mm/kasan/common.c:366 [inline] ____kasan_slab_free mm/kasan/common.c:328 [inline] __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374 kasan_slab_free include/linux/kasan.h:235 [inline] slab_free_hook mm/slub.c:1723 [inline] slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749 slab_free mm/slub.c:3513 [inline] kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530 dst_destroy+0x2d6/0x3f0 net/core/dst.c:127 rcu_do_batch kernel/rcu/tree.c:2506 [inline] rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 Last potentially related work creation: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 __kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348 __call_rcu kernel/rcu/tree.c:2985 [inline] call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065 dst_release net/core/dst.c:177 [inline] dst_release+0x79/0xe0 net/core/dst.c:167 tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712 sk_backlog_rcv include/net/sock.h:1030 [inline] __release_sock+0x134/0x3b0 net/core/sock.c:2768 release_sock+0x54/0x1b0 net/core/sock.c:3300 tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441 inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:724 sock_write_iter+0x289/0x3c0 net/socket.c:1057 call_write_iter include/linux/fs.h:2162 [inline] new_sync_write+0x429/0x660 fs/read_write.c:503 vfs_write+0x7cd/0xae0 fs/read_write.c:590 ksys_write+0x1ee/0x250 fs/read_write.c:643 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88807f1cb700 which belongs to the cache ip_dst_cache of size 176 The buggy address is located 58 bytes inside of 176-byte region [ffff88807f1cb700, ffff88807f1cb7b0) The buggy address belongs to the page: page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062 prep_new_page mm/page_alloc.c:2418 [inline] get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369 alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191 alloc_slab_page mm/slub.c:1793 [inline] allocate_slab mm/slub.c:1930 [inline] new_slab+0x32d/0x4a0 mm/slub.c:1993 ___slab_alloc+0x918/0xfe0 mm/slub.c:3022 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109 slab_alloc_node mm/slub.c:3200 [inline] slab_alloc mm/slub.c:3242 [inline] kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247 dst_alloc+0x146/0x1f0 net/core/dst.c:92 rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613 __mkroute_output net/ipv4/route.c:2564 [inline] ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791 ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619 __ip_route_output_key include/net/route.h:126 [inline] ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850 ip_route_output_key include/net/route.h:142 [inline] geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809 geneve_xmit_skb drivers/net/geneve.c:899 [inline] geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082 __netdev_start_xmit include/linux/netdevice.h:4994 [inline] netdev_start_xmit include/linux/netdevice.h:5008 [inline] xmit_one net/core/dev.c:3590 [inline] dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606 __dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229 page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1338 [inline] free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389 free_unref_page_prepare mm/page_alloc.c:3309 [inline] free_unref_page+0x19/0x690 mm/page_alloc.c:3388 qlink_free mm/kasan/quarantine.c:146 [inline] qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165 kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272 __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444 kasan_slab_alloc include/linux/kasan.h:259 [inline] slab_post_alloc_hook mm/slab.h:519 [inline] slab_alloc_node mm/slub.c:3234 [inline] kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270 __alloc_skb+0x215/0x340 net/core/skbuff.c:414 alloc_skb include/linux/skbuff.h:1126 [inline] alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078 sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575 mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754 add_grhead+0x265/0x330 net/ipv6/mcast.c:1857 add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995 mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242 mld_send_initial_cr net/ipv6/mcast.c:1232 [inline] mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298 worker_thread+0x658/0x11f0 kernel/workqueue.c:2445 Memory state around the buggy address: ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc >ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+0
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-17add missing bpf-cgroup.h includesJakub Kicinski1-0/+1
We're about to break the cgroup-defs.h -> bpf-cgroup.h dependency, make sure those who actually need more than the definition of struct cgroup_bpf include bpf-cgroup.h explicitly. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20211216025538.1649516-3-kuba@kernel.org
2021-12-16sit: do not call ipip6_dev_free() from sit_init_net()Eric Dumazet1-1/+0
ipip6_dev_free is sit dev->priv_destructor, already called by register_netdevice() if something goes wrong. Alternative would be to make ipip6_dev_free() robust against multiple invocations, but other drivers do not implement this strategy. syzbot reported: dst_release underflow WARNING: CPU: 0 PID: 5059 at net/core/dst.c:173 dst_release+0xd8/0xe0 net/core/dst.c:173 Modules linked in: CPU: 1 PID: 5059 Comm: syz-executor.4 Not tainted 5.16.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:dst_release+0xd8/0xe0 net/core/dst.c:173 Code: 4c 89 f2 89 d9 31 c0 5b 41 5e 5d e9 da d5 44 f9 e8 1d 90 5f f9 c6 05 87 48 c6 05 01 48 c7 c7 80 44 99 8b 31 c0 e8 e8 67 29 f9 <0f> 0b eb 85 0f 1f 40 00 53 48 89 fb e8 f7 8f 5f f9 48 83 c3 a8 48 RSP: 0018:ffffc9000aa5faa0 EFLAGS: 00010246 RAX: d6894a925dd15a00 RBX: 00000000ffffffff RCX: 0000000000040000 RDX: ffffc90005e19000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: 0000000000000000 R08: ffffffff816a1f42 R09: ffffed1017344f2c R10: ffffed1017344f2c R11: 0000000000000000 R12: 0000607f462b1358 R13: 1ffffffff1bfd305 R14: ffffe8ffffcb1358 R15: dffffc0000000000 FS: 00007f66c71a2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f88aaed5058 CR3: 0000000023e0f000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> dst_cache_destroy+0x107/0x1e0 net/core/dst_cache.c:160 ipip6_dev_free net/ipv6/sit.c:1414 [inline] sit_init_net+0x229/0x550 net/ipv6/sit.c:1936 ops_init+0x313/0x430 net/core/net_namespace.c:140 setup_net+0x35b/0x9d0 net/core/net_namespace.c:326 copy_net_ns+0x359/0x5c0 net/core/net_namespace.c:470 create_new_namespaces+0x4ce/0xa00 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0x11e/0x180 kernel/nsproxy.c:226 ksys_unshare+0x57d/0xb50 kernel/fork.c:3075 __do_sys_unshare kernel/fork.c:3146 [inline] __se_sys_unshare kernel/fork.c:3144 [inline] __x64_sys_unshare+0x34/0x40 kernel/fork.c:3144 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f66c882ce99 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f66c71a2168 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f66c893ff60 RCX: 00007f66c882ce99 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000048040200 RBP: 00007f66c8886ff1 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fff6634832f R14: 00007f66c71a2300 R15: 0000000000022000 </TASK> Fixes: cf124db566e6 ("net: Fix inconsistent teardown and release of private netdev state.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211216111741.1387540-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-16fib: rules: remove duplicated nla policiesFlorian Westphal2-10/+0
The attributes are identical in all implementations so move the ipv4 one into the core and remove the per-family nla policies. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-15ipv6: use GFP_ATOMIC in rt6_probe()Eric Dumazet1-1/+1
syzbot reminded me that rt6_probe() can run from atomic contexts. stack backtrace: CPU: 1 PID: 7461 Comm: syz-executor.2 Not tainted 5.16.0-rc4-next-20211210-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_usage_bug kernel/locking/lockdep.c:203 [inline] valid_state kernel/locking/lockdep.c:3945 [inline] mark_lock_irq kernel/locking/lockdep.c:4148 [inline] mark_lock.cold+0x61/0x8e kernel/locking/lockdep.c:4605 mark_usage kernel/locking/lockdep.c:4500 [inline] __lock_acquire+0x11d5/0x54a0 kernel/locking/lockdep.c:4981 lock_acquire kernel/locking/lockdep.c:5639 [inline] lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604 __fs_reclaim_acquire mm/page_alloc.c:4550 [inline] fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4564 might_alloc include/linux/sched/mm.h:253 [inline] slab_pre_alloc_hook mm/slab.h:739 [inline] slab_alloc_node mm/slub.c:3145 [inline] slab_alloc mm/slub.c:3239 [inline] kmem_cache_alloc_trace+0x3b/0x2c0 mm/slub.c:3256 kmalloc include/linux/slab.h:581 [inline] kzalloc include/linux/slab.h:715 [inline] ref_tracker_alloc+0xe1/0x430 lib/ref_tracker.c:74 netdev_tracker_alloc include/linux/netdevice.h:3860 [inline] dev_hold_track include/linux/netdevice.h:3877 [inline] rt6_probe net/ipv6/route.c:661 [inline] find_match.part.0+0xac9/0xd00 net/ipv6/route.c:752 find_match net/ipv6/route.c:825 [inline] __find_rr_leaf+0x17f/0xd20 net/ipv6/route.c:826 find_rr_leaf net/ipv6/route.c:847 [inline] rt6_select net/ipv6/route.c:891 [inline] fib6_table_lookup+0x649/0xa20 net/ipv6/route.c:2185 ip6_pol_route+0x1c5/0x11e0 net/ipv6/route.c:2221 pol_lookup_func include/net/ip6_fib.h:580 [inline] fib6_rule_lookup+0x52a/0x6f0 net/ipv6/fib6_rules.c:120 ip6_route_output_flags_noref+0x2e2/0x380 net/ipv6/route.c:2629 ip6_route_output_flags+0x72/0x320 net/ipv6/route.c:2642 ip6_route_output include/net/ip6_route.h:98 [inline] ip6_dst_lookup_tail+0x5ab/0x1620 net/ipv6/ip6_output.c:1070 ip6_dst_lookup_flow+0x8c/0x1d0 net/ipv6/ip6_output.c:1200 geneve_get_v6_dst+0x46f/0x9a0 drivers/net/geneve.c:858 geneve6_xmit_skb drivers/net/geneve.c:991 [inline] geneve_xmit+0x520/0x3530 drivers/net/geneve.c:1074 __netdev_start_xmit include/linux/netdevice.h:4685 [inline] netdev_start_xmit include/linux/netdevice.h:4699 [inline] xmit_one net/core/dev.c:3473 [inline] dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3489 __dev_queue_xmit+0x2983/0x3640 net/core/dev.c:4112 neigh_resolve_output net/core/neighbour.c:1522 [inline] neigh_resolve_output+0x50e/0x820 net/core/neighbour.c:1502 neigh_output include/net/neighbour.h:541 [inline] ip6_finish_output2+0x56e/0x14f0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] __ip6_finish_output+0x61e/0xe80 net/ipv6/ip6_output.c:170 ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:451 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508 ndisc_send_rs+0x12e/0x6f0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x3f2/0x820 net/ipv6/addrconf.c:3898 call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421 expire_timers kernel/time/timer.c:1466 [inline] __run_timers.part.0+0x675/0xa20 kernel/time/timer.c:1734 __run_timers kernel/time/timer.c:1715 [inline] run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637 irq_exit_rcu+0x5/0x20 kernel/softirq.c:649 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097 </IRQ> Fixes: fb67510ba9bd ("ipv6: add net device refcount tracker to rt6_probe_deferred()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211214025806.3456382-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+8
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-09xfrm: use net device refcount tracker helpersEric Dumazet1-2/+2
xfrm4_fill_dst() and xfrm6_fill_dst() build dst, getting a device reference that will likely be released by standard dst_release() code. We have to track these references or risk a warning if CONFIG_NET_DEV_REFCNT_TRACKER=y Note to XFRM maintainers : Error path in xfrm6_fill_dst() releases the reference, but does not clear xdst->u.dst.dev, so I wonder if this could lead to double dev_put() in some cases, where a dst_release() _is_ called by the callers in their error path. This extra dev_put() was added in commit 84c4a9dfbf430 ("xfrm6: release dev before returning error") Fixes: 9038c320001d ("net: dst: add net device refcount tracking to dst_entry") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <amwang@redhat.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Link: https://lore.kernel.org/r/20211207193203.2706158-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-09seg6: fix the iif in the IPv6 socket control blockAndrea Mayer1-0/+8
When an IPv4 packet is received, the ip_rcv_core(...) sets the receiving interface index into the IPv4 socket control block (v5.16-rc4, net/ipv4/ip_input.c line 510): IPCB(skb)->iif = skb->skb_iif; If that IPv4 packet is meant to be encapsulated in an outer IPv6+SRH header, the seg6_do_srh_encap(...) performs the required encapsulation. In this case, the seg6_do_srh_encap function clears the IPv6 socket control block (v5.16-rc4 net/ipv6/seg6_iptunnel.c line 163): memset(IP6CB(skb), 0, sizeof(*IP6CB(skb))); The memset(...) was introduced in commit ef489749aae5 ("ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation") a long time ago (2019-01-29). Since the IPv6 socket control block and the IPv4 socket control block share the same memory area (skb->cb), the receiving interface index info is lost (IP6CB(skb)->iif is set to zero). As a side effect, that condition triggers a NULL pointer dereference if commit 0857d6f8c759 ("ipv6: When forwarding count rx stats on the orig netdev") is applied. To fix that issue, we set the IP6CB(skb)->iif with the index of the receiving interface once again. Fixes: ef489749aae5 ("ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation") Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20211208195409.12169-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08inet: add net device refcount tracker to struct fib_nh_commonEric Dumazet1-0/+2
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07ipmr, ip6mr: add net device refcount tracker to struct vif_deviceEric Dumazet1-1/+2
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07ipv6: add net device refcount tracker to struct inet6_devEric Dumazet2-3/+3
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07ipv6: add net device refcount tracker to struct ip6_tnlEric Dumazet3-8/+8
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>