summaryrefslogtreecommitdiff
path: root/drivers/net/ipvlan
AgeCommit message (Collapse)AuthorFilesLines
2024-06-12ipvlan: Dont Use skb->sk in ipvlan_process_v{4,6}_outboundYue Haibing1-2/+2
[ Upstream commit b3dc6e8003b500861fa307e9a3400c52e78e4d3a ] Raw packet from PF_PACKET socket ontop of an IPv6-backed ipvlan device will hit WARN_ON_ONCE() in sk_mc_loop() through sch_direct_xmit() path. WARNING: CPU: 2 PID: 0 at net/core/sock.c:775 sk_mc_loop+0x2d/0x70 Modules linked in: sch_netem ipvlan rfkill cirrus drm_shmem_helper sg drm_kms_helper CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Not tainted 6.9.0+ #279 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:sk_mc_loop+0x2d/0x70 Code: fa 0f 1f 44 00 00 65 0f b7 15 f7 96 a3 4f 31 c0 66 85 d2 75 26 48 85 ff 74 1c RSP: 0018:ffffa9584015cd78 EFLAGS: 00010212 RAX: 0000000000000011 RBX: ffff91e585793e00 RCX: 0000000002c6a001 RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff91e589c0f000 RBP: ffff91e5855bd100 R08: 0000000000000000 R09: 3d00545216f43d00 R10: ffff91e584fdcc50 R11: 00000060dd8616f4 R12: ffff91e58132d000 R13: ffff91e584fdcc68 R14: ffff91e5869ce800 R15: ffff91e589c0f000 FS: 0000000000000000(0000) GS:ffff91e898100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f788f7c44c0 CR3: 0000000008e1a000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> ? __warn (kernel/panic.c:693) ? sk_mc_loop (net/core/sock.c:760) ? report_bug (lib/bug.c:201 lib/bug.c:219) ? handle_bug (arch/x86/kernel/traps.c:239) ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1)) ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) ? sk_mc_loop (net/core/sock.c:760) ip6_finish_output2 (net/ipv6/ip6_output.c:83 (discriminator 1)) ? nf_hook_slow (net/netfilter/core.c:626) ip6_finish_output (net/ipv6/ip6_output.c:222) ? __pfx_ip6_finish_output (net/ipv6/ip6_output.c:215) ipvlan_xmit_mode_l3 (drivers/net/ipvlan/ipvlan_core.c:602) ipvlan ipvlan_start_xmit (drivers/net/ipvlan/ipvlan_main.c:226) ipvlan dev_hard_start_xmit (net/core/dev.c:3594) sch_direct_xmit (net/sched/sch_generic.c:343) __qdisc_run (net/sched/sch_generic.c:416) net_tx_action (net/core/dev.c:5286) handle_softirqs (kernel/softirq.c:555) __irq_exit_rcu (kernel/softirq.c:589) sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1043) The warning triggers as this: packet_sendmsg packet_snd //skb->sk is packet sk __dev_queue_xmit __dev_xmit_skb //q->enqueue is not NULL __qdisc_run sch_direct_xmit dev_hard_start_xmit ipvlan_start_xmit ipvlan_xmit_mode_l3 //l3 mode ipvlan_process_outbound //vepa flag ipvlan_process_v6_outbound ip6_local_out __ip6_finish_output ip6_finish_output2 //multicast packet sk_mc_loop //sk->sk_family is AF_PACKET Call ip{6}_local_out() with NULL sk in ipvlan as other tunnels to fix this. Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240529095633.613103-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28ipvlan: add ipvlan_route_v6_outbound() helperEric Dumazet1-16/+25
[ Upstream commit 18f039428c7df183b09c69ebf10ffd4e521035d2 ] Inspired by syzbot reports using a stack of multiple ipvlan devices. Reduce stack size needed in ipvlan_process_v6_outbound() by moving the flowi6 struct used for the route lookup in an non inlined helper. ipvlan_route_v6_outbound() needs 120 bytes on the stack, immediately reclaimed. Also make sure ipvlan_process_v4_outbound() is not inlined. We might also have to lower MAX_NEST_DEV, because only syzbot uses setups with more than four stacked devices. BUG: TASK stack guard page was hit at ffffc9000e803ff8 (stack is ffffc9000e804000..ffffc9000e808000) stack guard page: 0000 [#1] SMP KASAN CPU: 0 PID: 13442 Comm: syz-executor.4 Not tainted 6.1.52-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023 RIP: 0010:kasan_check_range+0x4/0x2a0 mm/kasan/generic.c:188 Code: 48 01 c6 48 89 c7 e8 db 4e c1 03 31 c0 5d c3 cc 0f 0b eb 02 0f 0b b8 ea ff ff ff 5d c3 cc 00 00 cc cc 00 00 cc cc 55 48 89 e5 <41> 57 41 56 41 55 41 54 53 b0 01 48 85 f6 0f 84 a4 01 00 00 48 89 RSP: 0018:ffffc9000e804000 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff817e5bf2 RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffff887c6568 RBP: ffffc9000e804000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: dffffc0000000001 R12: 1ffff92001d0080c R13: dffffc0000000000 R14: ffffffff87e6b100 R15: 0000000000000000 FS: 00007fd0c55826c0(0000) GS:ffff8881f6800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc9000e803ff8 CR3: 0000000170ef7000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <#DF> </#DF> <TASK> [<ffffffff81f281d1>] __kasan_check_read+0x11/0x20 mm/kasan/shadow.c:31 [<ffffffff817e5bf2>] instrument_atomic_read include/linux/instrumented.h:72 [inline] [<ffffffff817e5bf2>] _test_bit include/asm-generic/bitops/instrumented-non-atomic.h:141 [inline] [<ffffffff817e5bf2>] cpumask_test_cpu include/linux/cpumask.h:506 [inline] [<ffffffff817e5bf2>] cpu_online include/linux/cpumask.h:1092 [inline] [<ffffffff817e5bf2>] trace_lock_acquire include/trace/events/lock.h:24 [inline] [<ffffffff817e5bf2>] lock_acquire+0xe2/0x590 kernel/locking/lockdep.c:5632 [<ffffffff8563221e>] rcu_lock_acquire+0x2e/0x40 include/linux/rcupdate.h:306 [<ffffffff8561464d>] rcu_read_lock include/linux/rcupdate.h:747 [inline] [<ffffffff8561464d>] ip6_pol_route+0x15d/0x1440 net/ipv6/route.c:2221 [<ffffffff85618120>] ip6_pol_route_output+0x50/0x80 net/ipv6/route.c:2606 [<ffffffff856f65b5>] pol_lookup_func include/net/ip6_fib.h:584 [inline] [<ffffffff856f65b5>] fib6_rule_lookup+0x265/0x620 net/ipv6/fib6_rules.c:116 [<ffffffff85618009>] ip6_route_output_flags_noref+0x2d9/0x3a0 net/ipv6/route.c:2638 [<ffffffff8561821a>] ip6_route_output_flags+0xca/0x340 net/ipv6/route.c:2651 [<ffffffff838bd5a3>] ip6_route_output include/net/ip6_route.h:100 [inline] [<ffffffff838bd5a3>] ipvlan_process_v6_outbound drivers/net/ipvlan/ipvlan_core.c:473 [inline] [<ffffffff838bd5a3>] ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:529 [inline] [<ffffffff838bd5a3>] ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline] [<ffffffff838bd5a3>] ipvlan_queue_xmit+0xc33/0x1be0 drivers/net/ipvlan/ipvlan_core.c:677 [<ffffffff838c2909>] ipvlan_start_xmit+0x49/0x100 drivers/net/ipvlan/ipvlan_main.c:229 [<ffffffff84d03900>] netdev_start_xmit include/linux/netdevice.h:4966 [inline] [<ffffffff84d03900>] xmit_one net/core/dev.c:3644 [inline] [<ffffffff84d03900>] dev_hard_start_xmit+0x320/0x980 net/core/dev.c:3660 [<ffffffff84d080e2>] __dev_queue_xmit+0x16b2/0x3370 net/core/dev.c:4324 [<ffffffff855ce4cd>] dev_queue_xmit include/linux/netdevice.h:3067 [inline] [<ffffffff855ce4cd>] neigh_hh_output include/net/neighbour.h:529 [inline] [<ffffffff855ce4cd>] neigh_output include/net/neighbour.h:543 [inline] [<ffffffff855ce4cd>] ip6_finish_output2+0x160d/0x1ae0 net/ipv6/ip6_output.c:139 [<ffffffff855b8616>] __ip6_finish_output net/ipv6/ip6_output.c:200 [inline] [<ffffffff855b8616>] ip6_finish_output+0x6c6/0xb10 net/ipv6/ip6_output.c:211 [<ffffffff855b7e3c>] NF_HOOK_COND include/linux/netfilter.h:298 [inline] [<ffffffff855b7e3c>] ip6_output+0x2bc/0x3d0 net/ipv6/ip6_output.c:232 [<ffffffff8575d27f>] dst_output include/net/dst.h:444 [inline] [<ffffffff8575d27f>] ip6_local_out+0x10f/0x140 net/ipv6/output_core.c:161 [<ffffffff838bdae4>] ipvlan_process_v6_outbound drivers/net/ipvlan/ipvlan_core.c:483 [inline] [<ffffffff838bdae4>] ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:529 [inline] [<ffffffff838bdae4>] ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline] [<ffffffff838bdae4>] ipvlan_queue_xmit+0x1174/0x1be0 drivers/net/ipvlan/ipvlan_core.c:677 [<ffffffff838c2909>] ipvlan_start_xmit+0x49/0x100 drivers/net/ipvlan/ipvlan_main.c:229 [<ffffffff84d03900>] netdev_start_xmit include/linux/netdevice.h:4966 [inline] [<ffffffff84d03900>] xmit_one net/core/dev.c:3644 [inline] [<ffffffff84d03900>] dev_hard_start_xmit+0x320/0x980 net/core/dev.c:3660 [<ffffffff84d080e2>] __dev_queue_xmit+0x16b2/0x3370 net/core/dev.c:4324 [<ffffffff855ce4cd>] dev_queue_xmit include/linux/netdevice.h:3067 [inline] [<ffffffff855ce4cd>] neigh_hh_output include/net/neighbour.h:529 [inline] [<ffffffff855ce4cd>] neigh_output include/net/neighbour.h:543 [inline] [<ffffffff855ce4cd>] ip6_finish_output2+0x160d/0x1ae0 net/ipv6/ip6_output.c:139 [<ffffffff855b8616>] __ip6_finish_output net/ipv6/ip6_output.c:200 [inline] [<ffffffff855b8616>] ip6_finish_output+0x6c6/0xb10 net/ipv6/ip6_output.c:211 [<ffffffff855b7e3c>] NF_HOOK_COND include/linux/netfilter.h:298 [inline] [<ffffffff855b7e3c>] ip6_output+0x2bc/0x3d0 net/ipv6/ip6_output.c:232 [<ffffffff8575d27f>] dst_output include/net/dst.h:444 [inline] [<ffffffff8575d27f>] ip6_local_out+0x10f/0x140 net/ipv6/output_core.c:161 [<ffffffff838bdae4>] ipvlan_process_v6_outbound drivers/net/ipvlan/ipvlan_core.c:483 [inline] [<ffffffff838bdae4>] ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:529 [inline] [<ffffffff838bdae4>] ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline] [<ffffffff838bdae4>] ipvlan_queue_xmit+0x1174/0x1be0 drivers/net/ipvlan/ipvlan_core.c:677 [<ffffffff838c2909>] ipvlan_start_xmit+0x49/0x100 drivers/net/ipvlan/ipvlan_main.c:229 [<ffffffff84d03900>] netdev_start_xmit include/linux/netdevice.h:4966 [inline] [<ffffffff84d03900>] xmit_one net/core/dev.c:3644 [inline] [<ffffffff84d03900>] dev_hard_start_xmit+0x320/0x980 net/core/dev.c:3660 [<ffffffff84d080e2>] __dev_queue_xmit+0x16b2/0x3370 net/core/dev.c:4324 [<ffffffff855ce4cd>] dev_queue_xmit include/linux/netdevice.h:3067 [inline] [<ffffffff855ce4cd>] neigh_hh_output include/net/neighbour.h:529 [inline] [<ffffffff855ce4cd>] neigh_output include/net/neighbour.h:543 [inline] [<ffffffff855ce4cd>] ip6_finish_output2+0x160d/0x1ae0 net/ipv6/ip6_output.c:139 [<ffffffff855b8616>] __ip6_finish_output net/ipv6/ip6_output.c:200 [inline] [<ffffffff855b8616>] ip6_finish_output+0x6c6/0xb10 net/ipv6/ip6_output.c:211 [<ffffffff855b7e3c>] NF_HOOK_COND include/linux/netfilter.h:298 [inline] [<ffffffff855b7e3c>] ip6_output+0x2bc/0x3d0 net/ipv6/ip6_output.c:232 [<ffffffff8575d27f>] dst_output include/net/dst.h:444 [inline] [<ffffffff8575d27f>] ip6_local_out+0x10f/0x140 net/ipv6/output_core.c:161 [<ffffffff838bdae4>] ipvlan_process_v6_outbound drivers/net/ipvlan/ipvlan_core.c:483 [inline] [<ffffffff838bdae4>] ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:529 [inline] [<ffffffff838bdae4>] ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline] [<ffffffff838bdae4>] ipvlan_queue_xmit+0x1174/0x1be0 drivers/net/ipvlan/ipvlan_core.c:677 [<ffffffff838c2909>] ipvlan_start_xmit+0x49/0x100 drivers/net/ipvlan/ipvlan_main.c:229 [<ffffffff84d03900>] netdev_start_xmit include/linux/netdevice.h:4966 [inline] [<ffffffff84d03900>] xmit_one net/core/dev.c:3644 [inline] [<ffffffff84d03900>] dev_hard_start_xmit+0x320/0x980 net/core/dev.c:3660 [<ffffffff84d080e2>] __dev_queue_xmit+0x16b2/0x3370 net/core/dev.c:4324 [<ffffffff855ce4cd>] dev_queue_xmit include/linux/netdevice.h:3067 [inline] [<ffffffff855ce4cd>] neigh_hh_output include/net/neighbour.h:529 [inline] [<ffffffff855ce4cd>] neigh_output include/net/neighbour.h:543 [inline] [<ffffffff855ce4cd>] ip6_finish_output2+0x160d/0x1ae0 net/ipv6/ip6_output.c:139 [<ffffffff855b8616>] __ip6_finish_output net/ipv6/ip6_output.c:200 [inline] [<ffffffff855b8616>] ip6_finish_output+0x6c6/0xb10 net/ipv6/ip6_output.c:211 [<ffffffff855b7e3c>] NF_HOOK_COND include/linux/netfilter.h:298 [inline] [<ffffffff855b7e3c>] ip6_output+0x2bc/0x3d0 net/ipv6/ip6_output.c:232 [<ffffffff8575d27f>] dst_output include/net/dst.h:444 [inline] [<ffffffff8575d27f>] ip6_local_out+0x10f/0x140 net/ipv6/output_core.c:161 [<ffffffff838bdae4>] ipvlan_process_v6_outbound drivers/net/ipvlan/ipvlan_core.c:483 [inline] [<ffffffff838bdae4>] ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:529 [inline] [<ffffffff838bdae4>] ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline] [<ffffffff838bdae4>] ipvlan_queue_xmit+0x1174/0x1be0 drivers/net/ipvlan/ipvlan_core.c:677 [<ffffffff838c2909>] ipvlan_start_xmit+0x49/0x100 drivers/net/ipvlan/ipvlan_main.c:229 [<ffffffff84d03900>] netdev_start_xmit include/linux/netdevice.h:4966 [inline] [<ffffffff84d03900>] xmit_one net/core/dev.c:3644 [inline] [<ffffffff84d03900>] dev_hard_start_xmit+0x320/0x980 net/core/dev.c:3660 [<ffffffff84d080e2>] __dev_queue_xmit+0x16b2/0x3370 net/core/dev.c:4324 [<ffffffff84d4a65e>] dev_queue_xmit include/linux/netdevice.h:3067 [inline] [<ffffffff84d4a65e>] neigh_resolve_output+0x64e/0x750 net/core/neighbour.c:1560 [<ffffffff855ce503>] neigh_output include/net/neighbour.h:545 [inline] [<ffffffff855ce503>] ip6_finish_output2+0x1643/0x1ae0 net/ipv6/ip6_output.c:139 [<ffffffff855b8616>] __ip6_finish_output net/ipv6/ip6_output.c:200 [inline] [<ffffffff855b8616>] ip6_finish_output+0x6c6/0xb10 net/ipv6/ip6_output.c:211 [<ffffffff855b7e3c>] NF_HOOK_COND include/linux/netfilter.h:298 [inline] [<ffffffff855b7e3c>] ip6_output+0x2bc/0x3d0 net/ipv6/ip6_output.c:232 [<ffffffff855b9ce4>] dst_output include/net/dst.h:444 [inline] [<ffffffff855b9ce4>] NF_HOOK include/linux/netfilter.h:309 [inline] [<ffffffff855b9ce4>] ip6_xmit+0x11a4/0x1b20 net/ipv6/ip6_output.c:352 [<ffffffff8597984e>] sctp_v6_xmit+0x9ae/0x1230 net/sctp/ipv6.c:250 [<ffffffff8594623e>] sctp_packet_transmit+0x25de/0x2bc0 net/sctp/output.c:653 [<ffffffff858f5142>] sctp_packet_singleton+0x202/0x310 net/sctp/outqueue.c:783 [<ffffffff858ea411>] sctp_outq_flush_ctrl net/sctp/outqueue.c:914 [inline] [<ffffffff858ea411>] sctp_outq_flush+0x661/0x3d40 net/sctp/outqueue.c:1212 [<ffffffff858f02f9>] sctp_outq_uncork+0x79/0xb0 net/sctp/outqueue.c:764 [<ffffffff8589f060>] sctp_side_effects net/sctp/sm_sideeffect.c:1199 [inline] [<ffffffff8589f060>] sctp_do_sm+0x55c0/0x5c30 net/sctp/sm_sideeffect.c:1170 [<ffffffff85941567>] sctp_primitive_ASSOCIATE+0x97/0xc0 net/sctp/primitive.c:73 [<ffffffff859408b2>] sctp_sendmsg_to_asoc+0xf62/0x17b0 net/sctp/socket.c:1839 [<ffffffff85910b5e>] sctp_sendmsg+0x212e/0x33b0 net/sctp/socket.c:2029 [<ffffffff8544d559>] inet_sendmsg+0x149/0x310 net/ipv4/af_inet.c:849 [<ffffffff84c6c4d2>] sock_sendmsg_nosec net/socket.c:716 [inline] [<ffffffff84c6c4d2>] sock_sendmsg net/socket.c:736 [inline] [<ffffffff84c6c4d2>] ____sys_sendmsg+0x572/0x8c0 net/socket.c:2504 [<ffffffff84c6ca91>] ___sys_sendmsg net/socket.c:2558 [inline] [<ffffffff84c6ca91>] __sys_sendmsg+0x271/0x360 net/socket.c:2587 [<ffffffff84c6cbff>] __do_sys_sendmsg net/socket.c:2596 [inline] [<ffffffff84c6cbff>] __se_sys_sendmsg net/socket.c:2594 [inline] [<ffffffff84c6cbff>] __x64_sys_sendmsg+0x7f/0x90 net/socket.c:2594 [<ffffffff85b32553>] do_syscall_x64 arch/x86/entry/common.c:51 [inline] [<ffffffff85b32553>] do_syscall_64+0x53/0x80 arch/x86/entry/common.c:84 [<ffffffff85c00087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20ipvlan: properly track tx_errorsEric Dumazet2-4/+5
[ Upstream commit ff672b9ffeb3f82135488ac16c5c5eb4b992999b ] Both ipvlan_process_v4_outbound() and ipvlan_process_v6_outbound() increment dev->stats.tx_errors in case of errors. Unfortunately there are two issues : 1) ipvlan_get_stats64() does not propagate dev->stats.tx_errors to user. 2) Increments are not atomic. KCSAN would complain eventually. Use DEV_STATS_INC() to not miss an update, and change ipvlan_get_stats64() to copy the value back to user. Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Link: https://lore.kernel.org/r/20231026131446.3933175-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30ipvlan: Fix a reference count leak warning in ipvlan_ns_exit()Lu Wei1-1/+2
[ Upstream commit 043d5f68d0ccdda91029b4b6dce7eeffdcfad281 ] There are two network devices(veth1 and veth3) in ns1, and ipvlan1 with L3S mode and ipvlan2 with L2 mode are created based on them as figure (1). In this case, ipvlan_register_nf_hook() will be called to register nf hook which is needed by ipvlans in L3S mode in ns1 and value of ipvl_nf_hook_refcnt is set to 1. (1) ns1 ns2 ------------ ------------ veth1--ipvlan1 (L3S) veth3--ipvlan2 (L2) (2) ns1 ns2 ------------ ------------ veth1--ipvlan1 (L3S) ipvlan2 (L2) veth3 | | |------->-------->--------->-------- migrate When veth3 migrates from ns1 to ns2 as figure (2), veth3 will register in ns2 and calls call_netdevice_notifiers with NETDEV_REGISTER event: dev_change_net_namespace call_netdevice_notifiers ipvlan_device_event ipvlan_migrate_l3s_hook ipvlan_register_nf_hook(newnet) (I) ipvlan_unregister_nf_hook(oldnet) (II) In function ipvlan_migrate_l3s_hook(), ipvl_nf_hook_refcnt in ns1 is not 0 since veth1 with ipvlan1 still in ns1, (I) and (II) will be called to register nf_hook in ns2 and unregister nf_hook in ns1. As a result, ipvl_nf_hook_refcnt in ns1 is decreased incorrectly and this in ns2 is increased incorrectly. When the second net namespace is removed, a reference count leak warning in ipvlan_ns_exit() will be triggered. This patch add a check before ipvlan_migrate_l3s_hook() is called. The warning can be triggered as follows: $ ip netns add ns1 $ ip netns add ns2 $ ip netns exec ns1 ip link add veth1 type veth peer name veth2 $ ip netns exec ns1 ip link add veth3 type veth peer name veth4 $ ip netns exec ns1 ip link add ipv1 link veth1 type ipvlan mode l3s $ ip netns exec ns1 ip link add ipv2 link veth3 type ipvlan mode l2 $ ip netns exec ns1 ip link set veth3 netns ns2 $ ip net del ns2 Fixes: 3133822f5ac1 ("ipvlan: use pernet operations and restrict l3s hooks to master netns") Signed-off-by: Lu Wei <luwei32@huawei.com> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230817145449.141827-1-luwei32@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19ipvlan: Fix return value of ipvlan_queue_xmit()Cambda Zhu1-3/+6
[ Upstream commit 8a9922e7be6d042fa00f894c376473b17a162b66 ] ipvlan_queue_xmit() should return NET_XMIT_XXX, but ipvlan_xmit_mode_l2/l3() returns rx_handler_result_t or NET_RX_XXX in some cases. ipvlan_rcv_frame() will only return RX_HANDLER_CONSUMED in ipvlan_xmit_mode_l2/l3() because 'local' is true. It's equal to NET_XMIT_SUCCESS. But dev_forward_skb() can return NET_RX_SUCCESS or NET_RX_DROP, and returning NET_RX_DROP(NET_XMIT_DROP) will increase both ipvlan and ipvlan->phy_dev drops counter. The skb to forward can be treated as xmitted successfully. This patch makes ipvlan_queue_xmit() return NET_XMIT_SUCCESS for forward skb. Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com> Link: https://lore.kernel.org/r/20230626093347.7492-1-cambda@linux.alibaba.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-21ipvlan: fix bound dev checking for IPv6 l3s modeHangbin Liu1-0/+4
[ Upstream commit ce57adc222aba32431c42632b396e9213d0eb0b8 ] The commit 59a0b022aa24 ("ipvlan: Make skb->skb_iif track skb->dev for l3s mode") fixed ipvlan bonded dev checking by updating skb skb_iif. This fix works for IPv4, as in raw_v4_input() the dif is from inet_iif(skb), which is skb->skb_iif when there is no route. But for IPv6, the fix is not enough, because in ipv6_raw_deliver() -> raw_v6_match(), the dif is inet6_iif(skb), which is returns IP6CB(skb)->iif instead of skb->skb_iif if it's not a l3_slave. To fix the IPv6 part issue. Let's set IP6CB(skb)->iif to correct ifindex. BTW, ipvlan handles NS/NA specifically. Since it works fine, I will not reset IP6CB(skb)->iif when addr->atype is IPVL_ICMPV6. Fixes: c675e06a98a4 ("ipvlan: decouple l3s mode dependencies from other modes") Link: https://bugzilla.redhat.com/show_bug.cgi?id=2196710 Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-24ipvlan:Fix out-of-bounds caused by unclear skb->cbt.feng1-0/+6
[ Upstream commit 90cbed5247439a966b645b34eb0a2e037836ea8e ] If skb enqueue the qdisc, fq_skb_cb(skb)->time_to_send is changed which is actually skb->cb, and IPCB(skb_in)->opt will be used in __ip_options_echo. It is possible that memcpy is out of bounds and lead to stack overflow. We should clear skb->cb before ip_local_out or ip6_local_out. v2: 1. clean the stack info 2. use IPCB/IP6CB instead of skb->cb crash on stable-5.10(reproduce in kasan kernel). Stack info: [ 2203.651571] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x589/0x800 [ 2203.653327] Write of size 4 at addr ffff88811a388f27 by task swapper/3/0 [ 2203.655460] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted 5.10.0-60.18.0.50.h856.kasan.eulerosv2r11.x86_64 #1 [ 2203.655466] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-20181220_000000-szxrtosci10000 04/01/2014 [ 2203.655475] Call Trace: [ 2203.655481] <IRQ> [ 2203.655501] dump_stack+0x9c/0xd3 [ 2203.655514] print_address_description.constprop.0+0x19/0x170 [ 2203.655530] __kasan_report.cold+0x6c/0x84 [ 2203.655586] kasan_report+0x3a/0x50 [ 2203.655594] check_memory_region+0xfd/0x1f0 [ 2203.655601] memcpy+0x39/0x60 [ 2203.655608] __ip_options_echo+0x589/0x800 [ 2203.655654] __icmp_send+0x59a/0x960 [ 2203.655755] nf_send_unreach+0x129/0x3d0 [nf_reject_ipv4] [ 2203.655763] reject_tg+0x77/0x1bf [ipt_REJECT] [ 2203.655772] ipt_do_table+0x691/0xa40 [ip_tables] [ 2203.655821] nf_hook_slow+0x69/0x100 [ 2203.655828] __ip_local_out+0x21e/0x2b0 [ 2203.655857] ip_local_out+0x28/0x90 [ 2203.655868] ipvlan_process_v4_outbound+0x21e/0x260 [ipvlan] [ 2203.655931] ipvlan_xmit_mode_l3+0x3bd/0x400 [ipvlan] [ 2203.655967] ipvlan_queue_xmit+0xb3/0x190 [ipvlan] [ 2203.655977] ipvlan_start_xmit+0x2e/0xb0 [ipvlan] [ 2203.655984] xmit_one.constprop.0+0xe1/0x280 [ 2203.655992] dev_hard_start_xmit+0x62/0x100 [ 2203.656000] sch_direct_xmit+0x215/0x640 [ 2203.656028] __qdisc_run+0x153/0x1f0 [ 2203.656069] __dev_queue_xmit+0x77f/0x1030 [ 2203.656173] ip_finish_output2+0x59b/0xc20 [ 2203.656244] __ip_finish_output.part.0+0x318/0x3d0 [ 2203.656312] ip_finish_output+0x168/0x190 [ 2203.656320] ip_output+0x12d/0x220 [ 2203.656357] __ip_queue_xmit+0x392/0x880 [ 2203.656380] __tcp_transmit_skb+0x1088/0x11c0 [ 2203.656436] __tcp_retransmit_skb+0x475/0xa30 [ 2203.656505] tcp_retransmit_skb+0x2d/0x190 [ 2203.656512] tcp_retransmit_timer+0x3af/0x9a0 [ 2203.656519] tcp_write_timer_handler+0x3ba/0x510 [ 2203.656529] tcp_write_timer+0x55/0x180 [ 2203.656542] call_timer_fn+0x3f/0x1d0 [ 2203.656555] expire_timers+0x160/0x200 [ 2203.656562] run_timer_softirq+0x1f4/0x480 [ 2203.656606] __do_softirq+0xfd/0x402 [ 2203.656613] asm_call_irq_on_stack+0x12/0x20 [ 2203.656617] </IRQ> [ 2203.656623] do_softirq_own_stack+0x37/0x50 [ 2203.656631] irq_exit_rcu+0x134/0x1a0 [ 2203.656639] sysvec_apic_timer_interrupt+0x36/0x80 [ 2203.656646] asm_sysvec_apic_timer_interrupt+0x12/0x20 [ 2203.656654] RIP: 0010:default_idle+0x13/0x20 [ 2203.656663] Code: 89 f0 5d 41 5c 41 5d 41 5e c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 0f 1f 44 00 00 0f 00 2d 9f 32 57 00 fb f4 <c3> cc cc cc cc 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 be 08 [ 2203.656668] RSP: 0018:ffff88810036fe78 EFLAGS: 00000256 [ 2203.656676] RAX: ffffffffaf2a87f0 RBX: ffff888100360000 RCX: ffffffffaf290191 [ 2203.656681] RDX: 0000000000098b5e RSI: 0000000000000004 RDI: ffff88811a3c4f60 [ 2203.656686] RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88811a3c4f63 [ 2203.656690] R10: ffffed10234789ec R11: 0000000000000001 R12: 0000000000000003 [ 2203.656695] R13: ffff888100360000 R14: 0000000000000000 R15: 0000000000000000 [ 2203.656729] default_idle_call+0x5a/0x150 [ 2203.656735] cpuidle_idle_call+0x1c6/0x220 [ 2203.656780] do_idle+0xab/0x100 [ 2203.656786] cpu_startup_entry+0x19/0x20 [ 2203.656793] secondary_startup_64_no_verify+0xc2/0xcb [ 2203.657409] The buggy address belongs to the page: [ 2203.658648] page:0000000027a9842f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11a388 [ 2203.658665] flags: 0x17ffffc0001000(reserved|node=0|zone=2|lastcpupid=0x1fffff) [ 2203.658675] raw: 0017ffffc0001000 ffffea000468e208 ffffea000468e208 0000000000000000 [ 2203.658682] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 [ 2203.658686] page dumped because: kasan: bad access detected To reproduce(ipvlan with IPVLAN_MODE_L3): Env setting: ======================================================= modprobe ipvlan ipvlan_default_mode=1 sysctl net.ipv4.conf.eth0.forwarding=1 iptables -t nat -A POSTROUTING -s 20.0.0.0/255.255.255.0 -o eth0 -j MASQUERADE ip link add gw link eth0 type ipvlan ip -4 addr add 20.0.0.254/24 dev gw ip netns add net1 ip link add ipv1 link eth0 type ipvlan ip link set ipv1 netns net1 ip netns exec net1 ip link set ipv1 up ip netns exec net1 ip -4 addr add 20.0.0.4/24 dev ipv1 ip netns exec net1 route add default gw 20.0.0.254 ip netns exec net1 tc qdisc add dev ipv1 root netem loss 10% ifconfig gw up iptables -t filter -A OUTPUT -p tcp --dport 8888 -j REJECT --reject-with icmp-port-unreachable ======================================================= And then excute the shell(curl any address of eth0 can reach): for((i=1;i<=100000;i++)) do ip netns exec net1 curl x.x.x.x:8888 done ======================================================= Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Signed-off-by: "t.feng" <fengtao40@huawei.com> Suggested-by: Florian Westphal <fw@strlen.de> Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-22ipvlan: Make skb->skb_iif track skb->dev for l3s modeJianguo Wu1-0/+1
[ Upstream commit 59a0b022aa249e3f5735d93de0849341722c4754 ] For l3s mode, skb->dev is set to ipvlan interface in ipvlan_nf_input(): skb->dev = addr->master->dev but, skb->skb_iif remain unchanged, this will cause socket lookup failed if a target socket is bound to a interface, like the following example: ip link add ipvlan0 link eth0 type ipvlan mode l3s ip addr add dev ipvlan0 192.168.124.111/24 ip link set ipvlan0 up ping -c 1 -I ipvlan0 8.8.8.8 100% packet loss This is because there is no match sk in __raw_v4_lookup() as sk->sk_bound_dev_if != dif(skb->skb_iif). Fix this by make skb->skb_iif track skb->dev in ipvlan_nf_input(). Fixes: c675e06a98a4 ("ipvlan: decouple l3s mode dependencies from other modes") Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/29865b1f-6db7-c07a-de89-949d3721ea30@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-18ipvlan: hold lower dev to avoid possible use-after-freeMahesh Bandewar2-0/+3
Recently syzkaller discovered the issue of disappearing lower device (NETDEV_UNREGISTER) while the virtual device (like macvlan) is still having it as a lower device. So it's just a matter of time similar discovery will be made for IPvlan device setup. So fixing it preemptively. Also while at it, add a refcount tracker. Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+4
drivers/net/ethernet/freescale/fec.h 7b15515fc1ca ("Revert "fec: Restart PPS after link state change"") 40c79ce13b03 ("net: fec: add stop mode support for imx8 platform") https://lore.kernel.org/all/20220921105337.62b41047@canb.auug.org.au/ drivers/pinctrl/pinctrl-ocelot.c c297561bc98a ("pinctrl: ocelot: Fix interrupt controller") 181f604b33cd ("pinctrl: ocelot: add ability to be used in a non-mmio configuration") https://lore.kernel.org/all/20220921110032.7cd28114@canb.auug.org.au/ tools/testing/selftests/drivers/net/bonding/Makefile bbb774d921e2 ("net: Add tests for bonding and team address list management") 152e8ec77640 ("selftests/bonding: add a test for bonding lladdr target") https://lore.kernel.org/all/20220921110437.5b7dbd82@canb.auug.org.au/ drivers/net/can/usb/gs_usb.c 5440428b3da6 ("can: gs_usb: gs_can_open(): fix race dev->can.state condition") 45dfa45f52e6 ("can: gs_usb: add RX and TX hardware timestamp support") https://lore.kernel.org/all/84f45a7d-92b6-4dc5-d7a1-072152fab6ff@tessares.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-09ipvlan: Fix out-of-bound bugs caused by unset skb->mac_headerLu Wei1-2/+4
If an AF_PACKET socket is used to send packets through ipvlan and the default xmit function of the AF_PACKET socket is changed from dev_queue_xmit() to packet_direct_xmit() via setsockopt() with the option name of PACKET_QDISC_BYPASS, the skb->mac_header may not be reset and remains as the initial value of 65535, this may trigger slab-out-of-bounds bugs as following: ================================================================= UG: KASAN: slab-out-of-bounds in ipvlan_xmit_mode_l2+0xdb/0x330 [ipvlan] PU: 2 PID: 1768 Comm: raw_send Kdump: loaded Not tainted 6.0.0-rc4+ #6 ardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 all Trace: print_address_description.constprop.0+0x1d/0x160 print_report.cold+0x4f/0x112 kasan_report+0xa3/0x130 ipvlan_xmit_mode_l2+0xdb/0x330 [ipvlan] ipvlan_start_xmit+0x29/0xa0 [ipvlan] __dev_direct_xmit+0x2e2/0x380 packet_direct_xmit+0x22/0x60 packet_snd+0x7c9/0xc40 sock_sendmsg+0x9a/0xa0 __sys_sendto+0x18a/0x230 __x64_sys_sendto+0x74/0x90 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd The root cause is: 1. packet_snd() only reset skb->mac_header when sock->type is SOCK_RAW and skb->protocol is not specified as in packet_parse_headers() 2. packet_direct_xmit() doesn't reset skb->mac_header as dev_queue_xmit() In this case, skb->mac_header is 65535 when ipvlan_xmit_mode_l2() is called. So when ipvlan_xmit_mode_l2() gets mac header with eth_hdr() which use "skb->head + skb->mac_header", out-of-bound access occurs. This patch replaces eth_hdr() with skb_eth_hdr() in ipvlan_xmit_mode_l2() and reset mac header in multicast to solve this out-of-bound bug. Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Signed-off-by: Lu Wei <luwei32@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-01net: move from strlcpy with unused retval to strscpyWolfram Sang1-2/+2
Follow the advice of the below link and prefer 'strscpy' in this subsystem. Conversion is 1:1 because the return value is not used. Generated by a coccinelle script. Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/ Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN Link: https://lore.kernel.org/r/20220830201457.7984-1-wsa+renesas@sang-engineering.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23net: ipvtap - add __init/__exit annotations to module init/exit funcsMaciej Żenczykowski1-2/+2
Looks to have been left out in an oversight. Cc: Mahesh Bandewar <maheshb@google.com> Cc: Sainath Grandhi <sainath.grandhi@intel.com> Fixes: 235a9d89da97 ('ipvtap: IP-VLAN based tap driver') Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20220821130808.12143-1-zenczykowski@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-06-10ipvlan: adopt u64_stats_tEric Dumazet3-17/+17
As explained in commit 316580b69d0a ("u64_stats: provide u64_stats_t type") we should use u64_stats_t and related accessors to avoid load/store tearing. Add READ_ONCE() when reading rx_errs & tx_drps. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-06net: add netif_inherit_tso_max()Jakub Kicinski1-4/+2
To make later patches smaller create a helper for inheriting the TSO limitations of a lower device. The TSO in the name is not an accident, subsequent patches will replace GSO with TSO in more names. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-12net: add per-cpu storage and net->core_statsEric Dumazet1-1/+1
Before adding yet another possibly contended atomic_long_t, it is time to add per-cpu storage for existing ones: dev->tx_dropped, dev->rx_dropped, and dev->rx_nohandler Because many devices do not have to increment such counters, allocate the per-cpu storage on demand, so that dev_get_stats() does not have to spend considerable time folding zero counters. Note that some drivers have abused these counters which were supposed to be only used by core networking stack. v4: should use per_cpu_ptr() in dev_get_stats() (Jakub) v3: added a READ_ONCE() in netdev_core_stats_alloc() (Paolo) v2: add a missing include (reported by kernel test robot <lkp@intel.com>) Change in netdev_core_stats_alloc() (Jakub) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: jeffreyji <jeffreyji@google.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220311051420.2608812-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-02ipvlan: Remove redundant if statementsXu Wang2-4/+2
The 'if (dev)' statement already move into dev_{put , hold}, so remove redundant if statements. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-22net: annotate accesses to dev->gso_max_segsEric Dumazet1-2/+2
dev->gso_max_segs is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add netif_set_gso_max_segs() helper. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs() where we can to better document what is going on. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-22net: annotate accesses to dev->gso_max_sizeEric Dumazet1-2/+2
dev->gso_max_size is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size() where we can to better document what is going on. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net: ipvtap: fix template string argument of device_create() callJean Sacren1-1/+1
The last argument of device_create() call should be a template string. The tap_name variable should be the argument to the string, but not the argument of the call itself. We should add the template string and turn tap_name into its argument. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-02net: use eth_hw_addr_set() instead of ether_addr_copy()Jakub Kicinski1-1/+1
Convert from ether_addr_copy() to eth_hw_addr_set(): @@ expression dev, np; @@ - ether_addr_copy(dev->dev_addr, np) + eth_hw_addr_set(dev, np) Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-02net: use eth_hw_addr_set()Jakub Kicinski1-1/+1
Convert sw drivers from memcpy(... ETH_ADDR) to eth_hw_addr_set(): @@ expression dev, np; @@ - memcpy(dev->dev_addr, np, ETH_ALEN) + eth_hw_addr_set(dev, np) Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-30ipvlan: Add handling of NETDEV_UP eventsDi Zhu1-0/+1
When an ipvlan device is created on a bond device, the link state of the ipvlan device may be abnormal. This is because bonding device allows to add physical network card device in the down state and so NETDEV_CHANGE event will not be notified to other listeners, so ipvlan has no chance to update its link status. The following steps can cause such problems: 1) bond0 is down 2) ip link add link bond0 name ipvlan type ipvlan mode l2 3) echo +enp2s7 >/sys/class/net/bond0/bonding/slaves 4) ip link set bond0 up After these steps, use ip link command, we found ipvlan has NO-CARRIER: ipvlan@bond0: <NO-CARRIER, BROADCAST,MULTICAST,UP,M-DOWN> mtu ...> We can deal with this problem like VLAN: Add handling of NETDEV_UP events. If we receive NETDEV_UP event, we will update the link status of the ipvlan. Signed-off-by: Di Zhu <zhudi21@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-29ipvlan: remove h from printk format specifierTom Rix1-4/+2
This change fixes the checkpatch warning described in this commit commit cbacb5ab0aa0 ("docs: printk-formats: Stop encouraging use of unnecessary %h[xudi] and %hh[xudi]") Standard integer promotion is already done and %hx and %hhx is useless so do not encourage the use of %hh[xudi] or %h[xudi]. Cleanup output to use __func__ over explicit function strings. Signed-off-by: Tom Rix <trix@redhat.com> Link: https://lore.kernel.org/r/20210124190804.1964580-1-trix@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24net: don't include ethtool.h from netdevice.hJakub Kicinski1-0/+2
linux/netdevice.h is included in very many places, touching any of its dependecies causes large incremental builds. Drop the linux/ethtool.h include, linux/netdevice.h just needs a forward declaration of struct ethtool_ops. Fix all the places which made use of this implicit include. Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Shannon Nelson <snelson@pensando.io> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Link: https://lore.kernel.org/r/20201120225052.1427503-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-08-25ipvlan: advertise link netns via netlinkTaehee Yoo1-0/+8
Assign rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is added to rtnetlink messages. Test commands: ip netns add nst ip link add dummy0 type dummy ip link add ipvlan0 link dummy0 type ipvlan ip link set ipvlan0 netns nst ip netns exec nst ip link show ipvlan0 Result: ---Before--- 6: ipvlan0@if5: <BROADCAST,MULTICAST> ... link/ether 82:3a:78:ab:60:50 brd ff:ff:ff:ff:ff:ff ---After--- 12: ipvlan0@if11: <BROADCAST,MULTICAST> ... link/ether 42:b1:ad:57:4e:27 brd ff:ff:ff:ff:ff:ff link-netnsid 0 ~~~~~~~~~~~~~~ Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-17ipvlan: fix device featuresMahesh Bandewar1-5/+22
Processing NETDEV_FEAT_CHANGE causes IPvlan links to lose NETIF_F_LLTX feature because of the incorrect handling of features in ipvlan_fix_features(). --before-- lpaa10:~# ethtool -k ipvl0 | grep tx-lockless tx-lockless: on [fixed] lpaa10:~# ethtool -K ipvl0 tso off Cannot change tcp-segmentation-offload Actual changes: vlan-challenged: off [fixed] tx-lockless: off [fixed] lpaa10:~# ethtool -k ipvl0 | grep tx-lockless tx-lockless: off [fixed] lpaa10:~# --after-- lpaa10:~# ethtool -k ipvl0 | grep tx-lockless tx-lockless: on [fixed] lpaa10:~# ethtool -K ipvl0 tso off Cannot change tcp-segmentation-offload Could not change any device features lpaa10:~# ethtool -k ipvl0 | grep tx-lockless tx-lockless: on [fixed] lpaa10:~# Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Signed-off-by: Mahesh Bandewar <maheshb@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net: partially revert dynamic lockdep key changesCong Wang1-0/+2
This patch reverts the folowing commits: commit 064ff66e2bef84f1153087612032b5b9eab005bd "bonding: add missing netdev_update_lockdep_key()" commit 53d374979ef147ab51f5d632dfe20b14aebeccd0 "net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()" commit 1f26c0d3d24125992ab0026b0dab16c08df947c7 "net: fix kernel-doc warning in <linux/netdevice.h>" commit ab92d68fc22f9afab480153bd82a20f6e2533769 "net: core: add generic lockdep keys" but keeps the addr_list_lock_key because we still lock addr_list_lock nestedly on stack devices, unlikely xmit_lock this is safe because we don't take addr_list_lock on any fast path. Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10ipvlan: do not use cond_resched_rcu() in ipvlan_process_multicast()Eric Dumazet1-1/+1
Commit e18b353f102e ("ipvlan: add cond_resched_rcu() while processing muticast backlog") added a cond_resched_rcu() in a loop using rcu protection to iterate over slaves. This is breaking rcu rules, so lets instead use cond_resched() at a point we can reschedule Fixes: e18b353f102e ("ipvlan: add cond_resched_rcu() while processing muticast backlog") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10ipvlan: add cond_resched_rcu() while processing muticast backlogMahesh Bandewar1-0/+1
If there are substantial number of slaves created as simulated by Syzbot, the backlog processing could take much longer and result into the issue found in the Syzbot report. INFO: rcu_sched detected stalls on CPUs/tasks: (detected by 1, t=10502 jiffies, g=5049, c=5048, q=752) All QSes seen, last rcu_sched kthread activity 10502 (4294965563-4294955061), jiffies_till_next_fqs=1, root ->qsmask 0x0 syz-executor.1 R running task on cpu 1 10984 11210 3866 0x30020008 179034491270 Call Trace: <IRQ> [<ffffffff81497163>] _sched_show_task kernel/sched/core.c:8063 [inline] [<ffffffff81497163>] _sched_show_task.cold+0x2fd/0x392 kernel/sched/core.c:8030 [<ffffffff8146a91b>] sched_show_task+0xb/0x10 kernel/sched/core.c:8073 [<ffffffff815c931b>] print_other_cpu_stall kernel/rcu/tree.c:1577 [inline] [<ffffffff815c931b>] check_cpu_stall kernel/rcu/tree.c:1695 [inline] [<ffffffff815c931b>] __rcu_pending kernel/rcu/tree.c:3478 [inline] [<ffffffff815c931b>] rcu_pending kernel/rcu/tree.c:3540 [inline] [<ffffffff815c931b>] rcu_check_callbacks.cold+0xbb4/0xc29 kernel/rcu/tree.c:2876 [<ffffffff815e3962>] update_process_times+0x32/0x80 kernel/time/timer.c:1635 [<ffffffff816164f0>] tick_sched_handle+0xa0/0x180 kernel/time/tick-sched.c:161 [<ffffffff81616ae4>] tick_sched_timer+0x44/0x130 kernel/time/tick-sched.c:1193 [<ffffffff815e75f7>] __run_hrtimer kernel/time/hrtimer.c:1393 [inline] [<ffffffff815e75f7>] __hrtimer_run_queues+0x307/0xd90 kernel/time/hrtimer.c:1455 [<ffffffff815e90ea>] hrtimer_interrupt+0x2ea/0x730 kernel/time/hrtimer.c:1513 [<ffffffff844050f4>] local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1031 [inline] [<ffffffff844050f4>] smp_apic_timer_interrupt+0x144/0x5e0 arch/x86/kernel/apic/apic.c:1056 [<ffffffff84401cbe>] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778 RIP: 0010:do_raw_read_lock+0x22/0x80 kernel/locking/spinlock_debug.c:153 RSP: 0018:ffff8801dad07ab8 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff12 RAX: 0000000000000000 RBX: ffff8801c4135680 RCX: 0000000000000000 RDX: 1ffff10038826afe RSI: ffff88019d816bb8 RDI: ffff8801c41357f0 RBP: ffff8801dad07ac0 R08: 0000000000004b15 R09: 0000000000310273 R10: ffff88019d816bb8 R11: 0000000000000001 R12: ffff8801c41357e8 R13: 0000000000000000 R14: ffff8801cfb19850 R15: ffff8801cfb198b0 [<ffffffff8101460e>] __raw_read_lock_bh include/linux/rwlock_api_smp.h:177 [inline] [<ffffffff8101460e>] _raw_read_lock_bh+0x3e/0x50 kernel/locking/spinlock.c:240 [<ffffffff840d78ca>] ipv6_chk_mcast_addr+0x11a/0x6f0 net/ipv6/mcast.c:1006 [<ffffffff84023439>] ip6_mc_input+0x319/0x8e0 net/ipv6/ip6_input.c:482 [<ffffffff840211c8>] dst_input include/net/dst.h:449 [inline] [<ffffffff840211c8>] ip6_rcv_finish+0x408/0x610 net/ipv6/ip6_input.c:78 [<ffffffff840214de>] NF_HOOK include/linux/netfilter.h:292 [inline] [<ffffffff840214de>] NF_HOOK include/linux/netfilter.h:286 [inline] [<ffffffff840214de>] ipv6_rcv+0x10e/0x420 net/ipv6/ip6_input.c:278 [<ffffffff83a29efa>] __netif_receive_skb_one_core+0x12a/0x1f0 net/core/dev.c:5303 [<ffffffff83a2a15c>] __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:5417 [<ffffffff83a2f536>] process_backlog+0x216/0x6c0 net/core/dev.c:6243 [<ffffffff83a30d1b>] napi_poll net/core/dev.c:6680 [inline] [<ffffffff83a30d1b>] net_rx_action+0x47b/0xfb0 net/core/dev.c:6748 [<ffffffff846002c8>] __do_softirq+0x2c8/0x99a kernel/softirq.c:317 [<ffffffff813e656a>] invoke_softirq kernel/softirq.c:399 [inline] [<ffffffff813e656a>] irq_exit+0x16a/0x1a0 kernel/softirq.c:439 [<ffffffff84405115>] exiting_irq arch/x86/include/asm/apic.h:561 [inline] [<ffffffff84405115>] smp_apic_timer_interrupt+0x165/0x5e0 arch/x86/kernel/apic/apic.c:1058 [<ffffffff84401cbe>] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778 </IRQ> RIP: 0010:__sanitizer_cov_trace_pc+0x26/0x50 kernel/kcov.c:102 RSP: 0018:ffff880196033bd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12 RAX: ffff88019d8161c0 RBX: 00000000ffffffff RCX: ffffc90003501000 RDX: 0000000000000002 RSI: ffffffff816236d1 RDI: 0000000000000005 RBP: ffff880196033bd8 R08: ffff88019d8161c0 R09: 0000000000000000 R10: 1ffff10032c067f0 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000000 [<ffffffff816236d1>] do_futex+0x151/0x1d50 kernel/futex.c:3548 [<ffffffff816260f0>] C_SYSC_futex kernel/futex_compat.c:201 [inline] [<ffffffff816260f0>] compat_SyS_futex+0x270/0x3b0 kernel/futex_compat.c:175 [<ffffffff8101da17>] do_syscall_32_irqs_on arch/x86/entry/common.c:353 [inline] [<ffffffff8101da17>] do_fast_syscall_32+0x357/0xe1c arch/x86/entry/common.c:415 [<ffffffff84401a9b>] entry_SYSENTER_compat+0x8b/0x9d arch/x86/entry/entry_64_compat.S:139 RIP: 0023:0xf7f23c69 RSP: 002b:00000000f5d1f12c EFLAGS: 00000282 ORIG_RAX: 00000000000000f0 RAX: ffffffffffffffda RBX: 000000000816af88 RCX: 0000000000000080 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000816af8c RBP: 00000000f5d1f228 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 rcu_sched kthread starved for 10502 jiffies! g5049 c5048 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=1 rcu_sched R running task on cpu 1 13048 8 2 0x90000000 179099587640 Call Trace: [<ffffffff8147321f>] context_switch+0x60f/0xa60 kernel/sched/core.c:3209 [<ffffffff8100095a>] __schedule+0x5aa/0x1da0 kernel/sched/core.c:3934 [<ffffffff810021df>] schedule+0x8f/0x1b0 kernel/sched/core.c:4011 [<ffffffff8101116d>] schedule_timeout+0x50d/0xee0 kernel/time/timer.c:1803 [<ffffffff815c13f1>] rcu_gp_kthread+0xda1/0x3b50 kernel/rcu/tree.c:2327 [<ffffffff8144b318>] kthread+0x348/0x420 kernel/kthread.c:246 [<ffffffff84400266>] ret_from_fork+0x56/0x70 arch/x86/entry/entry_64.S:393 Fixes: ba35f8588f47 (“ipvlan: Defer multicast / broadcast processing to a work-queue”) Signed-off-by: Mahesh Bandewar <maheshb@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10ipvlan: don't deref eth hdr before checking it's setMahesh Bandewar1-8/+10
IPvlan in L3 mode discards outbound multicast packets but performs the check before ensuring the ether-header is set or not. This is an error that Eric found through code browsing. Fixes: 2ad7bf363841 (“ipvlan: Initial check-in of the IPVLAN driver.”) Signed-off-by: Mahesh Bandewar <maheshb@google.com> Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-09ipvlan: do not add hardware address of master to its unicast filter listJiri Wiesner1-4/+1
There is a problem when ipvlan slaves are created on a master device that is a vmxnet3 device (ipvlan in VMware guests). The vmxnet3 driver does not support unicast address filtering. When an ipvlan device is brought up in ipvlan_open(), the ipvlan driver calls dev_uc_add() to add the hardware address of the vmxnet3 master device to the unicast address list of the master device, phy_dev->uc. This inevitably leads to the vmxnet3 master device being forced into promiscuous mode by __dev_set_rx_mode(). Promiscuous mode is switched on the master despite the fact that there is still only one hardware address that the master device should use for filtering in order for the ipvlan device to be able to receive packets. The comment above struct net_device describes the uc_promisc member as a "counter, that indicates, that promiscuous mode has been enabled due to the need to listen to additional unicast addresses in a device that does not implement ndo_set_rx_mode()". Moreover, the design of ipvlan guarantees that only the hardware address of a master device, phy_dev->dev_addr, will be used to transmit and receive all packets from its ipvlan slaves. Thus, the unicast address list of the master device should not be modified by ipvlan_open() and ipvlan_stop() in order to make ipvlan a workable option on masters that do not support unicast address filtering. Fixes: 2ad7bf3638411 ("ipvlan: Initial check-in of the IPVLAN driver") Reported-by: Per Sundstrom <per.sundstrom@redqube.se> Signed-off-by: Jiri Wiesner <jwiesner@suse.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-2/+0
The only slightly tricky merge conflict was the netdevsim because the mutex locking fix overlapped a lot of driver reload reorganization. The rest were (relatively) trivial in nature. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-25net: core: add generic lockdep keysTaehee Yoo1-2/+0
Some interface types could be nested. (VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc..) These interface types should set lockdep class because, without lockdep class key, lockdep always warn about unexisting circular locking. In the current code, these interfaces have their own lockdep class keys and these manage itself. So that there are so many duplicate code around the /driver/net and /net/. This patch adds new generic lockdep keys and some helper functions for it. This patch does below changes. a) Add lockdep class keys in struct net_device - qdisc_running, xmit, addr_list, qdisc_busylock - these keys are used as dynamic lockdep key. b) When net_device is being allocated, lockdep keys are registered. - alloc_netdev_mqs() c) When net_device is being free'd llockdep keys are unregistered. - free_netdev() d) Add generic lockdep key helper function - netdev_register_lockdep_key() - netdev_unregister_lockdep_key() - netdev_update_lockdep_key() e) Remove unnecessary generic lockdep macro and functions f) Remove unnecessary lockdep code of each interfaces. After this patch, each interface modules don't need to maintain their lockdep keys. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-11ipvlan: consolidate TSO flags using NETIF_F_ALL_TSOMahesh Bandewar1-2/+2
This will ensure that any new TSO related flags added (which would be part of ALL_TSO mask and IPvlan driver doesn't need to update every time new flag gets added. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-17ipvlan: set hw_enc_features like macvlanBill Sommerfeld1-0/+1
Allow encapsulated packets sent to tunnels layered over ipvlan to use offloads rather than forcing SW fallbacks. Since commit f21e5077010acda73a60 ("macvlan: add offload features for encapsulation"), macvlan has set dev->hw_enc_features to include everything in dev->features; do likewise in ipvlan. Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-1/+1
Pull networking fixes from David Miller: 1) Free AF_PACKET po->rollover properly, from Willem de Bruijn. 2) Read SFP eeprom in max 16 byte increments to avoid problems with some SFP modules, from Russell King. 3) Fix UDP socket lookup wrt. VRF, from Tim Beale. 4) Handle route invalidation properly in s390 qeth driver, from Julian Wiedmann. 5) Memory leak on unload in RDS, from Zhu Yanjun. 6) sctp_process_init leak, from Neil HOrman. 7) Fix fib_rules rule insertion semantic change that broke Android, from Hangbin Liu. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits) pktgen: do not sleep with the thread lock held. net: mvpp2: Use strscpy to handle stat strings net: rds: fix memory leak in rds_ib_flush_mr_pool ipv6: fix EFAULT on sendto with icmpv6 and hdrincl ipv6: use READ_ONCE() for inet->hdrincl as in ipv4 Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied" net: aquantia: fix wol configuration not applied sometimes ethtool: fix potential userspace buffer overflow Fix memory leak in sctp_process_init net: rds: fix memory leak when unload rds_rdma ipv6: fix the check before getting the cookie in rt6_get_cookie ipv4: not do cache for local delivery if bc_forwarding is enabled s390/qeth: handle error when updating TX queue count s390/qeth: fix VLAN attribute in bridge_hostnotify udev event s390/qeth: check dst entry before use s390/qeth: handle limited IPv4 broadcast in L3 TX path net: fix indirect calls helpers for ptype list hooks. net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set udp: only choose unbound UDP socket for multicast when not in a VRF net/tls: replace the sleeping lock around RX resync with a bit lock ...
2019-06-05net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is setMiaohe Lin1-1/+1
There's some NICs, such as hinic, with NETIF_F_IP_CSUM and NETIF_F_TSO on but NETIF_F_HW_CSUM off. And ipvlan device features will be NETIF_F_TSO on with NETIF_F_IP_CSUM and NETIF_F_IP_CSUM both off as IPVLAN_FEATURES only care about NETIF_F_HW_CSUM. So TSO will be disabled in netdev_fix_features. For example: Features for enp129s0f0: rx-checksumming: on tx-checksumming: on tx-checksum-ipv4: on tx-checksum-ip-generic: off [fixed] tx-checksum-ipv6: on Fixes: a188222b6ed2 ("net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner4-23/+4
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21treewide: Add SPDX license identifier - Makefile/KconfigThomas Gleixner1-0/+1
Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21treewide: Add SPDX license identifier for more missed filesThomas Gleixner1-0/+1
Add SPDX license identifiers to all files which: - Have no license information of any form - Have MODULE_LICENCE("GPL*") inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+4
Three conflicts, one of which, for marvell10g.c is non-trivial and requires some follow-up from Heiner or someone else. The issue is that Heiner converted the marvell10g driver over to use the generic c45 code as much as possible. However, in 'net' a bug fix appeared which makes sure that a new local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0 is cleared. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-22ipvlan: disallow userns cap_net_admin to change global mode/flagsDaniel Borkmann1-0/+4
When running Docker with userns isolation e.g. --userns-remap="default" and spawning up some containers with CAP_NET_ADMIN under this realm, I noticed that link changes on ipvlan slave device inside that container can affect all devices from this ipvlan group which are in other net namespaces where the container should have no permission to make changes to, such as the init netns, for example. This effectively allows to undo ipvlan private mode and switch globally to bridge mode where slaves can communicate directly without going through hostns, or it allows to switch between global operation mode (l2/l3/l3s) for everyone bound to the given ipvlan master device. libnetwork plugin here is creating an ipvlan master and ipvlan slave in hostns and a slave each that is moved into the container's netns upon creation event. * In hostns: # ip -d a [...] 8: cilium_host@bond0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 ipvlan mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 inet 10.41.0.1/32 scope link cilium_host valid_lft forever preferred_lft forever [...] * Spawn container & change ipvlan mode setting inside of it: # docker run -dt --cap-add=NET_ADMIN --network cilium-net --name client -l app=test cilium/netperf 9fff485d69dcb5ce37c9e33ca20a11ccafc236d690105aadbfb77e4f4170879c # docker exec -ti client ip -d a [...] 10: cilium0@if4: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 ipvlan mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0 valid_lft forever preferred_lft forever # docker exec -ti client ip link change link cilium0 name cilium0 type ipvlan mode l2 # docker exec -ti client ip -d a [...] 10: cilium0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0 valid_lft forever preferred_lft forever * In hostns (mode switched to l2): # ip -d a [...] 8: cilium_host@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 inet 10.41.0.1/32 scope link cilium_host valid_lft forever preferred_lft forever [...] Same l3 -> l2 switch would also happen by creating another slave inside the container's network namespace when specifying the existing cilium0 link to derive the actual (bond0) master: # docker exec -ti client ip link add link cilium0 name cilium1 type ipvlan mode l2 # docker exec -ti client ip -d a [...] 2: cilium1@if4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 10: cilium0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0 valid_lft forever preferred_lft forever * In hostns: # ip -d a [...] 8: cilium_host@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 inet 10.41.0.1/32 scope link cilium_host valid_lft forever preferred_lft forever [...] One way to mitigate it is to check CAP_NET_ADMIN permissions of the ipvlan master device's ns, and only then allow to change mode or flags for all devices bound to it. Above two cases are then disallowed after the patch. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-2/+2
An ipvlan bug fix in 'net' conflicted with the abstraction away of the IPV6 specific support in 'net-next'. Similarly, a bug fix for mlx5 in 'net' conflicted with the flow action conversion in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08ipvlan: decouple l3s mode dependencies from other modesDaniel Borkmann5-206/+283
Right now ipvlan has a hard dependency on CONFIG_NETFILTER and otherwise it cannot be built. However, the only ipvlan operation mode that actually depends on netfilter is l3s, everything else is independent of it. Break this hard dependency such that users are able to use ipvlan l3 mode on systems where netfilter is not compiled in. Therefore, this adds a hidden CONFIG_IPVLAN_L3S bool which is defaulting to y when CONFIG_NETFILTER is set in order to retain existing behavior for l3s. All l3s related code is refactored into ipvlan_l3s.c that is compiled in when enabled. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Martynas Pumputis <m@lambda.lt> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-31ipvlan, l3mdev: fix broken l3s mode wrt local routesDaniel Borkmann1-3/+3
While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin, I ran into the issue that while l3 mode is working fine, l3s mode does not have any connectivity to kube-apiserver and hence all pods end up in Error state as well. The ipvlan master device sits on top of a bond device and hostns traffic to kube-apiserver (also running in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573 where the latter is the address of the bond0. While in l3 mode, a curl to https://10.152.183.1:443 or to https://139.178.29.207:37573 works fine from hostns, neither of them do in case of l3s. In the latter only a curl to https://127.0.0.1:37573 appeared to work where for local addresses of bond0 I saw kernel suddenly starting to emit ARP requests to query HW address of bond0 which remained unanswered and neighbor entries in INCOMPLETE state. These ARP requests only happen while in l3s. Debugging this further, I found the issue is that l3s mode is piggy- backing on l3 master device, and in this case local routes are using l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback"). I found that reverting them back into using the net->loopback_dev fixed ipvlan l3s connectivity and got everything working for the CNI. Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the l3mdev paper in [0] the only sole reason why ipvlan l3s is relying on l3 master device is to get the l3mdev_ip_rcv() receive hook for setting the dst entry of the input route without adding its own ipvlan specific hacks into the receive path, however, any l3 domain semantics beyond just that are breaking l3s operation. Note that ipvlan also has the ability to dynamically switch its internal operation from l3 to l3s for all ports via ipvlan_set_port_mode() at runtime. In any case, l3 vs l3s soley distinguishes itself by 'de-confusing' netfilter through switching skb->dev to ipvlan slave device late in NF_INET_LOCAL_IN before handing the skb to L4. Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which, if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook without any additional l3mdev semantics on top. This should also have minimal impact since dev->priv_flags is already hot in cache. With this set, l3s mode is working fine and I also get things like masquerading pod traffic on the ipvlan master properly working. [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback") Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Mahesh Bandewar <maheshb@google.com> Cc: David Ahern <dsa@cumulusnetworks.com> Cc: Florian Westphal <fw@strlen.de> Cc: Martynas Pumputis <m@lambda.lt> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-14net: ipvlan: Issue NETDEV_PRE_CHANGEADDRPetr Machata1-0/+14
A NETDEV_CHANGEADDR event implies a change of address of each of the IPVLANs of this IPVLAN device. Therefore propagate NETDEV_PRE_CHANGEADDR to all the IPVLANs. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-10ipvlan: Remove a useless comparisonYueHaibing1-1/+1
Fix following gcc warning: drivers/net/ipvlan/ipvlan_main.c:543:12: warning: comparison is always false due to limited range of data type [-Wtype-limits] 'mode' is a u16 variable, IPVLAN_MODE_L2 is zero, the comparison is always false Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07net: core: dev: Add extack argument to dev_change_flags()Petr Machata1-4/+8
In order to pass extack together with NETDEV_PRE_UP notifications, it's necessary to route the extack to __dev_open() from diverse (possibly indirect) callers. One prominent API through which the notification is invoked is dev_change_flags(). Therefore extend dev_change_flags() with and extra extack argument and update all users. Most of the calls end up just encoding NULL, but several sites (VLAN, ipvlan, VRF, rtnetlink) do have extack available. Since the function declaration line is changed anyway, name the other function arguments to placate checkpatch. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07net: ipvlan: ipvlan_set_port_mode(): Add an extack argumentPetr Machata1-3/+4
A follow-up patch will extend dev_change_flags() with an extack argument. Extend ipvlan_set_port_mode() to have that argument available for the conversion. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>