summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-07-15s390/bpf: Perform r1 range checking before accessing jit->seen_reg[r1]Colin Ian King1-1/+1
Currently array jit->seen_reg[r1] is being accessed before the range checking of index r1. The range changing on r1 should be performed first since it will avoid any potential out-of-range accesses on the array seen_reg[] and also it is more optimal to perform checks on r1 before fetching data from the array. Fix this by swapping the order of the checks before the array access. Fixes: 054623105728 ("s390/bpf: Add s390x eBPF JIT compiler backend") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/bpf/20210715125712.24690-1-colin.king@canonical.com
2021-07-13xdp, net: Fix use-after-free in bpf_xdp_link_releaseXuan Zhuo1-4/+10
The problem occurs between dev_get_by_index() and dev_xdp_attach_link(). At this point, dev_xdp_uninstall() is called. Then xdp link will not be detached automatically when dev is released. But link->dev already points to dev, when xdp link is released, dev will still be accessed, but dev has been released. dev_get_by_index() | link->dev = dev | | rtnl_lock() | unregister_netdevice_many() | dev_xdp_uninstall() | rtnl_unlock() rtnl_lock(); | dev_xdp_attach_link() | rtnl_unlock(); | | netdev_run_todo() // dev released bpf_xdp_link_release() | /* access dev. | use-after-free */ | [ 45.966867] BUG: KASAN: use-after-free in bpf_xdp_link_release+0x3b8/0x3d0 [ 45.967619] Read of size 8 at addr ffff00000f9980c8 by task a.out/732 [ 45.968297] [ 45.968502] CPU: 1 PID: 732 Comm: a.out Not tainted 5.13.0+ #22 [ 45.969222] Hardware name: linux,dummy-virt (DT) [ 45.969795] Call trace: [ 45.970106] dump_backtrace+0x0/0x4c8 [ 45.970564] show_stack+0x30/0x40 [ 45.970981] dump_stack_lvl+0x120/0x18c [ 45.971470] print_address_description.constprop.0+0x74/0x30c [ 45.972182] kasan_report+0x1e8/0x200 [ 45.972659] __asan_report_load8_noabort+0x2c/0x50 [ 45.973273] bpf_xdp_link_release+0x3b8/0x3d0 [ 45.973834] bpf_link_free+0xd0/0x188 [ 45.974315] bpf_link_put+0x1d0/0x218 [ 45.974790] bpf_link_release+0x3c/0x58 [ 45.975291] __fput+0x20c/0x7e8 [ 45.975706] ____fput+0x24/0x30 [ 45.976117] task_work_run+0x104/0x258 [ 45.976609] do_notify_resume+0x894/0xaf8 [ 45.977121] work_pending+0xc/0x328 [ 45.977575] [ 45.977775] The buggy address belongs to the page: [ 45.978369] page:fffffc00003e6600 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4f998 [ 45.979522] flags: 0x7fffe0000000000(node=0|zone=0|lastcpupid=0x3ffff) [ 45.980349] raw: 07fffe0000000000 fffffc00003e6708 ffff0000dac3c010 0000000000000000 [ 45.981309] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 [ 45.982259] page dumped because: kasan: bad access detected [ 45.982948] [ 45.983153] Memory state around the buggy address: [ 45.983753] ffff00000f997f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 45.984645] ffff00000f998000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 45.985533] >ffff00000f998080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 45.986419] ^ [ 45.987112] ffff00000f998100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 45.988006] ffff00000f998180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 45.988895] ================================================================== [ 45.989773] Disabling lock debugging due to kernel taint [ 45.990552] Kernel panic - not syncing: panic_on_warn set ... [ 45.991166] CPU: 1 PID: 732 Comm: a.out Tainted: G B 5.13.0+ #22 [ 45.991929] Hardware name: linux,dummy-virt (DT) [ 45.992448] Call trace: [ 45.992753] dump_backtrace+0x0/0x4c8 [ 45.993208] show_stack+0x30/0x40 [ 45.993627] dump_stack_lvl+0x120/0x18c [ 45.994113] dump_stack+0x1c/0x34 [ 45.994530] panic+0x3a4/0x7d8 [ 45.994930] end_report+0x194/0x198 [ 45.995380] kasan_report+0x134/0x200 [ 45.995850] __asan_report_load8_noabort+0x2c/0x50 [ 45.996453] bpf_xdp_link_release+0x3b8/0x3d0 [ 45.997007] bpf_link_free+0xd0/0x188 [ 45.997474] bpf_link_put+0x1d0/0x218 [ 45.997942] bpf_link_release+0x3c/0x58 [ 45.998429] __fput+0x20c/0x7e8 [ 45.998833] ____fput+0x24/0x30 [ 45.999247] task_work_run+0x104/0x258 [ 45.999731] do_notify_resume+0x894/0xaf8 [ 46.000236] work_pending+0xc/0x328 [ 46.000697] SMP: stopping secondary CPUs [ 46.001226] Dumping ftrace buffer: [ 46.001663] (ftrace buffer empty) [ 46.002110] Kernel Offset: disabled [ 46.002545] CPU features: 0x00000001,23202c00 [ 46.003080] Memory Limit: none Fixes: aa8d3a716b59db6c ("bpf, xdp: Add bpf_link-based XDP attachment API") Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210710031635.41649-1-xuanzhuo@linux.alibaba.com
2021-07-13bpf: Fix tail_call_reachable rejection for interpreter when jit failedDaniel Borkmann1-0/+2
During testing of f263a81451c1 ("bpf: Track subprog poke descriptors correctly and fix use-after-free") under various failure conditions, for example, when jit_subprogs() fails and tries to clean up the program to be run under the interpreter, we ran into the following freeze: [...] #127/8 tailcall_bpf2bpf_3:FAIL [...] [ 92.041251] BUG: KASAN: slab-out-of-bounds in ___bpf_prog_run+0x1b9d/0x2e20 [ 92.042408] Read of size 8 at addr ffff88800da67f68 by task test_progs/682 [ 92.043707] [ 92.044030] CPU: 1 PID: 682 Comm: test_progs Tainted: G O 5.13.0-53301-ge6c08cb33a30-dirty #87 [ 92.045542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014 [ 92.046785] Call Trace: [ 92.047171] ? __bpf_prog_run_args64+0xc0/0xc0 [ 92.047773] ? __bpf_prog_run_args32+0x8b/0xb0 [ 92.048389] ? __bpf_prog_run_args64+0xc0/0xc0 [ 92.049019] ? ktime_get+0x117/0x130 [...] // few hundred [similar] lines more [ 92.659025] ? ktime_get+0x117/0x130 [ 92.659845] ? __bpf_prog_run_args64+0xc0/0xc0 [ 92.660738] ? __bpf_prog_run_args32+0x8b/0xb0 [ 92.661528] ? __bpf_prog_run_args64+0xc0/0xc0 [ 92.662378] ? print_usage_bug+0x50/0x50 [ 92.663221] ? print_usage_bug+0x50/0x50 [ 92.664077] ? bpf_ksym_find+0x9c/0xe0 [ 92.664887] ? ktime_get+0x117/0x130 [ 92.665624] ? kernel_text_address+0xf5/0x100 [ 92.666529] ? __kernel_text_address+0xe/0x30 [ 92.667725] ? unwind_get_return_address+0x2f/0x50 [ 92.668854] ? ___bpf_prog_run+0x15d4/0x2e20 [ 92.670185] ? ktime_get+0x117/0x130 [ 92.671130] ? __bpf_prog_run_args64+0xc0/0xc0 [ 92.672020] ? __bpf_prog_run_args32+0x8b/0xb0 [ 92.672860] ? __bpf_prog_run_args64+0xc0/0xc0 [ 92.675159] ? ktime_get+0x117/0x130 [ 92.677074] ? lock_is_held_type+0xd5/0x130 [ 92.678662] ? ___bpf_prog_run+0x15d4/0x2e20 [ 92.680046] ? ktime_get+0x117/0x130 [ 92.681285] ? __bpf_prog_run32+0x6b/0x90 [ 92.682601] ? __bpf_prog_run64+0x90/0x90 [ 92.683636] ? lock_downgrade+0x370/0x370 [ 92.684647] ? mark_held_locks+0x44/0x90 [ 92.685652] ? ktime_get+0x117/0x130 [ 92.686752] ? lockdep_hardirqs_on+0x79/0x100 [ 92.688004] ? ktime_get+0x117/0x130 [ 92.688573] ? __cant_migrate+0x2b/0x80 [ 92.689192] ? bpf_test_run+0x2f4/0x510 [ 92.689869] ? bpf_test_timer_continue+0x1c0/0x1c0 [ 92.690856] ? rcu_read_lock_bh_held+0x90/0x90 [ 92.691506] ? __kasan_slab_alloc+0x61/0x80 [ 92.692128] ? eth_type_trans+0x128/0x240 [ 92.692737] ? __build_skb+0x46/0x50 [ 92.693252] ? bpf_prog_test_run_skb+0x65e/0xc50 [ 92.693954] ? bpf_prog_test_run_raw_tp+0x2d0/0x2d0 [ 92.694639] ? __fget_light+0xa1/0x100 [ 92.695162] ? bpf_prog_inc+0x23/0x30 [ 92.695685] ? __sys_bpf+0xb40/0x2c80 [ 92.696324] ? bpf_link_get_from_fd+0x90/0x90 [ 92.697150] ? mark_held_locks+0x24/0x90 [ 92.698007] ? lockdep_hardirqs_on_prepare+0x124/0x220 [ 92.699045] ? finish_task_switch+0xe6/0x370 [ 92.700072] ? lockdep_hardirqs_on+0x79/0x100 [ 92.701233] ? finish_task_switch+0x11d/0x370 [ 92.702264] ? __switch_to+0x2c0/0x740 [ 92.703148] ? mark_held_locks+0x24/0x90 [ 92.704155] ? __x64_sys_bpf+0x45/0x50 [ 92.705146] ? do_syscall_64+0x35/0x80 [ 92.706953] ? entry_SYSCALL_64_after_hwframe+0x44/0xae [...] Turns out that the program rejection from e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") is buggy since env->prog->aux->tail_call_reachable is never true. Commit ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") added a tracker into check_max_stack_depth() which propagates the tail_call_reachable condition throughout the subprograms. This info is then assigned to the subprogram's func[i]->aux->tail_call_reachable. However, in the case of the rejection check upon JIT failure, env->prog->aux->tail_call_reachable is used. func[0]->aux->tail_call_reachable which represents the main program's information did not propagate this to the outer env->prog->aux, though. Add this propagation into check_max_stack_depth() where it needs to belong so that the check can be done reliably. Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") Co-developed-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/618c34e3163ad1a36b1e82377576a6081e182f25.1626123173.git.daniel@iogearbox.net
2021-07-12bpf, test: fix NULL pointer dereference on invalid expected_attach_typeXuan Zhuo1-0/+3
These two types of XDP progs (BPF_XDP_DEVMAP, BPF_XDP_CPUMAP) will not be executed directly in the driver, therefore we should also not directly run them from here. To run in these two situations, there must be further preparations done, otherwise these may cause a kernel panic. For more details, see also dev_xdp_attach(). [ 46.982479] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 46.984295] #PF: supervisor read access in kernel mode [ 46.985777] #PF: error_code(0x0000) - not-present page [ 46.987227] PGD 800000010dca4067 P4D 800000010dca4067 PUD 10dca6067 PMD 0 [ 46.989201] Oops: 0000 [#1] SMP PTI [ 46.990304] CPU: 7 PID: 562 Comm: a.out Not tainted 5.13.0+ #44 [ 46.992001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/24 [ 46.995113] RIP: 0010:___bpf_prog_run+0x17b/0x1710 [ 46.996586] Code: 49 03 14 cc e8 76 f6 fe ff e9 ad fe ff ff 0f b6 43 01 48 0f bf 4b 02 48 83 c3 08 89 c2 83 e0 0f c0 ea 04 02 [ 47.001562] RSP: 0018:ffffc900005afc58 EFLAGS: 00010246 [ 47.003115] RAX: 0000000000000000 RBX: ffffc9000023f068 RCX: 0000000000000000 [ 47.005163] RDX: 0000000000000000 RSI: 0000000000000079 RDI: ffffc900005afc98 [ 47.007135] RBP: 0000000000000000 R08: ffffc9000023f048 R09: c0000000ffffdfff [ 47.009171] R10: 0000000000000001 R11: ffffc900005afb40 R12: ffffc900005afc98 [ 47.011172] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff825258a8 [ 47.013244] FS: 00007f04a5207580(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000 [ 47.015705] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 47.017475] CR2: 0000000000000000 CR3: 0000000100182005 CR4: 0000000000770ee0 [ 47.019558] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 47.021595] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 47.023574] PKRU: 55555554 [ 47.024571] Call Trace: [ 47.025424] __bpf_prog_run32+0x32/0x50 [ 47.026296] ? printk+0x53/0x6a [ 47.027066] ? ktime_get+0x39/0x90 [ 47.027895] bpf_test_run.cold.28+0x23/0x123 [ 47.028866] ? printk+0x53/0x6a [ 47.029630] bpf_prog_test_run_xdp+0x149/0x1d0 [ 47.030649] __sys_bpf+0x1305/0x23d0 [ 47.031482] __x64_sys_bpf+0x17/0x20 [ 47.032316] do_syscall_64+0x3a/0x80 [ 47.033165] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 47.034254] RIP: 0033:0x7f04a51364dd [ 47.035133] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 48 [ 47.038768] RSP: 002b:00007fff8f9fc518 EFLAGS: 00000213 ORIG_RAX: 0000000000000141 [ 47.040344] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f04a51364dd [ 47.041749] RDX: 0000000000000048 RSI: 0000000020002a80 RDI: 000000000000000a [ 47.043171] RBP: 00007fff8f9fc530 R08: 0000000002049300 R09: 0000000020000100 [ 47.044626] R10: 0000000000000004 R11: 0000000000000213 R12: 0000000000401070 [ 47.046088] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 47.047579] Modules linked in: [ 47.048318] CR2: 0000000000000000 [ 47.049120] ---[ end trace 7ad34443d5be719a ]--- [ 47.050273] RIP: 0010:___bpf_prog_run+0x17b/0x1710 [ 47.051343] Code: 49 03 14 cc e8 76 f6 fe ff e9 ad fe ff ff 0f b6 43 01 48 0f bf 4b 02 48 83 c3 08 89 c2 83 e0 0f c0 ea 04 02 [ 47.054943] RSP: 0018:ffffc900005afc58 EFLAGS: 00010246 [ 47.056068] RAX: 0000000000000000 RBX: ffffc9000023f068 RCX: 0000000000000000 [ 47.057522] RDX: 0000000000000000 RSI: 0000000000000079 RDI: ffffc900005afc98 [ 47.058961] RBP: 0000000000000000 R08: ffffc9000023f048 R09: c0000000ffffdfff [ 47.060390] R10: 0000000000000001 R11: ffffc900005afb40 R12: ffffc900005afc98 [ 47.061803] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff825258a8 [ 47.063249] FS: 00007f04a5207580(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000 [ 47.065070] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 47.066307] CR2: 0000000000000000 CR3: 0000000100182005 CR4: 0000000000770ee0 [ 47.067747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 47.069217] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 47.070652] PKRU: 55555554 [ 47.071318] Kernel panic - not syncing: Fatal exception [ 47.072854] Kernel Offset: disabled [ 47.073683] ---[ end Kernel panic - not syncing: Fatal exception ]--- Fixes: 9216477449f3 ("bpf: cpumap: Add the possibility to attach an eBPF program to cpumap") Fixes: fbee97feed9b ("bpf: Add support to attach bpf program to a devmap entry") Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: David Ahern <dsahern@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20210708080409.73525-1-xuanzhuo@linux.alibaba.com
2021-07-12doc, af_xdp: Fix bind flags option typoBaruch Siach1-3/+3
Fix XDP_ZERO_COPY flag typo since it is actually named XDP_ZEROCOPY instead as per if_xdp.h uapi header. Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/1656fdf94704e9e735df0f8b97667d8f26dd098b.1625550240.git.baruch@tkos.co.il
2021-07-11net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340Marek Behún2-10/+36
It seems that we cannot differentiate 88X3310 from 88X3340 by simply looking at bit 3 of revision ID. This only works on revisions A0 and A1. On revision B0, this bit is always 1. Instead use the 3.d00d register for differentiation, since this register contains information about number of ports on the device. Fixes: 9885d016ffa9 ("net: phy: marvell10g: add separate structure for 88X3340") Signed-off-by: Marek Behún <kabel@kernel.org> Reported-by: Matteo Croce <mcroce@linux.microsoft.com> Tested-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-11dsa: fix for_each_child.cocci warningskernel test robot1-1/+3
For_each_available_child_of_node should have of_node_put() before return around line 423. Generated by: scripts/coccinelle/iterators/for_each_child.cocci CC: Alexander Lobakin <alobakin@pm.me> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: kernel test robot <lkp@intel.com> Signed-off-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-10virtio_net: check virtqueue_add_sgs() return valueYunjian Wang1-1/+7
As virtqueue_add_sgs() can fail, we should check the return value. Addresses-Coverity-ID: 1464439 ("Unchecked return value") Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-10Merge branch 'mptcp-Connection-and-accounting-fixes'David S. Miller10-28/+68
Mat Martineau says: ==================== mptcp: Connection and accounting fixes Here are some miscellaneous fixes for MPTCP: Patch 1 modifies an MPTCP hash so it doesn't depend on one of skb->dev and skb->sk being non-NULL. Patch 2 removes an extra destructor call when rejecting a join due to port mismatch. Patches 3 and 5 more cleanly handle error conditions with MP_JOIN and syncookies, and update a related self test. Patch 4 makes sure packets that trigger a subflow TCP reset during MPTCP option header processing are correctly dropped. Patch 6 addresses a rmem accounting issue that could keep packets in subflow receive buffers longer than necessary, delaying MPTCP-level ACKs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-10mptcp: properly account bulk freed memoryPaolo Abeni4-6/+18
After commit 879526030c8b ("mptcp: protect the rx path with the msk socket spinlock") the rmem currently used by a given msk is really sk_rmem_alloc - rmem_released. The safety check in mptcp_data_ready() does not take the above in due account, as a result legit incoming data is kept in subflow receive queue with no reason, delaying or blocking MPTCP-level ack generation. This change addresses the issue introducing a new helper to fetch the rmem memory and using it as needed. Additionally add a MIB counter for the exceptional event described above - the peer is misbehaving. Finally, introduce the required annotation when rmem_released is updated. Fixes: 879526030c8b ("mptcp: protect the rx path with the msk socket spinlock") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/211 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-10selftests: mptcp: fix case multiple subflows limited by serverJianguo Wu1-1/+1
After patch "mptcp: fix syncookie process if mptcp can not_accept new subflow", if subflow is limited, MP_JOIN SYN is dropped, and no SYN/ACK will be replied. So in case "multiple subflows limited by server", the expected SYN/ACK number should be 1. Fixes: 00587187ad30 ("selftests: mptcp: add test cases for mptcp join tests with syn cookies") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-10mptcp: avoid processing packet if a subflow resetJianguo Wu3-12/+31
If check_fully_established() causes a subflow reset, it should not continue to process the packet in tcp_data_queue(). Add a return value to mptcp_incoming_options(), and return false if a subflow has been reset, else return true. Then drop the packet in tcp_data_queue()/tcp_rcv_state_process() if mptcp_incoming_options() return false. Fixes: d582484726c4 ("mptcp: fix fallback for MP_JOIN subflows") Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-10mptcp: fix syncookie process if mptcp can not_accept new subflowJianguo Wu1-3/+3
Lots of "TCP: tcp_fin: Impossible, sk->sk_state=7" in client side when doing stress testing using wrk and webfsd. There are at least two cases may trigger this warning: 1.mptcp is in syncookie, and server recv MP_JOIN SYN request, in subflow_check_req(), the mptcp_can_accept_new_subflow() return false, so subflow_init_req_cookie_join_save() isn't called, i.e. not store the data present in the MP_JOIN syn request and the random nonce in hash table - join_entries[], but still send synack. When recv 3rd-ack, mptcp_token_join_cookie_init_state() will return false, and 3rd-ack is dropped, then if mptcp conn is closed by client, client will send a DATA_FIN and a MPTCP FIN, the DATA_FIN doesn't have MP_CAPABLE or MP_JOIN, so mptcp_subflow_init_cookie_req() will return 0, and pass the cookie check, MP_JOIN request is fallback to normal TCP. Server will send a TCP FIN if closed, in client side, when process TCP FIN, it will do reset, the code path is: tcp_data_queue()->mptcp_incoming_options() ->check_fully_established()->mptcp_subflow_reset(). mptcp_subflow_reset() will set sock state to TCP_CLOSE, so tcp_fin will hit TCP_CLOSE, and print the warning. 2.mptcp is in syncookie, and server recv 3rd-ack, in mptcp_subflow_init_cookie_req(), mptcp_can_accept_new_subflow() return false, and subflow_req->mp_join is not set to 1, so in subflow_syn_recv_sock() will not reset the MP_JOIN subflow, but fallback to normal TCP, and then the same thing happens when server will send a TCP FIN if closed. For case1, subflow_check_req() return -EPERM, then tcp_conn_request() will drop MP_JOIN SYN. For case2, let subflow_syn_recv_sock() call mptcp_can_accept_new_subflow(), and do fatal fallback, send reset. Fixes: 9466a1ccebbe ("mptcp: enable JOIN requests even if cookies are in use") Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-10mptcp: remove redundant req destruct in subflow_check_req()Jianguo Wu1-5/+0
In subflow_check_req(), if subflow sport is mismatch, will put msk, destroy token, and destruct req, then return -EPERM, which can be done by subflow_req_destructor() via: tcp_conn_request() |--__reqsk_free() |--subflow_req_destructor() So we should remove these redundant code, otherwise will call tcp_v4_reqsk_destructor() twice, and may double free inet_rsk(req)->ireq_opt. Fixes: 5bc56388c74f ("mptcp: add port number check for MP_JOIN") Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-10mptcp: fix warning in __skb_flow_dissect() when do syn cookie for subflow joinJianguo Wu1-1/+15
I did stress test with wrk[1] and webfsd[2] with the assistance of mptcp-tools[3]: Server side: ./use_mptcp.sh webfsd -4 -R /tmp/ -p 8099 Client side: ./use_mptcp.sh wrk -c 200 -d 30 -t 4 http://192.168.174.129:8099/ and got the following warning message: [ 55.552626] TCP: request_sock_subflow: Possible SYN flooding on port 8099. Sending cookies. Check SNMP counters. [ 55.553024] ------------[ cut here ]------------ [ 55.553027] WARNING: CPU: 0 PID: 10 at net/core/flow_dissector.c:984 __skb_flow_dissect+0x280/0x1650 ... [ 55.553117] CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.12.0+ #18 [ 55.553121] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020 [ 55.553124] RIP: 0010:__skb_flow_dissect+0x280/0x1650 ... [ 55.553133] RSP: 0018:ffffb79580087770 EFLAGS: 00010246 [ 55.553137] RAX: 0000000000000000 RBX: ffffffff8ddb58e0 RCX: ffffb79580087888 [ 55.553139] RDX: ffffffff8ddb58e0 RSI: ffff8f7e4652b600 RDI: 0000000000000000 [ 55.553141] RBP: ffffb79580087858 R08: 0000000000000000 R09: 0000000000000008 [ 55.553143] R10: 000000008c622965 R11: 00000000d3313a5b R12: ffff8f7e4652b600 [ 55.553146] R13: ffff8f7e465c9062 R14: 0000000000000000 R15: ffffb79580087888 [ 55.553149] FS: 0000000000000000(0000) GS:ffff8f7f75e00000(0000) knlGS:0000000000000000 [ 55.553152] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 55.553154] CR2: 00007f73d1d19000 CR3: 0000000135e10004 CR4: 00000000003706f0 [ 55.553160] Call Trace: [ 55.553166] ? __sha256_final+0x67/0xd0 [ 55.553173] ? sha256+0x7e/0xa0 [ 55.553177] __skb_get_hash+0x57/0x210 [ 55.553182] subflow_init_req_cookie_join_save+0xac/0xc0 [ 55.553189] subflow_check_req+0x474/0x550 [ 55.553195] ? ip_route_output_key_hash+0x67/0x90 [ 55.553200] ? xfrm_lookup_route+0x1d/0xa0 [ 55.553207] subflow_v4_route_req+0x8e/0xd0 [ 55.553212] tcp_conn_request+0x31e/0xab0 [ 55.553218] ? selinux_socket_sock_rcv_skb+0x116/0x210 [ 55.553224] ? tcp_rcv_state_process+0x179/0x6d0 [ 55.553229] tcp_rcv_state_process+0x179/0x6d0 [ 55.553235] tcp_v4_do_rcv+0xaf/0x220 [ 55.553239] tcp_v4_rcv+0xce4/0xd80 [ 55.553243] ? ip_route_input_rcu+0x246/0x260 [ 55.553248] ip_protocol_deliver_rcu+0x35/0x1b0 [ 55.553253] ip_local_deliver_finish+0x44/0x50 [ 55.553258] ip_local_deliver+0x6c/0x110 [ 55.553262] ? ip_rcv_finish_core.isra.19+0x5a/0x400 [ 55.553267] ip_rcv+0xd1/0xe0 ... After debugging, I found in __skb_flow_dissect(), skb->dev and skb->sk are both NULL, then net is NULL, and trigger WARN_ON_ONCE(!net), actually net is always NULL in this code path, as skb->dev is set to NULL in tcp_v4_rcv(), and skb->sk is never set. Code snippet in __skb_flow_dissect() that trigger warning: 975 if (skb) { 976 if (!net) { 977 if (skb->dev) 978 net = dev_net(skb->dev); 979 else if (skb->sk) 980 net = sock_net(skb->sk); 981 } 982 } 983 984 WARN_ON_ONCE(!net); So, using seq and transport header derived hash. [1] https://github.com/wg/wrk [2] https://github.com/ourway/webfsd [3] https://github.com/pabeni/mptcp-tools Fixes: 9466a1ccebbe ("mptcp: enable JOIN requests even if cookies are in use") Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller13-62/+118
Daniel Borkmann says: ==================== pull-request: bpf 2021-07-09 The following pull-request contains BPF updates for your *net* tree. We've added 9 non-merge commits during the last 9 day(s) which contain a total of 13 files changed, 118 insertions(+), 62 deletions(-). The main changes are: 1) Fix runqslower task->state access from BPF, from SanjayKumar Jeyakumar. 2) Fix subprog poke descriptor tracking use-after-free, from John Fastabend. 3) Fix sparse complaint from prior devmap RCU conversion, from Toke Høiland-Jørgensen. 4) Fix missing va_end in bpftool JIT json dump's error path, from Gu Shengxian. 5) Fix tools/bpf install target from missing runqslower install, from Wei Li. 6) Fix xdpsock BPF sample to unload program on shared umem option, from Wang Hai. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net: validate lwtstate->data before returning from skb_tunnel_info()Taehee Yoo1-1/+3
skb_tunnel_info() returns pointer of lwtstate->data as ip_tunnel_info type without validation. lwtstate->data can have various types such as mpls_iptunnel_encap, etc and these are not compatible. So skb_tunnel_info() should validate before returning that pointer. Splat looks like: BUG: KASAN: slab-out-of-bounds in vxlan_get_route+0x418/0x4b0 [vxlan] Read of size 2 at addr ffff888106ec2698 by task ping/811 CPU: 1 PID: 811 Comm: ping Not tainted 5.13.0+ #1195 Call Trace: dump_stack_lvl+0x56/0x7b print_address_description.constprop.8.cold.13+0x13/0x2ee ? vxlan_get_route+0x418/0x4b0 [vxlan] ? vxlan_get_route+0x418/0x4b0 [vxlan] kasan_report.cold.14+0x83/0xdf ? vxlan_get_route+0x418/0x4b0 [vxlan] vxlan_get_route+0x418/0x4b0 [vxlan] [ ... ] vxlan_xmit_one+0x148b/0x32b0 [vxlan] [ ... ] vxlan_xmit+0x25c5/0x4780 [vxlan] [ ... ] dev_hard_start_xmit+0x1ae/0x6e0 __dev_queue_xmit+0x1f39/0x31a0 [ ... ] neigh_xmit+0x2f9/0x940 mpls_xmit+0x911/0x1600 [mpls_iptunnel] lwtunnel_xmit+0x18f/0x450 ip_finish_output2+0x867/0x2040 [ ... ] Fixes: 61adedf3e3f1 ("route: move lwtunnel state to dst_entry") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net: ip_tunnel: fix mtu calculation for ETHER tunnel devicesHangbin Liu1-3/+15
Commit 28e104d00281 ("net: ip_tunnel: fix mtu calculation") removed dev->hard_header_len subtraction when calculate MTU for tunnel devices as there is an overhead for device that has header_ops. But there are ETHER tunnel devices, like gre_tap or erspan, which don't have header_ops but set dev->hard_header_len during setup. This makes pkts greater than (MTU - ETH_HLEN) could not be xmited. Fix it by subtracting the ETHER tunnel devices' dev->hard_header_len for MTU calculation. Fixes: 28e104d00281 ("net: ip_tunnel: fix mtu calculation") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net: do not reuse skbuff allocated from skbuff_fclone_cache in the skb cacheAntoine Tenart1-0/+2
Some socket buffers allocated in the fclone cache (in __alloc_skb) can end-up in the following path[1]: napi_skb_finish __kfree_skb_defer napi_skb_cache_put The issue is napi_skb_cache_put is not fclone friendly and will put those skbuff in the skb cache to be reused later, although this cache only expects skbuff allocated from skbuff_head_cache. When this happens the skbuff is eventually freed using the wrong origin cache, and we can see traces similar to: [ 1223.947534] cache_from_obj: Wrong slab cache. skbuff_head_cache but object is from skbuff_fclone_cache [ 1223.948895] WARNING: CPU: 3 PID: 0 at mm/slab.h:442 kmem_cache_free+0x251/0x3e0 [ 1223.950211] Modules linked in: [ 1223.950680] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.13.0+ #474 [ 1223.951587] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014 [ 1223.953060] RIP: 0010:kmem_cache_free+0x251/0x3e0 Leading sometimes to other memory related issues. Fix this by using __kfree_skb for fclone skbuff, similar to what is done the other place __kfree_skb_defer is called. [1] At least in setups using veth pairs and tunnels. Building a kernel with KASAN we can for example see packets allocated in sk_stream_alloc_skb hit the above path and later the issue arises when the skbuff is reused. Fixes: 9243adfc311a ("skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing") Cc: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy pathTalal Ahmad1-0/+3
sk_wmem_schedule makes sure that sk_forward_alloc has enough bytes for charging that is going to be done by sk_mem_charge. In the transmit zerocopy path, there is sk_mem_charge but there was no call to sk_wmem_schedule. This change adds that call. Without this call to sk_wmem_schedule, sk_forward_alloc can go negetive which is a bug because sk_forward_alloc is a per-socket space that has been forward charged so this can't be negative. Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY") Signed-off-by: Talal Ahmad <talalahmad@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net: send SYNACK packet with accepted fwmarkAlexander Ovechkin1-1/+1
commit e05a90ec9e16 ("net: reflect mark on tcp syn ack packets") fixed IPv4 only. This part is for the IPv6 side. Fixes: e05a90ec9e16 ("net: reflect mark on tcp syn ack packets") Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru> Acked-by: Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net: ti: fix UAF in tlan_remove_onePavel Skripkin1-2/+1
priv is netdev private data and it cannot be used after free_netdev() call. Using priv after free_netdev() can cause UAF bug. Fix it by moving free_netdev() at the end of the function. Fixes: 1e0a8b13d355 ("tlan: cancel work at remove path") Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net: qcom/emac: fix UAF in emac_removePavel Skripkin1-1/+2
adpt is netdev private data and it cannot be used after free_netdev() call. Using adpt after free_netdev() can cause UAF bug. Fix it by moving free_netdev() at the end of the function. Fixes: 54e19bc74f33 ("net: qcom/emac: do not use devm on internal phy pdev") Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net: moxa: fix UAF in moxart_mac_probePavel Skripkin1-3/+1
In case of netdev registration failure the code path will jump to init_fail label: init_fail: netdev_err(ndev, "init failed\n"); moxart_mac_free_memory(ndev); irq_map_fail: free_netdev(ndev); return ret; So, there is no need to call free_netdev() before jumping to error handling path, since it can cause UAF or double-free bug. Fixes: 6c821bd9edc9 ("net: Add MOXA ART SoCs ethernet driver") Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09bpf: Selftest to verify mixing bpf2bpf calls and tailcalls with insn patchJohn Fastabend2-10/+44
This adds some extra noise to the tailcall_bpf2bpf4 tests that will cause verify to patch insns. This then moves around subprog start/end insn index and poke descriptor insn index to ensure that verify and JIT will continue to track these correctly. If done correctly verifier should pass this program same as before and JIT should emit tail call logic. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210707223848.14580-3-john.fastabend@gmail.com
2021-07-09bpf: Track subprog poke descriptors correctly and fix use-after-freeJohn Fastabend4-40/+32
Subprograms are calling map_poke_track(), but on program release there is no hook to call map_poke_untrack(). However, on program release, the aux memory (and poke descriptor table) is freed even though we still have a reference to it in the element list of the map aux data. When we run map_poke_run(), we then end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run(): [...] [ 402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e [ 402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337 [ 402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G I 5.12.0+ #399 [ 402.824715] Call Trace: [ 402.824719] dump_stack+0x93/0xc2 [ 402.824727] print_address_description.constprop.0+0x1a/0x140 [ 402.824736] ? prog_array_map_poke_run+0xc2/0x34e [ 402.824740] ? prog_array_map_poke_run+0xc2/0x34e [ 402.824744] kasan_report.cold+0x7c/0xd8 [ 402.824752] ? prog_array_map_poke_run+0xc2/0x34e [ 402.824757] prog_array_map_poke_run+0xc2/0x34e [ 402.824765] bpf_fd_array_map_update_elem+0x124/0x1a0 [...] The elements concerned are walked as follows: for (i = 0; i < elem->aux->size_poke_tab; i++) { poke = &elem->aux->poke_tab[i]; [...] The access to size_poke_tab is a 4 byte read, verified by checking offsets in the KASAN dump: [ 402.825004] The buggy address belongs to the object at ffff8881905a7800 which belongs to the cache kmalloc-1k of size 1024 [ 402.825008] The buggy address is located 320 bytes inside of 1024-byte region [ffff8881905a7800, ffff8881905a7c00) The pahole output of bpf_prog_aux: struct bpf_prog_aux { [...] /* --- cacheline 5 boundary (320 bytes) --- */ u32 size_poke_tab; /* 320 4 */ [...] In general, subprograms do not necessarily manage their own data structures. For example, BTF func_info and linfo are just pointers to the main program structure. This allows reference counting and cleanup to be done on the latter which simplifies their management a bit. The aux->poke_tab struct, however, did not follow this logic. The initial proposed fix for this use-after-free bug further embedded poke data tracking into the subprogram with proper reference counting. However, Daniel and Alexei questioned why we were treating these objects special; I agree, its unnecessary. The fix here removes the per subprogram poke table allocation and map tracking and instead simply points the aux->poke_tab pointer at the main programs poke table. This way, map tracking is simplified to the main program and we do not need to manage them per subprogram. This also means, bpf_prog_free_deferred(), which unwinds the program reference counting and kfrees objects, needs to ensure that we don't try to double free the poke_tab when free'ing the subprog structures. This is easily solved by NULL'ing the poke_tab pointer. The second detail is to ensure that per subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do this, we add a pointer in the poke structure to point at the subprogram value so JITs can easily check while walking the poke_tab structure if the current entry belongs to the current program. The aux pointer is stable and therefore suitable for such comparison. On the jit_subprogs() error path, we omit cleaning up the poke->aux field because these are only ever referenced from the JIT side, but on error we will never make it to the JIT, so its fine to leave them dangling. Removing these pointers would complicate the error path for no reason. However, we do need to untrack all poke descriptors from the main program as otherwise they could race with the freeing of JIT memory from the subprograms. Lastly, a748c6975dea3 ("bpf: propagate poke descriptors to subprograms") had an off-by-one on the subprogram instruction index range check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'. However, subprog_end is the next subprogram's start instruction. Fixes: a748c6975dea3 ("bpf: propagate poke descriptors to subprograms") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
2021-07-09net: bcmgenet: Ensure all TX/RX queues DMAs are disabledFlorian Fainelli1-0/+6
Make sure that we disable each of the TX and RX queues in the TDMA and RDMA control registers. This is a correctness change to be symmetrical with the code that enables the TX and RX queues. Tested-by: Maxime Ripard <maxime@cerno.tech> Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09Merge branch 'ncsi-phy-link-up'David S. Miller4-5/+68
Ivan Mikhaylov says: ==================== net/ncsi: Add NCSI Intel OEM command to keep PHY link up Add NCSI Intel OEM command to keep PHY link up and prevents any channel resets during the host load on i210. Also includes dummy response handler for Intel manufacturer id. Changes from v1: 1. sparse fixes about casts 2. put it after ncsi_dev_state_probe_cis instead of ncsi_dev_state_probe_channel because sometimes channel is not ready after it 3. inl -> intel ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net/ncsi: add dummy response handler for Intel boardsIvan Mikhaylov1-1/+8
Add the dummy response handler for Intel boards to prevent incorrect handling of OEM commands. Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net/ncsi: add NCSI Intel OEM command to keep PHY upIvan Mikhaylov3-0/+56
This allows to keep PHY link up and prevents any channel resets during the host load. It is KEEP_PHY_LINK_UP option(Veto bit) in i210 datasheet which block PHY reset and power state changes. Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net/ncsi: fix restricted cast warning of sparseIvan Mikhaylov2-4/+4
Sparse reports: net/ncsi/ncsi-rsp.c:406:24: warning: cast to restricted __be32 net/ncsi/ncsi-manage.c:732:33: warning: cast to restricted __be32 net/ncsi/ncsi-manage.c:756:25: warning: cast to restricted __be32 net/ncsi/ncsi-manage.c:779:25: warning: cast to restricted __be32 Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08net: microchip: sparx5: fix kconfig warningRandy Dunlap1-0/+1
PHY_SPARX5_SERDES depends on OF so SPARX5_SWITCH should also depend on OF since 'select' does not follow any dependencies. WARNING: unmet direct dependencies detected for PHY_SPARX5_SERDES Depends on [n]: (ARCH_SPARX5 || COMPILE_TEST [=n]) && OF [=n] && HAS_IOMEM [=y] Selected by [y]: - SPARX5_SWITCH [=y] && NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MICROCHIP [=y] && NET_SWITCHDEV [=y] && HAS_IOMEM [=y] Fixes: 3cfa11bac9bb ("net: sparx5: add the basic sparx5 driver") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: UNGLinuxDriver@microchip.com Cc: linux-arm-kernel@lists.infradead.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08cxgb4: fix IRQ free race during driver unloadShahjada Abul Husain2-8/+13
IRQs are requested during driver's ndo_open() and then later freed up in disable_interrupts() during driver unload. A race exists where driver can set the CXGB4_FULL_INIT_DONE flag in ndo_open() after the disable_interrupts() in driver unload path checks it, and hence misses calling free_irq(). Fix by unregistering netdevice first and sync with driver's ndo_open(). This ensures disable_interrupts() checks the flag correctly and frees up the IRQs properly. Fixes: b37987e8db5f ("cxgb4: Disable interrupts and napi before unregistering netdev") Signed-off-by: Shahjada Abul Husain <shahjada@chelsio.com> Signed-off-by: Raju Rangoju <rajur@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08mt76: mt7921: continue to probe driver when fw already downloadedAaron Ma1-1/+2
When reboot system, no power cycles, firmware is already downloaded, return -EIO will break driver as error: mt7921e: probe of 0000:03:00.0 failed with error -5 Skip firmware download and continue to probe. Signed-off-by: Aaron Ma <aaron.ma@canonical.com> Fixes: 1c099ab44727c ("mt76: mt7921: add MCU support") Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08atl1c: fix Mikrotik 10/25G NIC detectionGatis Peisenieks1-0/+5
Since Mikrotik 10/25G NIC MDIO op emulation is not 100% reliable, on rare occasions it can happen that some physical functions of the NIC do not get initialized due to timeouted early MDIO op. This changes the atl1c probe on Mikrotik 10/25G NIC not to depend on MDIO op emulation. Signed-off-by: Gatis Peisenieks <gatis@mikrotik.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08ptp: Relocate lookup cookie to correct block.Jonathan Lemon1-1/+1
An earlier commit set the pps_lookup cookie, but the line was somehow added to the wrong code block. Correct this. Fixes: 8602e40fc813 ("ptp: Set lookup cookie when creating a PTP PPS source.") Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Dario Binacchi <dariobin@libero.it> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08ipv6: tcp: drop silly ICMPv6 packet too big messagesEric Dumazet2-2/+18
While TCP stack scales reasonably well, there is still one part that can be used to DDOS it. IPv6 Packet too big messages have to lookup/insert a new route, and if abused by attackers, can easily put hosts under high stress, with many cpus contending on a spinlock while one is stuck in fib6_run_gc() ip6_protocol_deliver_rcu() icmpv6_rcv() icmpv6_notify() tcp_v6_err() tcp_v6_mtu_reduced() inet6_csk_update_pmtu() ip6_rt_update_pmtu() __ip6_rt_update_pmtu() ip6_rt_cache_alloc() ip6_dst_alloc() dst_alloc() ip6_dst_gc() fib6_run_gc() spin_lock_bh() ... Some of our servers have been hit by malicious ICMPv6 packets trying to _increase_ the MTU/MSS of TCP flows. We believe these ICMPv6 packets are a result of a bug in one ISP stack, since they were blindly sent back for _every_ (small) packet sent to them. These packets are for one TCP flow: 09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 TCP stack can filter some silly requests : 1) MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err() 2) tcp_v6_mtu_reduced() can drop requests trying to increase current MSS. This tests happen before the IPv6 routing stack is entered, thus removing the potential contention and route exhaustion. Note that IPv6 stack was performing these checks, but too late (ie : after the route has been added, and after the potential garbage collect war) v2: fix typo caught by Martin, thanks ! v3: exports tcp_mtu_to_mss(), caught by David, thanks ! Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08skbuff: Fix build with SKB extensions disabledFlorian Fainelli1-1/+2
We will fail to build with CONFIG_SKB_EXTENSIONS disabled after 8550ff8d8c75 ("skbuff: Release nfct refcount on napi stolen or re-used skbs") since there is an unconditionally use of skb_ext_find() without an appropriate stub. Simply build the code conditionally and properly guard against both COFNIG_SKB_EXTENSIONS as well as CONFIG_NET_TC_SKB_EXT being disabled. Fixes: Fixes: 8550ff8d8c75 ("skbuff: Release nfct refcount on napi stolen or re-used skbs") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08ipmr: Fix indentation issueRoy, UjjaL1-1/+1
Fixed indentation by removing extra spaces. Signed-off-by: Roy, UjjaL <royujjal@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08sock: unlock on error in sock_setsockopt()Dan Carpenter1-2/+4
If copy_from_sockptr() then we need to unlock before returning. Fixes: d463126e23f1 ("net: sock: extend SO_TIMESTAMPING for PHC binding") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08tools: bpf: Fix error in 'make -C tools/ bpf_install'Wei Li1-5/+2
make[2]: *** No rule to make target 'install'. Stop. make[1]: *** [Makefile:122: runqslower_install] Error 2 make: *** [Makefile:116: bpf_install] Error 2 There is no rule for target 'install' in tools/bpf/runqslower/Makefile, and there is no need to install it, so just remove 'runqslower_install'. Fixes: 9c01546d26d2 ("tools/bpf: Add runqslower tool to tools/bpf") Signed-off-by: Wei Li <liwei391@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210628030409.3459095-1-liwei391@huawei.com
2021-07-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller15-49/+262
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Do not refresh timeout in SYN_SENT for syn retransmissions. Add selftest for unreplied TCP connection, from Florian Westphal. 2) Fix null dereference from error path with hardware offload in nftables. 3) Remove useless nf_ct_gre_keymap_flush() from netns exit path, from Vasily Averin. 4) Missing rcu read-lock side in ctnetlink helper info dump, also from Vasily. 5) Do not mark RST in the reply direction coming after SYN packet for an out-of-sync entry, from Ali Abdallah and Florian Westphal. 6) Add tcp_ignore_invalid_rst sysctl to allow to disable out of segment RSTs, from Ali. 7) KCSAN fix for nf_conntrack_all_lock(), from Manfred Spraul. 8) Honor NFTA_LAST_SET in nft_last. 9) Fix incorrect arithmetics when restore last_jiffies in nft_last. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07selftests: icmp_redirect: IPv6 PMTU info should be cleared after redirectHangbin Liu1-2/+3
After redirecting, it's already a new path. So the old PMTU info should be cleared. The IPv6 test "mtu exception plus redirect" should only has redirect info without old PMTU. The IPv4 test can not be changed because of legacy. Fixes: ec8105352869 ("selftests: Add redirect tests") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07selftests: icmp_redirect: remove from checking for IPv6 route getHangbin Liu1-1/+1
If the kernel doesn't enable option CONFIG_IPV6_SUBTREES, the RTA_SRC info will not be exported to userspace in rt6_fill_node(). And ip cmd will not print "from ::" to the route output. So remove this check. Fixes: ec8105352869 ("selftests: Add redirect tests") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07stmmac: platform: Fix signedness bug in stmmac_probe_config_dt()YueHaibing1-3/+5
The "plat->phy_interface" variable is an enum and in this context GCC will treat it as an unsigned int so the error handling is never triggered. Fixes: b9f0b2f634c0 ("net: stmmac: platform: fix probe for ACPI devices") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07stmmac: dwmac-loongson: Fix unsigned comparison to zeroYueHaibing1-4/+5
plat->phy_interface is unsigned integer, so the condition can't be less than zero and the warning will never printed. Fixes: 30bba69d7db4 ("stmmac: pci: Add dwmac support for Loongson") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07tools/runqslower: Use __state instead of stateSanjayKumar Jeyakumar1-1/+1
Commit 2f064a59a11f ("sched: Change task_struct::state") renamed task->state to task->__state in task_struct. Fix runqslower to use the new name of the field. Fixes: 2f064a59a11f ("sched: Change task_struct::state") Signed-off-by: SanjayKumar Jeyakumar <vjsanjay@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210707052914.21473-1-vjsanjay@gmail.com
2021-07-07netfilter: uapi: refer to nfnetlink_conntrack.h, not nf_conntrack_netlink.hDuncan Roe2-3/+3
nf_conntrack_netlink.h does not exist, refer to nfnetlink_conntrack.h instead. Signed-off-by: Duncan Roe <duncan_roe@optusnet.com.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-07libbpf: Restore errno return for functions that were already returning itToke Høiland-Jørgensen1-2/+2
The update to streamline libbpf error reporting intended to change all functions to return the errno as a negative return value if LIBBPF_STRICT_DIRECT_ERRS is set. However, if the flag is *not* set, the return value changes for the two functions that were already returning a negative errno unconditionally: bpf_link__unpin() and perf_buffer__poll(). This is a user-visible API change that breaks applications; so let's revert these two functions back to unconditionally returning a negative errno value. Fixes: e9fc3ce99b34 ("libbpf: Streamline error reporting for high-level APIs") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210706122355.236082-1-toke@redhat.com
2021-07-07ipv6: fix 'disable_policy' for fwd packetsNicolas Dichtel1-1/+3
The goal of commit df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl") was to have the disable_policy from ipv4 available on ipv6. However, it's not exactly the same mechanism. On IPv4, all packets coming from an interface, which has disable_policy set, bypass the policy check. For ipv6, this is done only for local packets, ie for packets destinated to an address configured on the incoming interface. Let's align ipv6 with ipv4 so that the 'disable_policy' sysctl has the same effect for both protocols. My first approach was to create a new kind of route cache entries, to be able to set DST_NOPOLICY without modifying routes. This would have added a lot of code. Because the local delivery path is already handled, I choose to focus on the forwarding path to minimize code churn. Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>