summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2020-08-05Merge tag 'seccomp-v5.9-rc1' of ↵Linus Torvalds2-40/+31
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: "There are a bunch of clean ups and selftest improvements along with two major updates to the SECCOMP_RET_USER_NOTIF filter return: EPOLLHUP support to more easily detect the death of a monitored process, and being able to inject fds when intercepting syscalls that expect an fd-opening side-effect (needed by both container folks and Chrome). The latter continued the refactoring of __scm_install_fd() started by Christoph, and in the process found and fixed a handful of bugs in various callers. - Improved selftest coverage, timeouts, and reporting - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner) - Refactor __scm_install_fd() into __receive_fd() and fix buggy callers - Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun Dhillon)" * tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits) selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD seccomp: Introduce addfd ioctl to seccomp user notifier fs: Expand __receive_fd() to accept existing fd pidfd: Replace open-coded receive_fd() fs: Add receive_fd() wrapper for __receive_fd() fs: Move __scm_install_fd() to __receive_fd() net/scm: Regularize compat handling of scm_detach_fds() pidfd: Add missing sock updates for pidfd_getfd() net/compat: Add missing sock updates for SCM_RIGHTS selftests/seccomp: Check ENOSYS under tracing selftests/seccomp: Refactor to use fixture variants selftests/harness: Clean up kern-doc for fixtures seccomp: Use -1 marker for end of mode 1 syscall list seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall() selftests/seccomp: Make kcmp() less required seccomp: Use pr_fmt selftests/seccomp: Improve calibration loop selftests/seccomp: use 90s as timeout selftests/seccomp: Expand benchmark to per-filter measurements ...
2020-08-04Merge tag 'sched-core-2020-08-03' of ↵Linus Torvalds1-1/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Improve uclamp performance by using a static key for the fast path - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for better power efficiency of RT tasks on battery powered devices. (The default is to maximize performance & reduce RT latencies.) - Improve utime and stime tracking accuracy, which had a fixed boundary of error, which created larger and larger relative errors as the values become larger. This is now replaced with more precise arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h. - Improve the deadline scheduler, such as making it capacity aware - Improve frequency-invariant scheduling - Misc cleanups in energy/power aware scheduling - Add sched_update_nr_running tracepoint to track changes to nr_running - Documentation additions and updates - Misc cleanups and smaller fixes * tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst sched/doc: Document capacity aware scheduling sched: Document arch_scale_*_capacity() arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE Documentation/sysctl: Document uclamp sysctl knobs sched/uclamp: Add a new sysctl to control RT default boost value sched/uclamp: Fix a deadlock when enabling uclamp static key sched: Remove duplicated tick_nohz_full_enabled() check sched: Fix a typo in a comment sched/uclamp: Remove unnecessary mutex_init() arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry arch_topology, sched/core: Cleanup thermal pressure definition trace/events/sched.h: fix duplicated word linux/sched/mm.h: drop duplicated words in comments smp: Fix a potential usage of stale nr_cpus sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal sched: nohz: stop passing around unused "ticks" parameter. sched: Better document ttwu() sched: Add a tracepoint to track rq->nr_running ...
2020-08-04Merge tag 'core-rcu-2020-08-03' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: - kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes * tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (109 commits) torture: Remove obsolete "cd $KVM" torture: Avoid duplicate specification of qemu command torture: Dump ftrace at shutdown only if requested torture: Add kvm-tranform.sh script for qemu-cmd files torture: Add more tracing crib notes to kvm.sh torture: Improve diagnostic for KCSAN-incapable compilers torture: Correctly summarize build-only runs torture: Pass --kmake-arg to all make invocations rcutorture: Check for unwatched readers torture: Abstract out console-log error detection torture: Add a stop-run capability torture: Create qemu-cmd in --buildonly runs rcu/rcutorture: Replace 0 with false torture: Add --allcpus argument to the kvm.sh script torture: Remove whitespace from identify_qemu_vcpus output rcutorture: NULL rcu_torture_current earlier in cleanup code rcutorture: Handle non-statistic bang-string error messages torture: Set configfile variable to current scenario rcutorture: Add races with task-exit processing locktorture: Use true and false to assign to bool variables ...
2020-07-31devlink: ignore -EOPNOTSUPP errors on dumpitJakub Kicinski1-6/+18
Number of .dumpit functions try to ignore -EOPNOTSUPP errors. Recent change missed that, and started reporting all errors but -EMSGSIZE back from dumps. This leads to situation like this: $ devlink dev info devlink answers: Operation not supported Dump should not report an error just because the last device to be queried could not provide an answer. To fix this and avoid similar confusion make sure we clear err properly, and not leave it set to an error if we don't terminate the iteration. Fixes: c62c2cfb801b ("net: devlink: don't ignore errors during dumpit") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31Merge branch 'for-mingo' of ↵Ingo Molnar1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull the v5.9 RCU bits from Paul E. McKenney: - Documentation updates - Miscellaneous fixes - kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-29mlxsw: spectrum: Use different trap group for externally routed packetsIdo Schimmel1-0/+1
Cited commit mistakenly removed the trap group for externally routed packets (e.g., via the management interface) and grouped locally routed and externally routed packet traps under the same group, thereby subjecting them to the same policer. This can result in problems, for example, when FRR is restarted and suddenly all transient traffic is trapped to the CPU because of a default route through the management interface. Locally routed packets required to re-establish a BGP connection will never reach the CPU and the routing tables will not be re-populated. Fix this by using a different trap group for externally routed packets. Fixes: 8110668ecd9a ("mlxsw: spectrum_trap: Register layer 3 control traps") Reported-by: Alex Veber <alexve@mellanox.com> Tested-by: Alex Veber <alexve@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-25dev: Defer free of skbs in flush_backlogSubash Abhinov Kasiviswanathan1-1/+1
IRQs are disabled when freeing skbs in input queue. Use the IRQ safe variant to free skbs here. Fixes: 145dd5f9c88f ("net: flush the softnet backlog in process context") Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-25flow_offload: Move rhashtable inclusion to the source fileHerbert Xu1-0/+1
I noticed that touching linux/rhashtable.h causes lib/vsprintf.c to be rebuilt. This dependency came through a bogus inclusion in the file net/flow_offload.h. This patch moves it to the right place. This patch also removes a lingering rhashtable inclusion in cls_api created by the same commit. Fixes: 4e481908c51b ("flow_offload: move tc indirect block to...") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-22Merge branch 'sched/urgent'Peter Zijlstra7-38/+94
2020-07-22net-sysfs: add a newline when printing 'tx_timeout' by sysfsXiongfeng Wang1-1/+1
When I cat 'tx_timeout' by sysfs, it displays as follows. It's better to add a newline for easy reading. root@syzkaller:~# cat /sys/devices/virtual/net/lo/queues/tx-0/tx_timeout 0root@syzkaller:~# Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-22udp: Copy has_conns in reuseport_grow().Kuniyuki Iwashima1-0/+1
If an unconnected socket in a UDP reuseport group connect()s, has_conns is set to 1. Then, when a packet is received, udp[46]_lib_lookup2() scans all sockets in udp_hslot looking for the connected socket with the highest score. However, when the number of sockets bound to the port exceeds max_socks, reuseport_grow() resets has_conns to 0. It can cause udp[46]_lib_lookup2() to return without scanning all sockets, resulting in that packets sent to connected sockets may be distributed to unconnected sockets. Therefore, reuseport_grow() should copy has_conns. Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets") CC: Willem de Bruijn <willemb@google.com> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-17rtnetlink: Fix memory(net_device) leak when ->newlink failsWeilong Chen1-1/+2
When vlan_newlink call register_vlan_dev fails, it might return error with dev->reg_state = NETREG_UNREGISTERED. The rtnl_newlink should free the memory. But currently rtnl_newlink only free the memory which state is NETREG_UNINITIALIZED. BUG: memory leak unreferenced object 0xffff8881051de000 (size 4096): comm "syz-executor139", pid 560, jiffies 4294745346 (age 32.445s) hex dump (first 32 bytes): 76 6c 61 6e 32 00 00 00 00 00 00 00 00 00 00 00 vlan2........... 00 45 28 03 81 88 ff ff 00 00 00 00 00 00 00 00 .E(............. backtrace: [<0000000047527e31>] kmalloc_node include/linux/slab.h:578 [inline] [<0000000047527e31>] kvmalloc_node+0x33/0xd0 mm/util.c:574 [<000000002b59e3bc>] kvmalloc include/linux/mm.h:753 [inline] [<000000002b59e3bc>] kvzalloc include/linux/mm.h:761 [inline] [<000000002b59e3bc>] alloc_netdev_mqs+0x83/0xd90 net/core/dev.c:9929 [<000000006076752a>] rtnl_create_link+0x2c0/0xa20 net/core/rtnetlink.c:3067 [<00000000572b3be5>] __rtnl_newlink+0xc9c/0x1330 net/core/rtnetlink.c:3329 [<00000000e84ea553>] rtnl_newlink+0x66/0x90 net/core/rtnetlink.c:3397 [<0000000052c7c0a9>] rtnetlink_rcv_msg+0x540/0x990 net/core/rtnetlink.c:5460 [<000000004b5cb379>] netlink_rcv_skb+0x12b/0x3a0 net/netlink/af_netlink.c:2469 [<00000000c71c20d3>] netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] [<00000000c71c20d3>] netlink_unicast+0x4c6/0x690 net/netlink/af_netlink.c:1329 [<00000000cca72fa9>] netlink_sendmsg+0x735/0xcc0 net/netlink/af_netlink.c:1918 [<000000009221ebf7>] sock_sendmsg_nosec net/socket.c:652 [inline] [<000000009221ebf7>] sock_sendmsg+0x109/0x140 net/socket.c:672 [<000000001c30ffe4>] ____sys_sendmsg+0x5f5/0x780 net/socket.c:2352 [<00000000b71ca6f3>] ___sys_sendmsg+0x11d/0x1a0 net/socket.c:2406 [<0000000007297384>] __sys_sendmsg+0xeb/0x1b0 net/socket.c:2439 [<000000000eb29b11>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359 [<000000006839b4d0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: cb626bf566eb ("net-sysfs: Fix reference count leak") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-13fs: Add receive_fd() wrapper for __receive_fd()Kees Cook1-1/+1
For both pidfd and seccomp, the __user pointer is not used. Update __receive_fd() to make writing to ufd optional via a NULL check. However, for the receive_fd_user() wrapper, ufd is NULL checked so an -EFAULT can be returned to avoid changing the SCM_RIGHTS interface behavior. Add new wrapper receive_fd() for pidfd and seccomp that does not use the ufd argument. For the new helper, the allocated fd needs to be returned on success. Update the existing callers to handle it. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-13fs: Move __scm_install_fd() to __receive_fd()Kees Cook1-26/+1
In preparation for users of the "install a received file" logic outside of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from net/core/scm.c to __receive_fd() in fs/file.c, and provide a wrapper named receive_fd_user(), as future patches will change the interface to __receive_fd(). Additionally add a comment to fd_install() as a counterpoint to how __receive_fd() interacts with fput(). Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Dmitry Kadashev <dkadashev@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Ido Schimmel <idosch@idosch.org> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Cc: linux-fsdevel@vger.kernel.org Cc: netdev@vger.kernel.org Reviewed-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-13net/scm: Regularize compat handling of scm_detach_fds()Kees Cook1-16/+11
Duplicate the cleanups from commit 2618d530dd8b ("net/scm: cleanup scm_detach_fds") into the compat code. Replace open-coded __receive_sock() with a call to the helper. Move the check added in commit 1f466e1f15cf ("net: cleanly handle kernel vs user buffers for ->msg_control") to before the compat call, even though it should be impossible for an in-kernel call to also be compat. Correct the int "flags" argument to unsigned int to match fd_install() and similar APIs. Regularize any remaining differences, including a whitespace issue, a checkpatch warning, and add the check from commit 6900317f5eff ("net, scm: fix PaX detected msg_controllen overflow in scm_detach_fds") which fixed an overflow unique to 64-bit. To avoid confusion when comparing the compat handler to the native handler, just include the same check in the compat handler. Cc: Christoph Hellwig <hch@lst.de> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-13net/compat: Add missing sock updates for SCM_RIGHTSKees Cook1-0/+21
Add missed sock updates to compat path via a new helper, which will be used more in coming patches. (The net/core/scm.c code is left as-is here to assist with -stable backports for the compat path.) Cc: Christoph Hellwig <hch@lst.de> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jakub Kicinski <kuba@kernel.org> Cc: stable@vger.kernel.org Fixes: 48a87cc26c13 ("net: netprio: fd passed in SCM_RIGHTS datagram not set correctly") Fixes: d84295067fc7 ("net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds6-37/+93
Pull networking fixes from David Miller: 1) Restore previous behavior of CAP_SYS_ADMIN wrt loading networking BPF programs, from Maciej Żenczykowski. 2) Fix dropped broadcasts in mac80211 code, from Seevalamuthu Mariappan. 3) Slay memory leak in nl80211 bss color attribute parsing code, from Luca Coelho. 4) Get route from skb properly in ip_route_use_hint(), from Miaohe Lin. 5) Don't allow anything other than ARPHRD_ETHER in llc code, from Eric Dumazet. 6) xsk code dips too deeply into DMA mapping implementation internals. Add dma_need_sync and use it. From Christoph Hellwig 7) Enforce power-of-2 for BPF ringbuf sizes. From Andrii Nakryiko. 8) Check for disallowed attributes when loading flow dissector BPF programs. From Lorenz Bauer. 9) Correct packet injection to L3 tunnel devices via AF_PACKET, from Jason A. Donenfeld. 10) Don't advertise checksum offload on ipa devices that don't support it. From Alex Elder. 11) Resolve several issues in TCP MD5 signature support. Missing memory barriers, bogus options emitted when using syncookies, and failure to allow md5 key changes in established states. All from Eric Dumazet. 12) Fix interface leak in hsr code, from Taehee Yoo. 13) VF reset fixes in hns3 driver, from Huazhong Tan. 14) Make loopback work again with ipv6 anycast, from David Ahern. 15) Fix TX starvation under high load in fec driver, from Tobias Waldekranz. 16) MLD2 payload lengths not checked properly in bridge multicast code, from Linus Lüssing. 17) Packet scheduler code that wants to find the inner protocol currently only works for one level of VLAN encapsulation. Allow Q-in-Q situations to work properly here, from Toke Høiland-Jørgensen. 18) Fix route leak in l2tp, from Xin Long. 19) Resolve conflict between the sk->sk_user_data usage of bpf reuseport support and various protocols. From Martin KaFai Lau. 20) Fix socket cgroup v2 reference counting in some situations, from Cong Wang. 21) Cure memory leak in mlx5 connection tracking offload support, from Eli Britstein. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits) mlxsw: pci: Fix use-after-free in case of failed devlink reload mlxsw: spectrum_router: Remove inappropriate usage of WARN_ON() net: macb: fix call to pm_runtime in the suspend/resume functions net: macb: fix macb_suspend() by removing call to netif_carrier_off() net: macb: fix macb_get/set_wol() when moving to phylink net: macb: mark device wake capable when "magic-packet" property present net: macb: fix wakeup test in runtime suspend/resume routines bnxt_en: fix NULL dereference in case SR-IOV configuration fails libbpf: Fix libbpf hashmap on (I)LP32 architectures net/mlx5e: CT: Fix memory leak in cleanup net/mlx5e: Fix port buffers cell size value net/mlx5e: Fix 50G per lane indication net/mlx5e: Fix CPU mapping after function reload to avoid aRFS RX crash net/mlx5e: Fix VXLAN configuration restore after function reload net/mlx5e: Fix usage of rcu-protected pointer net/mxl5e: Verify that rpriv is not NULL net/mlx5: E-Switch, Fix vlan or qos setting in legacy mode net/mlx5: Fix eeprom support for SFP module cgroup: Fix sock_cgroup_data on big-endian. selftests: bpf: Fix detach from sockmap tests ...
2020-07-09Merge tag 'kallsyms_show_value-v5.8-rc5' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kallsyms fix from Kees Cook: "Refactor kallsyms_show_value() users for correct cred. I'm not delighted by the timing of getting these changes to you, but it does fix a handful of kernel address exposures, and no one has screamed yet at the patches. Several users of kallsyms_show_value() were performing checks not during "open". Refactor everything needed to gain proper checks against file->f_cred for modules, kprobes, and bpf" * tag 'kallsyms_show_value-v5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: selftests: kmod: Add module address visibility test bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok() kprobes: Do not expose probe addresses to non-CAP_SYSLOG module: Do not expose section addresses to non-CAP_SYSLOG module: Refactor section attr into bin attribute kallsyms: Refactor kallsyms_show_value() to take cred
2020-07-09bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok()Kees Cook1-1/+1
When evaluating access control over kallsyms visibility, credentials at open() time need to be used, not the "current" creds (though in BPF's case, this has likely always been the same). Plumb access to associated file->f_cred down through bpf_dump_raw_ok() and its callers now that kallsysm_show_value() has been refactored to take struct cred. Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: bpf@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 7105e828c087 ("bpf: allow for correlation of maps and helpers in dump") Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-08net: Restrict receive packets queuing to housekeeping CPUsAlex Belits1-1/+9
With the existing implementation of store_rps_map(), packets are queued in the receive path on the backlog queues of other CPUs irrespective of whether they are isolated or not. This could add a latency overhead to any RT workload that is running on the same CPU. Ensure that store_rps_map() only uses available housekeeping CPUs for storing the rps_map. Signed-off-by: Alex Belits <abelits@marvell.com> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200625223443.2684-4-nitesh@redhat.com
2020-07-07cgroup: fix cgroup_sk_alloc() for sk_clone_lock()Cong Wang1-1/+1
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is copied, so the cgroup refcnt must be taken too. And, unlike the sk_alloc() path, sock_update_netprioidx() is not called here. Therefore, it is safe and necessary to grab the cgroup refcnt even when cgroup_sk_alloc is disabled. sk_clone_lock() is in BH context anyway, the in_interrupt() would terminate this function if called there. And for sk_alloc() skcd->val is always zero. So it's safe to factor out the code to make it more readable. The global variable 'cgroup_sk_alloc_disabled' is used to determine whether to take these reference counts. It is impossible to make the reference counting correct unless we save this bit of information in skcd->val. So, add a new bit there to record whether the socket has already taken the reference counts. This obviously relies on kmalloc() to align cgroup pointers to at least 4 bytes, ARCH_KMALLOC_MINALIGN is certainly larger than that. This bug seems to be introduced since the beginning, commit d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets") tried to fix it but not compeletely. It seems not easy to trigger until the recent commit 090e28b229af ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged. Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Reported-by: Cameron Berkenpas <cam@neo-zeon.de> Reported-by: Peter Geis <pgwipeout@gmail.com> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reported-by: Daniël Sonck <dsonck92@gmail.com> Reported-by: Zhang Qiang <qiang.zhang@windriver.com> Tested-by: Cameron Berkenpas <cam@neo-zeon.de> Tested-by: Peter Geis <pgwipeout@gmail.com> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Zefan Li <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04sched: consistently handle layer3 header accesses in the presence of VLANsToke Høiland-Jørgensen1-3/+7
There are a couple of places in net/sched/ that check skb->protocol and act on the value there. However, in the presence of VLAN tags, the value stored in skb->protocol can be inconsistent based on whether VLAN acceleration is enabled. The commit quoted in the Fixes tag below fixed the users of skb->protocol to use a helper that will always see the VLAN ethertype. However, most of the callers don't actually handle the VLAN ethertype, but expect to find the IP header type in the protocol field. This means that things like changing the ECN field, or parsing diffserv values, stops working if there's a VLAN tag, or if there are multiple nested VLAN tags (QinQ). To fix this, change the helper to take an argument that indicates whether the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we make sure to skip all of them, so behaviour is consistent even in QinQ mode. To make the helper usable from the ECN code, move it to if_vlan.h instead of pkt_sched.h. v3: - Remove empty lines - Move vlan variable definitions inside loop in skb_protocol() - Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and bpf_skb_ecn_set_ce() v2: - Use eth_type_vlan() helper in skb_protocol() - Also fix code that reads skb->protocol directly - Change a couple of 'if/else if' statements to switch constructs to avoid calling the helper twice Reported-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com> Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller3-33/+75
Daniel Borkmann says: ==================== pull-request: bpf 2020-06-30 The following pull-request contains BPF updates for your *net* tree. We've added 28 non-merge commits during the last 9 day(s) which contain a total of 35 files changed, 486 insertions(+), 232 deletions(-). The main changes are: 1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer types, from Yonghong Song. 2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki. 3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb internals and integrate it properly into DMA core, from Christoph Hellwig. 4) Fix RCU splat from recent changes to avoid skipping ingress policy when kTLS is enabled, from John Fastabend. 5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its position masking to work, from Andrii Nakryiko. 6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading of network programs, from Maciej Żenczykowski. 7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer. 8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30bpf: sockmap: Require attach_bpf_fd when detaching a programLorenz Bauer1-5/+45
The sockmap code currently ignores the value of attach_bpf_fd when detaching a program. This is contrary to the usual behaviour of checking that attach_bpf_fd represents the currently attached program. Ensure that attach_bpf_fd is indeed the currently attached program. It turns out that all sockmap selftests already do this, which indicates that this is unlikely to cause breakage. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com
2020-06-30bpf: sockmap: Check value of unused args to BPF_PROG_ATTACHLorenz Bauer1-0/+3
Using BPF_PROG_ATTACH on a sockmap program currently understands no flags or replace_bpf_fd, but accepts any value. Return EINVAL instead. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-4-lmb@cloudflare.com
2020-06-30bpf, netns: Keep attached programs in bpf_prog_arrayJakub Sitnicki1-9/+10
Prepare for having multi-prog attachments for new netns attach types by storing programs to run in a bpf_prog_array, which is well suited for iterating over programs and running them in sequence. After this change bpf(PROG_QUERY) may block to allocate memory in bpf_prog_array_copy_to_user() for collected program IDs. This forces a change in how we protect access to the attached program in the query callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from an RCU read lock to holding a mutex that serializes updaters. Because we allow only one BPF flow_dissector program to be attached to netns at all times, the bpf_prog_array pointed by net->bpf.run_array is always either detached (null) or one element long. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
2020-06-30flow_dissector: Pull BPF program assignment up to bpf-netnsJakub Sitnicki1-11/+2
Prepare for using bpf_prog_array to store attached programs by moving out code that updates the attached program out of flow dissector. Managing bpf_prog_array is more involved than updating a single bpf_prog pointer. This will let us do it all from one place, bpf/net_namespace.c, in the subsequent patch. No functional change intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-2-jakub@cloudflare.com
2020-06-29docs: RCU: Convert rculist_nulls.txt to ReSTMauro Carvalho Chehab1-2/+2
- Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to RCU/index.rst. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29net: explain the lockdep annotations for dev_uc_unsync()Cong Wang1-0/+10
The lockdep annotations for dev_uc_unsync() and dev_mc_unsync() are not easy to understand, so add some comments to explain why they are correct. Similar for the rest netif_addr_lock_bh() cases, they don't need nested version. Cc: Taehee Yoo <ap420073@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28bpf, sockmap: RCU dereferenced psock may be used outside RCU blockJohn Fastabend1-1/+9
If an ingress verdict program specifies message sizes greater than skb->len and there is an ENOMEM error due to memory pressure we may call the rcv_msg handler outside the strp_data_ready() caller context. This is because on an ENOMEM error the strparser will retry from a workqueue. The caller currently protects the use of psock by calling the strp_data_ready() inside a rcu_read_lock/unlock block. But, in above workqueue error case the psock is accessed outside the read_lock/unlock block of the caller. So instead of using psock directly we must do a look up against the sk again to ensure the psock is available. There is an an ugly piece here where we must handle the case where we paused the strp and removed the psock. On psock removal we first pause the strparser and then remove the psock. If the strparser is paused while an skb is scheduled on the workqueue the skb will be dropped on the flow and kfree_skb() is called. If the workqueue manages to get called before we pause the strparser but runs the rcvmsg callback after the psock is removed we will hit the unlikely case where we run the sockmap rcvmsg handler but do not have a psock. For now we will follow strparser logic and drop the skb on the floor with skb_kfree(). This is ugly because the data is dropped. To date this has not caused problems in practice because either the application controlling the sockmap is coordinating with the datapath so that skbs are "flushed" before removal or we simply wait for the sock to be closed before removing it. This patch fixes the describe RCU bug and dropping the skb doesn't make things worse. Future patches will improve this by allowing the normal case where skbs are not merged to skip the strparser altogether. In practice many (most?) use cases have no need to merge skbs so its both a code complexity hit as seen above and a performance issue. For example, in the Cilium case we always set the strparser up to return sbks 1:1 without any merging and have avoided above issues. Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370
2020-06-28bpf, sockmap: RCU splat with redirect and strparser error or TLSJohn Fastabend1-7/+6
There are two paths to generate the below RCU splat the first and most obvious is the result of the BPF verdict program issuing a redirect on a TLS socket (This is the splat shown below). Unlike the non-TLS case the caller of the *strp_read() hooks does not wrap the call in a rcu_read_lock/unlock. Then if the BPF program issues a redirect action we hit the RCU splat. However, in the non-TLS socket case the splat appears to be relatively rare, because the skmsg caller into the strp_data_ready() is wrapped in a rcu_read_lock/unlock. Shown here, static void sk_psock_strp_data_ready(struct sock *sk) { struct sk_psock *psock; rcu_read_lock(); psock = sk_psock(sk); if (likely(psock)) { if (tls_sw_has_ctx_rx(sk)) { psock->parser.saved_data_ready(sk); } else { write_lock_bh(&sk->sk_callback_lock); strp_data_ready(&psock->parser.strp); write_unlock_bh(&sk->sk_callback_lock); } } rcu_read_unlock(); } If the above was the only way to run the verdict program we would be safe. But, there is a case where the strparser may throw an ENOMEM error while parsing the skb. This is a result of a failed skb_clone, or alloc_skb_for_msg while building a new merged skb when the msg length needed spans multiple skbs. This will in turn put the skb on the strp_wrk workqueue in the strparser code. The skb will later be dequeued and verdict programs run, but now from a different context without the rcu_read_lock()/unlock() critical section in sk_psock_strp_data_ready() shown above. In practice I have not seen this yet, because as far as I know most users of the verdict programs are also only working on single skbs. In this case no merge happens which could trigger the above ENOMEM errors. In addition the system would need to be under memory pressure. For example, we can't hit the above case in selftests because we missed having tests to merge skbs. (Added in later patch) To fix the below splat extend the rcu_read_lock/unnlock block to include the call to sk_psock_tls_verdict_apply(). This will fix both TLS redirect case and non-TLS redirect+error case. Also remove psock from the sk_psock_tls_verdict_apply() function signature its not used there. [ 1095.937597] WARNING: suspicious RCU usage [ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G W [ 1095.944363] ----------------------------- [ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage! [ 1095.950866] [ 1095.950866] other info that might help us debug this: [ 1095.950866] [ 1095.957146] [ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1 [ 1095.961482] 1 lock held by test_sockmap/15970: [ 1095.964501] #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls] [ 1095.968568] [ 1095.968568] stack backtrace: [ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G W 5.7.0-rc7-02911-g463bac5f1ca79 #1 [ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 1095.980519] Call Trace: [ 1095.982191] dump_stack+0x8f/0xd0 [ 1095.984040] sk_psock_skb_redirect+0xa6/0xf0 [ 1095.986073] sk_psock_tls_strp_read+0x1d8/0x250 [ 1095.988095] tls_sw_recvmsg+0x714/0x840 [tls] v2: Improve commit message to identify non-TLS redirect plus error case condition as well as more common TLS case. In the process I decided doing the rcu_read_unlock followed by the lock/unlock inside branches was unnecessarily complex. We can just extend the current rcu block and get the same effeective without the shuffling and branching. Thanks Martin! Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls") Reported-by: Jakub Sitnicki <jakub@cloudflare.com> Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370
2020-06-24net: Do not clear the sock TX queue in sk_set_socket()Tariq Toukan1-0/+2
Clearing the sock TX queue in sk_set_socket() might cause unexpected out-of-order transmit when called from sock_orphan(), as outstanding packets can pick a different TX queue and bypass the ones already queued. This is undesired in general. More specifically, it breaks the in-order scheduling property guarantee for device-offloaded TLS sockets. Remove the call to sk_tx_queue_clear() in sk_set_socket(), and add it explicitly only where needed. Fixes: e022f0b4a03f ("net: Introduce sk_tx_queue_mapping") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-21net: Add MODULE_DESCRIPTION entries to network modulesRob Gill1-0/+1
The user tool modinfo is used to get information on kernel modules, including a description where it is available. This patch adds a brief MODULE_DESCRIPTION to the following modules: 9p drop_monitor esp4_offload esp6_offload fou fou6 ila sch_fq sch_fq_codel sch_hhf Signed-off-by: Rob Gill <rrobgill@protonmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20net: flow_offload: fix flow_indr_dev_unregister pathwenxu1-6/+10
If the representor is removed, then identify the indirect flow_blocks that need to be removed by the release callback and the port representor structure. To identify the port representor structure, a new indr.cb_priv field needs to be introduced. The flow_block also needs to be removed from the driver list from the cleanup path. Fixes: 1fac52da5942 ("net: flow_offload: consolidate indirect flow_block infrastructure") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20flow_offload: use flow_indr_block_cb_alloc/remove functionwenxu1-21/+1
Prepare fix the bug in the next patch. use flow_indr_block_cb_alloc/remove function and remove the __flow_block_indr_binding. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20flow_offload: add flow_indr_block_cb_alloc/remove functionwenxu1-0/+21
Add flow_indr_block_cb_alloc/remove function for next fix patch. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19net: increment xmit_recursion level in dev_direct_xmit()Eric Dumazet2-1/+3
Back in commit f60e5990d9c1 ("ipv6: protect skb->sk accesses from recursive dereference inside the stack") Hannes added code so that IPv6 stack would not trust skb->sk for typical cases where packet goes through 'standard' xmit path (__dev_queue_xmit()) Alas af_packet had a dev_direct_xmit() path that was not dealing yet with xmit_recursion level. Also change sk_mc_loop() to dump a stack once only. Without this patch, syzbot was able to trigger : [1] [ 153.567378] WARNING: CPU: 7 PID: 11273 at net/core/sock.c:721 sk_mc_loop+0x51/0x70 [ 153.567378] Modules linked in: nfnetlink ip6table_raw ip6table_filter iptable_raw iptable_nat nf_nat nf_conntrack nf_defrag_ipv4 nf_defrag_ipv6 iptable_filter macsec macvtap tap macvlan 8021q hsr wireguard libblake2s blake2s_x86_64 libblake2s_generic udp_tunnel ip6_udp_tunnel libchacha20poly1305 poly1305_x86_64 chacha_x86_64 libchacha curve25519_x86_64 libcurve25519_generic netdevsim batman_adv dummy team bridge stp llc w1_therm wire i2c_mux_pca954x i2c_mux cdc_acm ehci_pci ehci_hcd mlx4_en mlx4_ib ib_uverbs ib_core mlx4_core [ 153.567386] CPU: 7 PID: 11273 Comm: b159172088 Not tainted 5.8.0-smp-DEV #273 [ 153.567387] RIP: 0010:sk_mc_loop+0x51/0x70 [ 153.567388] Code: 66 83 f8 0a 75 24 0f b6 4f 12 b8 01 00 00 00 31 d2 d3 e0 a9 bf ef ff ff 74 07 48 8b 97 f0 02 00 00 0f b6 42 3a 83 e0 01 5d c3 <0f> 0b b8 01 00 00 00 5d c3 0f b6 87 18 03 00 00 5d c0 e8 04 83 e0 [ 153.567388] RSP: 0018:ffff95c69bb93990 EFLAGS: 00010212 [ 153.567388] RAX: 0000000000000011 RBX: ffff95c6e0ee3e00 RCX: 0000000000000007 [ 153.567389] RDX: ffff95c69ae50000 RSI: ffff95c6c30c3000 RDI: ffff95c6c30c3000 [ 153.567389] RBP: ffff95c69bb93990 R08: ffff95c69a77f000 R09: 0000000000000008 [ 153.567389] R10: 0000000000000040 R11: 00003e0e00026128 R12: ffff95c6c30c3000 [ 153.567390] R13: ffff95c6cc4fd500 R14: ffff95c6f84500c0 R15: ffff95c69aa13c00 [ 153.567390] FS: 00007fdc3a283700(0000) GS:ffff95c6ff9c0000(0000) knlGS:0000000000000000 [ 153.567390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 153.567391] CR2: 00007ffee758e890 CR3: 0000001f9ba20003 CR4: 00000000001606e0 [ 153.567391] Call Trace: [ 153.567391] ip6_finish_output2+0x34e/0x550 [ 153.567391] __ip6_finish_output+0xe7/0x110 [ 153.567391] ip6_finish_output+0x2d/0xb0 [ 153.567392] ip6_output+0x77/0x120 [ 153.567392] ? __ip6_finish_output+0x110/0x110 [ 153.567392] ip6_local_out+0x3d/0x50 [ 153.567392] ipvlan_queue_xmit+0x56c/0x5e0 [ 153.567393] ? ksize+0x19/0x30 [ 153.567393] ipvlan_start_xmit+0x18/0x50 [ 153.567393] dev_direct_xmit+0xf3/0x1c0 [ 153.567393] packet_direct_xmit+0x69/0xa0 [ 153.567394] packet_sendmsg+0xbf0/0x19b0 [ 153.567394] ? plist_del+0x62/0xb0 [ 153.567394] sock_sendmsg+0x65/0x70 [ 153.567394] sock_write_iter+0x93/0xf0 [ 153.567394] new_sync_write+0x18e/0x1a0 [ 153.567395] __vfs_write+0x29/0x40 [ 153.567395] vfs_write+0xb9/0x1b0 [ 153.567395] ksys_write+0xb1/0xe0 [ 153.567395] __x64_sys_write+0x1a/0x20 [ 153.567395] do_syscall_64+0x43/0x70 [ 153.567396] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 153.567396] RIP: 0033:0x453549 [ 153.567396] Code: Bad RIP value. [ 153.567396] RSP: 002b:00007fdc3a282cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 153.567397] RAX: ffffffffffffffda RBX: 00000000004d32d0 RCX: 0000000000453549 [ 153.567397] RDX: 0000000000000020 RSI: 0000000020000300 RDI: 0000000000000003 [ 153.567398] RBP: 00000000004d32d8 R08: 0000000000000000 R09: 0000000000000000 [ 153.567398] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004d32dc [ 153.567398] R13: 00007ffee742260f R14: 00007fdc3a282dc0 R15: 00007fdc3a283700 [ 153.567399] ---[ end trace c1d5ae2b1059ec62 ]--- f60e5990d9c1 ("ipv6: protect skb->sk accesses from recursive dereference inside the stack") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19net: fix memleak in register_netdevice()Yang Yingliang1-0/+7
I got a memleak report when doing some fuzz test: unreferenced object 0xffff888112584000 (size 13599): comm "ip", pid 3048, jiffies 4294911734 (age 343.491s) hex dump (first 32 bytes): 74 61 70 30 00 00 00 00 00 00 00 00 00 00 00 00 tap0............ 00 ee d9 19 81 88 ff ff 00 00 00 00 00 00 00 00 ................ backtrace: [<000000002f60ba65>] __kmalloc_node+0x309/0x3a0 [<0000000075b211ec>] kvmalloc_node+0x7f/0xc0 [<00000000d3a97396>] alloc_netdev_mqs+0x76/0xfc0 [<00000000609c3655>] __tun_chr_ioctl+0x1456/0x3d70 [<000000001127ca24>] ksys_ioctl+0xe5/0x130 [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0 [<00000000e1023498>] do_syscall_64+0x56/0xa0 [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 unreferenced object 0xffff888111845cc0 (size 8): comm "ip", pid 3048, jiffies 4294911734 (age 343.491s) hex dump (first 8 bytes): 74 61 70 30 00 88 ff ff tap0.... backtrace: [<000000004c159777>] kstrdup+0x35/0x70 [<00000000d8b496ad>] kstrdup_const+0x3d/0x50 [<00000000494e884a>] kvasprintf_const+0xf1/0x180 [<0000000097880a2b>] kobject_set_name_vargs+0x56/0x140 [<000000008fbdfc7b>] dev_set_name+0xab/0xe0 [<000000005b99e3b4>] netdev_register_kobject+0xc0/0x390 [<00000000602704fe>] register_netdevice+0xb61/0x1250 [<000000002b7ca244>] __tun_chr_ioctl+0x1cd1/0x3d70 [<000000001127ca24>] ksys_ioctl+0xe5/0x130 [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0 [<00000000e1023498>] do_syscall_64+0x56/0xa0 [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 unreferenced object 0xffff88811886d800 (size 512): comm "ip", pid 3048, jiffies 4294911734 (age 343.491s) hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... ff ff ff ff ff ff ff ff c0 66 3d a3 ff ff ff ff .........f=..... backtrace: [<0000000050315800>] device_add+0x61e/0x1950 [<0000000021008dfb>] netdev_register_kobject+0x17e/0x390 [<00000000602704fe>] register_netdevice+0xb61/0x1250 [<000000002b7ca244>] __tun_chr_ioctl+0x1cd1/0x3d70 [<000000001127ca24>] ksys_ioctl+0xe5/0x130 [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0 [<00000000e1023498>] do_syscall_64+0x56/0xa0 [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 If call_netdevice_notifiers() failed, then rollback_registered() calls netdev_unregister_kobject() which holds the kobject. The reference cannot be put because the netdev won't be add to todo list, so it will leads a memleak, we need put the reference to avoid memleak. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-17xdp: Handle frame_sz in xdp_convert_zc_to_xdp_frame()Hangbin Liu1-0/+1
In commit 34cc0b338a61 we only handled the frame_sz in convert_to_xdp_frame(). This patch will also handle frame_sz in xdp_convert_zc_to_xdp_frame(). Fixes: 34cc0b338a61 ("xdp: Xdp_frame add member frame_sz and handle in convert_to_xdp_frame") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200616103518.2963410-1-liuhangbin@gmail.com
2020-06-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds5-37/+63
Pull networking fixes from David Miller: 1) Fix cfg80211 deadlock, from Johannes Berg. 2) RXRPC fails to send norigications, from David Howells. 3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from Geliang Tang. 4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu. 5) The ucc_geth driver needs __netdev_watchdog_up exported, from Valentin Longchamp. 6) Fix hashtable memory leak in dccp, from Wang Hai. 7) Fix how nexthops are marked as FDB nexthops, from David Ahern. 8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni. 9) Fix crashes in tipc_disc_rcv(), from Tuong Lien. 10) Fix link speed reporting in iavf driver, from Brett Creeley. 11) When a channel is used for XSK and then reused again later for XSK, we forget to clear out the relevant data structures in mlx5 which causes all kinds of problems. Fix from Maxim Mikityanskiy. 12) Fix memory leak in genetlink, from Cong Wang. 13) Disallow sockmap attachments to UDP sockets, it simply won't work. From Lorenz Bauer. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits) net: ethernet: ti: ale: fix allmulti for nu type ale net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init net: atm: Remove the error message according to the atomic context bpf: Undo internal BPF_PROBE_MEM in BPF insns dump libbpf: Support pre-initializing .bss global variables tools/bpftool: Fix skeleton codegen bpf: Fix memlock accounting for sock_hash bpf: sockmap: Don't attach programs to UDP sockets bpf: tcp: Recv() should return 0 when the peer socket is closed ibmvnic: Flush existing work items before device removal genetlink: clean up family attributes allocations net: ipa: header pad field only valid for AP->modem endpoint net: ipa: program upper nibbles of sequencer type net: ipa: fix modem LAN RX endpoint id net: ipa: program metadata mask differently ionic: add pcie_print_link_status rxrpc: Fix race between incoming ACK parser and retransmitter net/mlx5: E-Switch, Fix some error pointer dereferences net/mlx5: Don't fail driver on failure to create debugfs net/mlx5e: CT: Fix ipv6 nat header rewrite actions ...
2020-06-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2-16/+41
Alexei Starovoitov says: ==================== pull-request: bpf 2020-06-12 The following pull-request contains BPF updates for your *net* tree. We've added 26 non-merge commits during the last 10 day(s) which contain a total of 27 files changed, 348 insertions(+), 93 deletions(-). The main changes are: 1) sock_hash accounting fix, from Andrey. 2) libbpf fix and probe_mem sanitizing, from Andrii. 3) sock_hash fixes, from Jakub. 4) devmap_val fix, from Jesper. 5) load_bytes_relative fix, from YiFei. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-13bpf: Fix memlock accounting for sock_hashAndrey Ignatov1-0/+4
Add missed bpf_map_charge_init() in sock_hash_alloc() and correspondingly bpf_map_charge_finish() on ENOMEM. It was found accidentally while working on unrelated selftest that checks "map->memory.pages > 0" is true for all map types. Before: # bpftool m l ... 3692: sockhash name m_sockhash flags 0x0 key 4B value 4B max_entries 8 memlock 0B After: # bpftool m l ... 84: sockmap name m_sockmap flags 0x0 key 4B value 4B max_entries 8 memlock 4096B Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200612000857.2881453-1-rdna@fb.com
2020-06-13bpf: sockmap: Don't attach programs to UDP socketsLorenz Bauer1-4/+6
The stream parser infrastructure isn't set up to deal with UDP sockets, so we mustn't try to attach programs to them. I remember making this change at some point, but I must have lost it while rebasing or something similar. Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20200611172520.327602-1-lmb@cloudflare.com
2020-06-11net/filter: Permit reading NET in load_bytes_relative when MAC not setYiFei Zhu1-7/+9
Added a check in the switch case on start_header that checks for the existence of the header, and in the case that MAC is not set and the caller requests for MAC, -EFAULT. If the caller requests for NET then MAC's existence is completely ignored. There is no function to check NET header's existence and as far as cgroup_skb/egress is concerned it should always be set. Removed for ptr >= the start of header, considering offset is bounded unsigned and should always be true. len <= end - mac is redundant to ptr + len <= end. Fixes: 3eee1f75f2b9 ("bpf: fix bpf_skb_load_bytes_relative pkt length check") Signed-off-by: YiFei Zhu <zhuyifei@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/76bb820ddb6a95f59a772ecbd8c8a336f646b362.1591812755.git.zhuyifei@google.com
2020-06-11Merge branch 'work.sysctl' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull sysctl fixes from Al Viro: "Fixups to regressions in sysctl series" * 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: sysctl: reject gigantic reads/write to sysctl files cdrom: fix an incorrect __user annotation on cdrom_sysctl_info trace: fix an incorrect __user annotation on stack_trace_sysctl random: fix an incorrect __user annotation on proc_do_entropy net/sysctl: remove leftover __user annotations on neigh_proc_dointvec* net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl
2020-06-09net: change addr_list_lock back to static keyCong Wang3-21/+22
The dynamic key update for addr_list_lock still causes troubles, for example the following race condition still exists: CPU 0: CPU 1: (RCU read lock) (RTNL lock) dev_mc_seq_show() netdev_update_lockdep_key() -> lockdep_unregister_key() -> netif_addr_lock_bh() because lockdep doesn't provide an API to update it atomically. Therefore, we have to move it back to static keys and use subclass for nest locking like before. In commit 1a33e10e4a95 ("net: partially revert dynamic lockdep key changes"), I already reverted most parts of commit ab92d68fc22f ("net: core: add generic lockdep keys"). This patch reverts the rest and also part of commit f3b0a18bb6cb ("net: remove unnecessary variables and callback"). After this patch, addr_list_lock changes back to using static keys and subclasses to satisfy lockdep. Thanks to dev->lower_level, we do not have to change back to ->ndo_get_lock_subclass(). And hopefully this reduces some syzbot lockdep noises too. Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com Cc: Taehee Yoo <ap420073@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-09bpf, sockhash: Synchronize delete from bucket list on map freeJakub Sitnicki1-2/+21
We can end up modifying the sockhash bucket list from two CPUs when a sockhash is being destroyed (sock_hash_free) on one CPU, while a socket that is in the sockhash is unlinking itself from it on another CPU it (sock_hash_delete_from_link). This results in accessing a list element that is in an undefined state as reported by KASAN: | ================================================================== | BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280 | Write of size 8 at addr dead000000000122 by task kworker/2:1/95 | | CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 | Workqueue: events bpf_map_free_deferred | Call Trace: | dump_stack+0x97/0xe0 | ? sock_hash_free+0x13c/0x280 | __kasan_report.cold+0x5/0x40 | ? mark_lock+0xbc1/0xc00 | ? sock_hash_free+0x13c/0x280 | kasan_report+0x38/0x50 | ? sock_hash_free+0x152/0x280 | sock_hash_free+0x13c/0x280 | bpf_map_free_deferred+0xb2/0xd0 | ? bpf_map_charge_finish+0x50/0x50 | ? rcu_read_lock_sched_held+0x81/0xb0 | ? rcu_read_lock_bh_held+0x90/0x90 | process_one_work+0x59a/0xac0 | ? lock_release+0x3b0/0x3b0 | ? pwq_dec_nr_in_flight+0x110/0x110 | ? rwlock_bug.part.0+0x60/0x60 | worker_thread+0x7a/0x680 | ? _raw_spin_unlock_irqrestore+0x4c/0x60 | kthread+0x1cc/0x220 | ? process_one_work+0xac0/0xac0 | ? kthread_create_on_node+0xa0/0xa0 | ret_from_fork+0x24/0x30 | ================================================================== Fix it by reintroducing spin-lock protected critical section around the code that removes the elements from the bucket on sockhash free. To do that we also need to defer processing of removed elements, until out of atomic context so that we can unlink the socket from the map when holding the sock lock. Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200607205229.2389672-3-jakub@cloudflare.com
2020-06-09bpf, sockhash: Fix memory leak when unlinking sockets in sock_hash_freeJakub Sitnicki1-0/+1
When sockhash gets destroyed while sockets are still linked to it, we will walk the bucket lists and delete the links. However, we are not freeing the list elements after processing them, leaking the memory. The leak can be triggered by close()'ing a sockhash map when it still contains sockets, and observed with kmemleak: unreferenced object 0xffff888116e86f00 (size 64): comm "race_sock_unlin", pid 223, jiffies 4294731063 (age 217.404s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 81 de e8 41 00 00 00 00 c0 69 2f 15 81 88 ff ff ...A.....i/..... backtrace: [<00000000dd089ebb>] sock_hash_update_common+0x4ca/0x760 [<00000000b8219bd5>] sock_hash_update_elem+0x1d2/0x200 [<000000005e2c23de>] __do_sys_bpf+0x2046/0x2990 [<00000000d0084618>] do_syscall_64+0xad/0x9a0 [<000000000d96f263>] entry_SYSCALL_64_after_hwframe+0x49/0xb3 Fix it by freeing the list element when we're done with it. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200607205229.2389672-2-jakub@cloudflare.com
2020-06-08net/sysctl: use cpumask_parse in flow_limit_cpu_sysctlChristoph Hellwig1-1/+1
cpumask_parse_user works on __user pointers, so this is wrong now. Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Reported-by: build test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-05net: core: device_rename: Use rwsem instead of a seqcountAhmed S. Darwish1-22/+18
Sequence counters write paths are critical sections that must never be preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed. Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and netdev name retrieval.") handled a deadlock, observed with CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was infinitely spinning: it got scheduled after the seqcount write side blocked inside its own critical section. To fix that deadlock, among other issues, the commit added a cond_resched() inside the read side section. While this will get the non-preemptible kernel eventually unstuck, the seqcount reader is fully exhausting its slice just spinning -- until TIF_NEED_RESCHED is set. The fix is also still broken: if the seqcount reader belongs to a real-time scheduling policy, it can spin forever and the kernel will livelock. Disabling preemption over the seqcount write side critical section will not work: inside it are a number of GFP_KERNEL allocations and mutex locking through the drivers/base/ :: device_rename() call chain. >From all the above, replace the seqcount with a rwsem. Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.) Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount) Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name) Cc: <stable@vger.kernel.org> Reported-by: kbuild test robot <lkp@intel.com> [ v1 missing up_read() on error exit ] Reported-by: Dan Carpenter <dan.carpenter@oracle.com> [ v1 missing up_read() on error exit ] Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>