summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2021-11-25net/smc: Make sure the link_id is uniqueWen Gu1-1/+2
[ Upstream commit cf4f5530bb55ef7d5a91036b26676643b80b1616 ] The link_id is supposed to be unique, but smcr_next_link_id() doesn't skip the used link_id as expected. So the patch fixes this. Fixes: 026c381fb477 ("net/smc: introduce link_idx for link group array") Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25sock: fix /proc/net/sockstat underflow in sk_clone_lock()Tetsuo Handa1-3/+3
[ Upstream commit 938cca9e4109b30ee1d476904538225a825e54eb ] sk_clone_lock() needs to call sock_inuse_add(1) before entering the sk_free_unlock_clone() error path, for __sk_free() from sk_free() from sk_free_unlock_clone() calls sock_inuse_add(-1). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: 648845ab7e200993 ("sock: Move the socket inuse to namespace.") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25tipc: only accept encrypted MSG_CRYPTO msgsXin Long1-2/+5
[ Upstream commit 271351d255b09e39c7f6437738cba595f9b235be ] The MSG_CRYPTO msgs are always encrypted and sent to other nodes for keys' deployment. But when receiving in peers, if those nodes do not validate it and make sure it's encrypted, one could craft a malicious MSG_CRYPTO msg to deploy its key with no need to know other nodes' keys. This patch is to do that by checking TIPC_SKB_CB(skb)->decrypted and discard it if this packet never got decrypted. Note that this is also a supplementary fix to CVE-2021-43267 that can be triggered by an unencrypted malicious MSG_CRYPTO msg. Fixes: 1ef6f7c9390f ("tipc: add automatic session key exchange") Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25mac80211: fix monitor_sdata RCU/locking assertionsJohannes Berg3-8/+15
[ Upstream commit 6dd2360334f3cb3b45fc1b8194c670090474b87c ] Since commit a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver") we've not only been protecting the pointer to monitor_sdata with the RTNL, but also with the wiphy->mtx. This is relevant in a number of lockdep assertions, e.g. the one we hit in ieee80211_set_monitor_channel(). However, we're now protecting all the assignments/dereferences, even the one in interface iter, with the wiphy->mtx, so switch over the lockdep assertions to that lock. Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20211112135143.cb8e8ceffef3.Iaa210f16f6904c8a7a24954fb3396da0ef86ec08@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25nl80211: fix radio statistics in survey dumpJohannes Berg2-20/+20
[ Upstream commit ce6b69749961426c6d822215ded9e67154e1ad4f ] Even if userspace specifies the NL80211_ATTR_SURVEY_RADIO_STATS attribute, we cannot get the statistics because we're not really parsing the incoming attributes properly any more. Fix this by passing the attrbuf to nl80211_prepare_wdev_dump() and filling it there, if given, and using a local version only if no output is desired. Since I'm touching it anyway, make nl80211_prepare_wdev_dump() static. Fixes: 50508d941c18 ("cfg80211: use parallel_ops for genl") Reported-by: Jan Fuchs <jf@simonwunderlich.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Tested-by: Sven Eckelmann <sven@narfation.org> Link: https://lore.kernel.org/r/20211029092539.2851b4799386.If9736d4575ee79420cbec1bd930181e1d53c7317@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25tcp: Fix uninitialized access in skb frags array for Rx 0cp.Arjun Roy1-0/+3
[ Upstream commit 70701b83e208767f2720d8cd3e6a62cddafb3a30 ] TCP Receive zerocopy iterates through the SKB queue via tcp_recv_skb(), acquiring a pointer to an SKB and an offset within that SKB to read from. From there, it iterates the SKB frags array to determine which offset to start remapping pages from. However, this is built on the assumption that the offset read so far within the SKB is smaller than the SKB length. If this assumption is violated, we can attempt to read an invalid frags array element, which would cause a fault. tcp_recv_skb() can cause such an SKB to be returned when the TCP FIN flag is set. Therefore, we must guard against this occurrence inside skb_advance_frag(). One way that we can reproduce this error follows: 1) In a receiver program, call getsockopt(TCP_ZEROCOPY_RECEIVE) with: char some_array[32 * 1024]; struct tcp_zerocopy_receive zc = { .copybuf_address = (__u64) &some_array[0], .copybuf_len = 32 * 1024, }; 2) In a sender program, after a TCP handshake, send the following sequence of packets: i) Seq = [X, X+4000] ii) Seq = [X+4000, X+5000] iii) Seq = [X+4000, X+5000], Flags = FIN | URG, urgptr=1000 (This can happen without URG, if we have a signal pending, but URG is a convenient way to reproduce the behaviour). In this case, the following event sequence will occur on the receiver: tcp_zerocopy_receive(): -> receive_fallback_to_copy() // copybuf_len >= inq -> tcp_recvmsg_locked() // reads 5000 bytes, then breaks due to URG -> tcp_recv_skb() // yields skb with skb->len == offset -> tcp_zerocopy_set_hint_for_skb() -> skb_advance_to_frag() // will returns a frags ptr. >= nr_frags -> find_next_mappable_frag() // will dereference this bad frags ptr. With this patch, skb_advance_to_frag() will no longer return an invalid frags pointer, and will return NULL instead, fixing the issue. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 05255b823a61 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive") Link: https://lore.kernel.org/r/20211111235215.2605384-1-arjunroy.kdev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18SUNRPC: Partial revert of commit 6f9f17287e78Trond Myklebust1-13/+15
commit ea7a1019d8baf8503ecd6e3ec8436dec283569e6 upstream. The premise of commit 6f9f17287e78 ("SUNRPC: Mitigate cond_resched() in xprt_transmit()") was that cond_resched() is expensive and unnecessary when there has been just a single send. The point of cond_resched() is to ensure that tasks that should pre-empt this one get a chance to do so when it is safe to do so. The code prior to commit 6f9f17287e78 failed to take into account that it was keeping a rpc_task pinned for longer than it needed to, and so rather than doing a full revert, let's just move the cond_resched. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-189p/net: fix missing error check in p9_check_errorsDominique Martinet1-0/+2
commit 27eb4c3144f7a5ebef3c9a261d80cb3e1fa784dc upstream. Link: https://lkml.kernel.org/r/99338965-d36c-886e-cd0e-1d8fff2b4746@gmail.com Reported-by: syzbot+06472778c97ed94af66d@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18net, neigh: Enable state migration between NUD_PERMANENT and NTF_USEDaniel Borkmann1-9/+13
[ Upstream commit 3dc20f4762c62d3b3f0940644881ed818aa7b2f5 ] Currently, it is not possible to migrate a neighbor entry between NUD_PERMANENT state and NTF_USE flag with a dynamic NUD state from a user space control plane. Similarly, it is not possible to add/remove NTF_EXT_LEARNED flag from an existing neighbor entry in combination with NTF_USE flag. This is due to the latter directly calling into neigh_event_send() without any meta data updates as happening in __neigh_update(). Thus, to enable this use case, extend the latter with a NEIGH_UPDATE_F_USE flag where we break the NUD_PERMANENT state in particular so that a latter neigh_event_send() is able to re-resolve a neighbor entry. Before fix, NUD_PERMANENT -> NUD_* & NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] As can be seen, despite the admin-triggered replace, the entry remains in the NUD_PERMANENT state. After fix, NUD_PERMANENT -> NUD_* & NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE [...] # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn STALE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] After the fix, the admin-triggered replace switches to a dynamic state from the NTF_USE flag which triggered a new neighbor resolution. Likewise, we can transition back from there, if needed, into NUD_PERMANENT. Similar before/after behavior can be observed for below transitions: Before fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] After fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [..] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net/smc: fix sk_refcnt underflow on linkdown and fallbackDust Li1-7/+11
[ Upstream commit e5d5aadcf3cd59949316df49c27cb21788d7efe4 ] We got the following WARNING when running ab/nginx test with RDMA link flapping (up-down-up). The reason is when smc_sock fallback and at linkdown happens simultaneously, we may got the following situation: __smc_lgr_terminate() --> smc_conn_kill() --> smc_close_active_abort() smc_sock->sk_state = SMC_CLOSED sock_put(smc_sock) smc_sock was set to SMC_CLOSED and sock_put() been called when terminate the link group. But later application call close() on the socket, then we got: __smc_release(): if (smc_sock->fallback) smc_sock->sk_state = SMC_CLOSED sock_put(smc_sock) Again we set the smc_sock to CLOSED through it's already in CLOSED state, and double put the refcnt, so the following warning happens: refcount_t: underflow; use-after-free. WARNING: CPU: 5 PID: 860 at lib/refcount.c:28 refcount_warn_saturate+0x8d/0xf0 Modules linked in: CPU: 5 PID: 860 Comm: nginx Not tainted 5.10.46+ #403 Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014 RIP: 0010:refcount_warn_saturate+0x8d/0xf0 Code: 05 5c 1e b5 01 01 e8 52 25 bc ff 0f 0b c3 80 3d 4f 1e b5 01 00 75 ad 48 RSP: 0018:ffffc90000527e50 EFLAGS: 00010286 RAX: 0000000000000026 RBX: ffff8881300df2c0 RCX: 0000000000000027 RDX: 0000000000000000 RSI: ffff88813bd58040 RDI: ffff88813bd58048 RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000001 R10: ffff8881300df2c0 R11: ffffc90000527c78 R12: ffff8881300df340 R13: ffff8881300df930 R14: ffff88810b3dad80 R15: ffff8881300df4f8 FS: 00007f739de8fb80(0000) GS:ffff88813bd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000a01b008 CR3: 0000000111b64003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: smc_release+0x353/0x3f0 __sock_release+0x3d/0xb0 sock_close+0x11/0x20 __fput+0x93/0x230 task_work_run+0x65/0xa0 exit_to_user_mode_prepare+0xf9/0x100 syscall_exit_to_user_mode+0x27/0x190 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This patch adds check in __smc_release() to make sure we won't do an extra sock_put() and set the socket to CLOSED when its already in CLOSED state. Fixes: 51f1de79ad8e (net/smc: replace sock_put worker by socket refcounting) Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18vsock: prevent unnecessary refcnt inc for nonblocking connectEiichi Tsukata1-0/+2
[ Upstream commit c7cd82b90599fa10915f41e3dd9098a77d0aa7b6 ] Currently vosck_connect() increments sock refcount for nonblocking socket each time it's called, which can lead to memory leak if it's called multiple times because connect timeout function decrements sock refcount only once. Fixes it by making vsock_connect() return -EALREADY immediately when sock state is already SS_CONNECTING. Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net/sched: sch_taprio: fix undefined behavior in ktime_mono_to_anyEric Dumazet1-10/+17
[ Upstream commit 6dc25401cba4d428328eade8ceae717633fdd702 ] 1) if q->tk_offset == TK_OFFS_MAX, then get_tcp_tstamp() calls ktime_mono_to_any() with out-of-bound value. 2) if q->tk_offset is changed in taprio_parse_clockid(), taprio_get_time() might also call ktime_mono_to_any() with out-of-bound value as sysbot found: UBSAN: array-index-out-of-bounds in kernel/time/timekeeping.c:908:27 index 3 is out of range for type 'ktime_t *[3]' CPU: 1 PID: 25668 Comm: kworker/u4:0 Not tainted 5.15.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 ubsan_epilogue+0xb/0x5a lib/ubsan.c:151 __ubsan_handle_out_of_bounds.cold+0x62/0x6c lib/ubsan.c:291 ktime_mono_to_any+0x1d4/0x1e0 kernel/time/timekeeping.c:908 get_tcp_tstamp net/sched/sch_taprio.c:322 [inline] get_packet_txtime net/sched/sch_taprio.c:353 [inline] taprio_enqueue_one+0x5b0/0x1460 net/sched/sch_taprio.c:420 taprio_enqueue+0x3b1/0x730 net/sched/sch_taprio.c:485 dev_qdisc_enqueue+0x40/0x300 net/core/dev.c:3785 __dev_xmit_skb net/core/dev.c:3869 [inline] __dev_queue_xmit+0x1f6e/0x3630 net/core/dev.c:4194 batadv_send_skb_packet+0x4a9/0x5f0 net/batman-adv/send.c:108 batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:393 [inline] batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:421 [inline] batadv_iv_send_outstanding_bat_ogm_packet+0x6d7/0x8e0 net/batman-adv/bat_iv_ogm.c:1701 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298 worker_thread+0x658/0x11f0 kernel/workqueue.c:2445 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 Fixes: 7ede7b03484b ("taprio: make clock reference conversions easier") Fixes: 54002066100b ("taprio: Adjust timestamps for TCP packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vedang Patel <vedang.patel@intel.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Link: https://lore.kernel.org/r/20211108180815.1822479-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_regJussi Maki1-6/+30
[ Upstream commit b2c4618162ec615a15883a804cce7e27afecfa58 ] The current conversion of skb->data_end reads like this: ; data_end = (void*)(long)skb->data_end; 559: (79) r1 = *(u64 *)(r2 +200) ; r1 = skb->data 560: (61) r11 = *(u32 *)(r2 +112) ; r11 = skb->len 561: (0f) r1 += r11 562: (61) r11 = *(u32 *)(r2 +116) 563: (1f) r1 -= r11 But similar to the case in 84f44df664e9 ("bpf: sock_ops sk access may stomp registers when dst_reg = src_reg"), the code will read an incorrect skb->len when src == dst. In this case we end up generating this xlated code: ; data_end = (void*)(long)skb->data_end; 559: (79) r1 = *(u64 *)(r1 +200) ; r1 = skb->data 560: (61) r11 = *(u32 *)(r1 +112) ; r11 = (skb->data)->len 561: (0f) r1 += r11 562: (61) r11 = *(u32 *)(r1 +116) 563: (1f) r1 -= r11 ... where line 560 is the reading 4B of (skb->data + 112) instead of the intended skb->len Here the skb pointer in r1 gets set to skb->data and the later deref for skb->len ends up following skb->data instead of skb. This fixes the issue similarly to the patch mentioned above by creating an additional temporary variable and using to store the register when dst_reg = src_reg. We name the variable bpf_temp_reg and place it in the cb context for sk_skb. Then we restore from the temp to ensure nothing is lost. Fixes: 16137b09a66f2 ("bpf: Compute data_end dynamically with JIT code") Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20211103204736.248403-6-john.fastabend@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and collidingJohn Fastabend2-9/+23
[ Upstream commit e0dc3b93bd7bcff8c3813d1df43e0908499c7cf0 ] Strparser is reusing the qdisc_skb_cb struct to stash the skb message handling progress, e.g. offset and length of the skb. First this is poorly named and inherits a struct from qdisc that doesn't reflect the actual usage of cb[] at this layer. But, more importantly strparser is using the following to access its metadata. (struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data)) Where _strp_msg is defined as: struct _strp_msg { struct strp_msg strp; /* 0 8 */ int accum_len; /* 8 4 */ /* size: 12, cachelines: 1, members: 2 */ /* last cacheline: 12 bytes */ }; So we use 12 bytes of ->data[] in struct. However in BPF code running parser and verdict the user has read capabilities into the data[] array as well. Its not too problematic, but we should not be exposing internal state to BPF program. If its really needed then we can use the probe_read() APIs which allow reading kernel memory. And I don't believe cb[] layer poses any API breakage by moving this around because programs can't depend on cb[] across layers. In order to fix another issue with a ctx rewrite we need to stash a temp variable somewhere. To make this work cleanly this patch builds a cb struct for sk_skb types called sk_skb_cb struct. Then we can use this consistently in the strparser, sockmap space. Additionally we can start allowing ->cb[] write access after this. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jussi Maki <joamaki@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20211103204736.248403-5-john.fastabend@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18bpf, sockmap: Fix race in ingress receive verdict with redirect to selfJohn Fastabend1-0/+47
[ Upstream commit c5d2177a72a1659554922728fc407f59950aa929 ] A socket in a sockmap may have different combinations of programs attached depending on configuration. There can be no programs in which case the socket acts as a sink only. There can be a TX program in this case a BPF program is attached to sending side, but no RX program is attached. There can be an RX program only where sends have no BPF program attached, but receives are hooked with BPF. And finally, both TX and RX programs may be attached. Giving us the permutations: None, Tx, Rx, and TxRx To date most of our use cases have been TX case being used as a fast datapath to directly copy between local application and a userspace proxy. Or Rx cases and TxRX applications that are operating an in kernel based proxy. The traffic in the first case where we hook applications into a userspace application looks like this: AppA redirect AppB Tx <-----------> Rx | | + + TCP <--> lo <--> TCP In this case all traffic from AppA (after 3whs) is copied into the AppB ingress queue and no traffic is ever on the TCP recieive_queue. In the second case the application never receives, except in some rare error cases, traffic on the actual user space socket. Instead the send happens in the kernel. AppProxy socket pool sk0 ------------->{sk1,sk2, skn} ^ | | | | v ingress lb egress TCP TCP Here because traffic is never read off the socket with userspace recv() APIs there is only ever one reader on the sk receive_queue. Namely the BPF programs. However, we've started to introduce a third configuration where the BPF program on receive should process the data, but then the normal case is to push the data into the receive queue of AppB. AppB recv() (userspace) ----------------------- tcp_bpf_recvmsg() (kernel) | | | | | | ingress_msgQ | | | RX_BPF | | | v v sk->receive_queue This is different from the App{A,B} redirect because traffic is first received on the sk->receive_queue. Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg queue for any data handled by the BPF rx program and returned with PASS code so that it was enqueued on the ingress msg queue. Then if no data exists on that queue it checks the socket receive queue. Unfortunately, this is the same receive_queue the BPF program is reading data off of. So we get a race. Its possible for the recvmsg() hook to pull data off the receive_queue before the BPF hook has a chance to read it. It typically happens when an application is banging on recv() and getting EAGAINs. Until they manage to race with the RX BPF program. To fix this we note that before this patch at attach time when the socket is loaded into the map we check if it needs a TX program or just the base set of proto bpf hooks. Then it uses the above general RX hook regardless of if we have a BPF program attached at rx or not. This patch now extends this check to handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above race when an RX program is attached we use a new hook that is nearly identical to the old one except now we do not let the recv() call skip the RX BPF program. Now only the BPF program pulls data from sk->receive_queue and recv() only pulls data from the ingress msgQ post BPF program handling. With this resolved our AppB from above has been up and running for many hours without detecting any errors. We do this by correlating counters in RX BPF events and the AppB to ensure data is never skipping the BPF program. Selftests, was not able to detect this because we only run them for a short period of time on well ordered send/recvs so we don't get any of the noise we see in real application environments. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jussi Maki <joamaki@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18bpf, sockmap: Remove unhash handler for BPF sockmap usageJohn Fastabend1-1/+0
[ Upstream commit b8b8315e39ffaca82e79d86dde26e9144addf66b ] We do not need to handle unhash from BPF side we can simply wait for the close to happen. The original concern was a socket could transition from ESTABLISHED state to a new state while the BPF hook was still attached. But, we convinced ourself this is no longer possible and we also improved BPF sockmap to handle listen sockets so this is no longer a problem. More importantly though there are cases where unhash is called when data is in the receive queue. The BPF unhash logic will flush this data which is wrong. To be correct it should keep the data in the receive queue and allow a receiving application to continue reading the data. This may happen when tcp_abort() is received for example. Instead of complicating the logic in unhash simply moving all this to tcp_close() hook solves this. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jussi Maki <joamaki@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20211103204736.248403-3-john.fastabend@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net: dsa: felix: fix broken VLAN-tagged PTP under VLAN-aware bridgeVladimir Oltean1-0/+3
[ Upstream commit 92f62485b3715882cd397b0cbd80a96d179b86d6 ] Normally it is expected that the dsa_device_ops :: rcv() method finishes parsing the DSA tag and consumes it, then never looks at it again. But commit c0bcf537667c ("net: dsa: ocelot: add hardware timestamping support for Felix") added support for RX timestamping in a very unconventional way. On this switch, a partial timestamp is available in the DSA header, but the driver got away with not parsing that timestamp right away, but instead delayed that parsing for a little longer: dsa_switch_rcv(): nskb = cpu_dp->rcv(skb, dev); <------------- not here -> ocelot_rcv() ... skb = nskb; skb_push(skb, ETH_HLEN); skb->pkt_type = PACKET_HOST; skb->protocol = eth_type_trans(skb, skb->dev); ... if (dsa_skb_defer_rx_timestamp(p, skb)) <--- but here -> felix_rxtstamp() return 0; When in felix_rxtstamp(), this driver accounted for the fact that eth_type_trans() happened in the meanwhile, so it got a hold of the extraction header again by subtracting (ETH_HLEN + OCELOT_TAG_LEN) bytes from the current skb->data. This worked for quite some time but was quite fragile from the very beginning. Not to mention that having DSA tag parsing split in two different files, under different folders (net/dsa/tag_ocelot.c vs drivers/net/dsa/ocelot/felix.c) made it quite non-obvious for patches to come that they might break this. Finally, the blamed commit does the following: at the end of ocelot_rcv(), it checks whether the skb payload contains a VLAN header. If it does, and this port is under a VLAN-aware bridge, that VLAN ID might not be correct in the sense that the packet might have suffered VLAN rewriting due to TCAM rules (VCAP IS1). So we consume the VLAN ID from the skb payload using __skb_vlan_pop(), and take the classified VLAN ID from the DSA tag, and construct a hwaccel VLAN tag with the classified VLAN, and the skb payload is VLAN-untagged. The big problem is that __skb_vlan_pop() does: memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN); __skb_pull(skb, VLAN_HLEN); aka it moves the Ethernet header 4 bytes to the right, and pulls 4 bytes from the skb headroom (effectively also moving skb->data, by definition). So for felix_rxtstamp()'s fragile logic, all bets are off now. Instead of having the "extraction" pointer point to the DSA header, it actually points to 4 bytes _inside_ the extraction header. Corollary, the last 4 bytes of the "extraction" header are in fact 4 stale bytes of the destination MAC address from the Ethernet header, from prior to the __skb_vlan_pop() movement. So of course, RX timestamps are completely bogus when the system is configured in this way. The fix is actually very simple: just don't structure the code like that. For better or worse, the DSA PTP timestamping API does not offer a straightforward way for drivers to present their RX timestamps, but other drivers (sja1105) have established a simple mechanism to carry their RX timestamp from dsa_device_ops :: rcv() all the way to dsa_switch_ops :: port_rxtstamp() and even later. That mechanism is to simply save the partial timestamp to the skb->cb, and complete it later. Question: why don't we simply populate the skb's struct skb_shared_hwtstamps from ocelot_rcv(), and bother with this complication of propagating the timestamp to felix_rxtstamp()? Answer: dsa_switch_ops :: port_rxtstamp() answers the question whether PTP packets need sleepable context to retrieve the full RX timestamp. Currently felix_rxtstamp() answers "no, thanks" to that question, and calls ocelot_ptp_gettime64() from softirq atomic context. This is understandable, since Felix VSC9959 is a PCIe memory-mapped switch, so hardware access does not require sleeping. But the felix driver is preparing for the introduction of other switches where hardware access is over a slow bus like SPI or MDIO: https://lore.kernel.org/lkml/20210814025003.2449143-1-colin.foster@in-advantage.com/ So I would like to keep this code structure, so the rework needed when that driver will need PTP support will be minimal (answer "yes, I need deferred context for this skb's RX timestamp", then the partial timestamp will still be found in the skb->cb. Fixes: ea440cd2d9b2 ("net: dsa: tag_ocelot: use VLAN information from tagging header when available") Reported-by: Po Liu <po.liu@nxp.com> Cc: Yangbo Lu <yangbo.lu@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net: vlan: fix a UAF in vlan_dev_real_dev()Ziyang Xuan2-3/+3
[ Upstream commit 563bcbae3ba233c275c244bfce2efe12938f5363 ] The real_dev of a vlan net_device may be freed after unregister_vlan_dev(). Access the real_dev continually by vlan_dev_real_dev() will trigger the UAF problem for the real_dev like following: ================================================================== BUG: KASAN: use-after-free in vlan_dev_real_dev+0xf9/0x120 Call Trace: kasan_report.cold+0x83/0xdf vlan_dev_real_dev+0xf9/0x120 is_eth_port_of_netdev_filter.part.0+0xb1/0x2c0 is_eth_port_of_netdev_filter+0x28/0x40 ib_enum_roce_netdev+0x1a3/0x300 ib_enum_all_roce_netdevs+0xc7/0x140 netdevice_event_work_handler+0x9d/0x210 ... Freed by task 9288: kasan_save_stack+0x1b/0x40 kasan_set_track+0x1c/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0xfc/0x130 slab_free_freelist_hook+0xdd/0x240 kfree+0xe4/0x690 kvfree+0x42/0x50 device_release+0x9f/0x240 kobject_put+0x1c8/0x530 put_device+0x1b/0x30 free_netdev+0x370/0x540 ppp_destroy_interface+0x313/0x3d0 ... Move the put_device(real_dev) to vlan_dev_free(). Ensure real_dev not be freed before vlan_dev unregistered. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+e4df4e1389e28972e955@syzkaller.appspotmail.com Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18ethtool: fix ethtool msg len calculation for pause statsJakub Kicinski1-2/+1
[ Upstream commit 1aabe578dd86e9f2867c4db4fba9a15f4ba1825d ] ETHTOOL_A_PAUSE_STAT_MAX is the MAX attribute id, so we need to subtract non-stats and add one to get a count (IOW -2+1 == -1). Otherwise we'll see: ethnl cmd 21: calculated reply length 40, but consumed 52 Fixes: 9a27a33027f2 ("ethtool: add standard pause stats") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18netfilter: nfnetlink_queue: fix OOB when mac header was clearedFlorian Westphal1-1/+1
[ Upstream commit 5648b5e1169ff1d6d6a46c35c0b5fbebd2a5cbb2 ] On 64bit platforms the MAC header is set to 0xffff on allocation and also when a helper like skb_unset_mac_header() is called. dev_parse_header may call skb_mac_header() which assumes valid mac offset: BUG: KASAN: use-after-free in eth_header_parse+0x75/0x90 Read of size 6 at addr ffff8881075a5c05 by task nf-queue/1364 Call Trace: memcpy+0x20/0x60 eth_header_parse+0x75/0x90 __nfqnl_enqueue_packet+0x1a61/0x3380 __nf_queue+0x597/0x1300 nf_queue+0xf/0x40 nf_hook_slow+0xed/0x190 nf_hook+0x184/0x440 ip_output+0x1c0/0x2a0 nf_reinject+0x26f/0x700 nfqnl_recv_verdict+0xa16/0x18b0 nfnetlink_rcv_msg+0x506/0xe70 The existing code only works if the skb has a mac header. Fixes: 2c38de4c1f8da7 ("netfilter: fix looped (broad|multi)cast's MAC handling") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18nfsd: don't alloc under spinlock in rpc_parse_scope_idJ. Bruce Fields1-22/+18
[ Upstream commit 9b6e27d01adcec58e046c624874f8a124e8b07ec ] Dan Carpenter says: The patch d20c11d86d8f: "nfsd: Protect session creation and client confirm using client_lock" from Jul 30, 2014, leads to the following Smatch static checker warning: net/sunrpc/addr.c:178 rpc_parse_scope_id() warn: sleeping in atomic context Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: d20c11d86d8f ("nfsd: Protect session creation and client...") Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18skmsg: Lose offset info in sk_psock_skb_ingressLiu Jian1-10/+33
[ Upstream commit 7303524e04af49a47991e19f895c3b8cdc3796c7 ] If sockmap enable strparser, there are lose offset info in sk_psock_skb_ingress(). If the length determined by parse_msg function is not skb->len, the skb will be converted to sk_msg multiple times, and userspace app will get the data multiple times. Fix this by get the offset and length from strp_msg. And as Cong suggested, add one bit in skb->_sk_redir to distinguish enable or disable strparser. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Cong Wang <cong.wang@bytedance.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20211029141216.211899-1-liujian56@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18udp6: allow SO_MARK ctrl msg to affect routingJakub Kicinski1-1/+1
[ Upstream commit 42dcfd850e514b229d616a53dec06d0f2533217c ] Commit c6af0c227a22 ("ip: support SO_MARK cmsg") added propagation of SO_MARK from cmsg to skb->mark. For IPv4 and raw sockets the mark also affects route lookup, but in case of IPv6 the flow info is initialized before cmsg is parsed. Fixes: c6af0c227a22 ("ip: support SO_MARK cmsg") Reported-and-tested-by: Xintong Hu <huxintong@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net: bridge: fix uninitialized variables when BRIDGE_CFM is disabledIvan Vecera1-0/+2
[ Upstream commit 829e050eea69c7442441b714b6f5b339b5b8c367 ] Function br_get_link_af_size_filtered() calls br_cfm_{,peer}_mep_count() that return a count. When BRIDGE_CFM is not enabled these functions simply return -EOPNOTSUPP but do not modify count parameter and calling function then works with uninitialized variables. Modify these inline functions to return zero in count parameter. Fixes: b6d0425b816e ("bridge: cfm: Netlink Notifications.") Cc: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18sctp: return true only for pathmtu update in sctp_transport_pl_toobigXin Long1-3/+4
[ Upstream commit 75cf662c64dd8543f56c329c69eba18141c8fd9f ] sctp_transport_pl_toobig() supposes to return true only if there's pathmtu update, so that in sctp_icmp_frag_needed() it would call sctp_assoc_sync_pmtu() and sctp_retransmit(). This patch is to fix these return places in sctp_transport_pl_toobig(). Fixes: 836964083177 ("sctp: do state transition when receiving an icmp TOOBIG packet") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18sctp: allow IP fragmentation when PLPMTUD enters Error stateXin Long2-7/+10
[ Upstream commit 40171248bb8934537fec8fbaf718e57c8add187c ] Currently when PLPMTUD enters Error state, transport pathmtu will be set to MIN_PLPMTU(512) while probe is continuing with BASE_PLPMTU(1200). It will cause pathmtu to stay in a very small value, even if the real pmtu is some value like 1000. RFC8899 doesn't clearly say how to set the value in Error state. But one possibility could be keep using BASE_PLPMTU for the real pmtu, but allow to do IP fragmentation when it's in Error state. As it says in rfc8899#section-5.4: Some paths could be unable to sustain packets of the BASE_PLPMTU size. The Error State could be implemented to provide robustness to such paths. This allows fallback to a smaller than desired PLPMTU rather than suffer connectivity failure. This could utilize methods such as endpoint IP fragmentation to enable the PL sender to communicate using packets smaller than the BASE_PLPMTU. This patch is to set pmtu to BASE_PLPMTU instead of MIN_PLPMTU for Error state in sctp_transport_pl_send/toobig(), and set packet ipfragok for non-probe packets when it's in Error state. Fixes: 1dc68c194571 ("sctp: do state transition when PROBE_COUNT == MAX_PROBES on HB send path") Reported-by: Ying Xu <yinxu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18tcp: don't free a FIN sk_buff in tcp_remove_empty_skb()Jon Maxwell1-1/+1
[ Upstream commit cf12e6f9124629b18a6182deefc0315f0a73a199 ] v1: Implement a more general statement as recommended by Eric Dumazet. The sequence number will be advanced, so this check will fix the FIN case and other cases. A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that the write_queue was not empty as determined by tcp_write_queue_empty() but the sk_buff containing the FIN flag had been freed and the socket was zombied in that state. Corresponding pcaps show no FIN from the Linux kernel on the wire. Some instrumentation was added to the kernel and it was found that there is a timing window where tcp_sendmsg() can run after tcp_send_fin(). tcp_sendmsg() will hit an error, for example: 1269 ▹ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))↩ 1270 ▹ ▹ goto do_error;↩ tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent. If the other side sends a FIN packet the socket will transition to CLOSING and remain that way until the system is rebooted. Fix this by checking for the FIN flag in the sk_buff and don't free it if that is the case. Testing confirmed that fixed the issue. Fixes: fdfc5c8594c2 ("tcp: remove empty skb from write queue in error cases") Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Reported-by: Monir Zouaoui <Monir.Zouaoui@mail.schwarz> Reported-by: Simon Stier <simon.stier@mail.schwarz> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net: dsa: avoid refcount warnings when ->port_{fdb,mdb}_del returns errorVladimir Oltean1-2/+2
[ Upstream commit 232deb3f9567ce37d99b8616a6c07c1fc0436abf ] At present, when either of ds->ops->port_fdb_del() or ds->ops->port_mdb_del() return a non-zero error code, we attempt to save the day and keep the data structure associated with that switchdev object, as the deletion procedure did not complete. However, the way in which we do this is suspicious to the checker in lib/refcount.c, who thinks it is buggy to increment a refcount that became zero, and that this is indicative of a use-after-free. Fixes: 161ca59d39e9 ("net: dsa: reference count the MDB entries at the cross-chip notifier level") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net: stream: don't purge sk_error_queue in sk_stream_kill_queues()Jakub Kicinski1-3/+0
[ Upstream commit 24bcbe1cc69fa52dc4f7b5b2456678ed464724d8 ] sk_stream_kill_queues() can be called on close when there are still outstanding skbs to transmit. Those skbs may try to queue notifications to the error queue (e.g. timestamps). If sk_stream_kill_queues() purges the queue without taking its lock the queue may get corrupted, and skbs leaked. This shows up as a warning about an rmem leak: WARNING: CPU: 24 PID: 0 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x... The leak is always a multiple of 0x300 bytes (the value is in %rax on my builds, so RAX: 0000000000000300). 0x300 is truesize of an empty sk_buff. Indeed if we dump the socket state at the time of the warning the sk_error_queue is often (but not always) corrupted. The ->next pointer points back at the list head, but not the ->prev pointer. Indeed we can find the leaked skb by scanning the kernel memory for something that looks like an skb with ->sk = socket in question, and ->truesize = 0x300. The contents of ->cb[] of the skb confirms the suspicion that it is indeed a timestamp notification (as generated in __skb_complete_tx_timestamp()). Removing purging of sk_error_queue should be okay, since inet_sock_destruct() does it again once all socket refs are gone. Eric suggests this may cause sockets that go thru disconnect() to maintain notifications from the previous incarnations of the socket, but that should be okay since the race was there anyway, and disconnect() is not exactly dependable. Thanks to Jonathan Lemon and Omar Sandoval for help at various stages of tracing the issue. Fixes: cb9eff097831 ("net: new user space API for time stamping of incoming and outgoing packets") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18tcp: switch orphan_count to bare per-cpu countersEric Dumazet6-20/+42
[ Upstream commit 19757cebf0c5016a1f36f7fe9810a9f0b33c0832 ] Use of percpu_counter structure to track count of orphaned sockets is causing problems on modern hosts with 256 cpus or more. Stefan Bach reported a serious spinlock contention in real workloads, that I was able to reproduce with a netfilter rule dropping incoming FIN packets. 53.56% server [kernel.kallsyms] [k] queued_spin_lock_slowpath | ---queued_spin_lock_slowpath | --53.51%--_raw_spin_lock_irqsave | --53.51%--__percpu_counter_sum tcp_check_oom | |--39.03%--__tcp_close | tcp_close | inet_release | inet6_release | sock_close | __fput | ____fput | task_work_run | exit_to_usermode_loop | do_syscall_64 | entry_SYSCALL_64_after_hwframe | __GI___libc_close | --14.48%--tcp_out_of_resources tcp_write_timeout tcp_retransmit_timer tcp_write_timer_handler tcp_write_timer call_timer_fn expire_timers __run_timers run_timer_softirq __softirqentry_text_start As explained in commit cf86a086a180 ("net/dst: use a smaller percpu_counter batch for dst entries accounting"), default batch size is too big for the default value of tcp_max_orphans (262144). But even if we reduce batch sizes, there would still be cases where the estimated count of orphans is beyond the limit, and where tcp_too_many_orphans() has to call the expensive percpu_counter_sum_positive(). One solution is to use plain per-cpu counters, and have a timer to periodically refresh this cache. Updating this cache every 100ms seems about right, tcp pressure state is not radically changing over shorter periods. percpu_counter was nice 15 years ago while hosts had less than 16 cpus, not anymore by current standards. v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m, reported by kernel test robot <lkp@intel.com> Remove unused socket argument from tcp_too_many_orphans() Fixes: dd24c00191d5 ("net: Use a percpu_counter for orphan_count") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Stefan Bach <sfb@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net, neigh: Fix NTF_EXT_LEARNED in combination with NTF_USEDaniel Borkmann1-12/+14
[ Upstream commit e4400bbf5b15750e1b59bf4722d18d99be60c69f ] The NTF_EXT_LEARNED neigh flag is usually propagated back to user space upon dump of the neighbor table. However, when used in combination with NTF_USE flag this is not the case despite exempting the entry from the garbage collector. This results in inconsistent state since entries are typically marked in neigh->flags with NTF_EXT_LEARNED, but here they are not. Fix it by propagating the creation flag to ___neigh_create(). Before fix: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] After fix: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE [...] Fixes: 9ce33e46531d ("neighbour: support for NTF_EXT_LEARNED flag") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18netfilter: nft_dynset: relax superfluous check on set updatesPablo Neira Ayuso1-10/+1
[ Upstream commit 7b1394892de8d95748d05e3ee41e85edb4abbfa1 ] Relax this condition to make add and update commands idempotent for sets with no timeout. The eval function already checks if the set element timeout is available and updates it if the update command is used. Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18cfg80211: always free wiphy specific regdomainJohannes Berg1-0/+10
[ Upstream commit e53e9828a8d2c6545e01ff9711f1221f2fd199ce ] In the (somewhat unlikely) event that we allocate a wiphy, then add a regdomain to it, and then fail registration, we leak the regdomain. Fix this by just always freeing it at the end, in the normal cases we'll free (and NULL) it during wiphy_unregister(). This happened when the wiphy settings were bad, and since they can be controlled by userspace with hwsim, syzbot was able to find this issue. Reported-by: syzbot+1638e7c770eef6b6c0d0@syzkaller.appspotmail.com Fixes: 3e0c3ff36c4c ("cfg80211: allow multiple driver regulatory_hints()") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20210927131105.68b70cef4674.I4b9f0aa08c2af28555963b9fe3d34395bb72e0cc@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18mac80211: twt: don't use potentially unaligned pointerJohannes Berg1-3/+5
[ Upstream commit 7ff379ba2d4b7b205240e666601fe302207d73f8 ] Since we're pointing into a frame, the pointer to the twt_agrt->req_type struct member is potentially not aligned properly. Open-code le16p_replace_bits() to avoid passing an unaligned pointer. Reported-by: kernel test robot <lkp@intel.com> Fixes: f5a4c24e689f ("mac80211: introduce individual TWT support in AP mode") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20210927115124.e1208694f37b.Ie3de9bcc5dde5a79e3ac81f3185beafe4d214e57@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18mptcp: do not shrink snd_nxt when recoveringFlorian Westphal2-17/+34
[ Upstream commit 0d199e4363b482badcedba764e2aceab53a4a10a ] When recovering after a link failure, snd_nxt should not be set to a lower value. Else, update of snd_nxt is broken because: msk->snd_nxt += ret; (where ret is number of bytes sent) assumes that snd_nxt always moves forward. After reduction, its possible that snd_nxt update gets out of sync: dfrag we just sent might have had a data sequence number even past recovery_snd_nxt. This change factors the common msk state update to a helper and updates snd_nxt based on the current dfrag data sequence number. The conditional is required for the recovery phase where we may re-transmit old dfrags that are before current snd_nxt. After this change, snd_nxt only moves forward and covers all in-sequence data that was transmitted. recovery_snd_nxt is retained to detect when recovery has completed. Fixes: 1e1d9d6f119c5 ("mptcp: handle pending data on closed subflow") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18rxrpc: Fix _usecs_to_jiffies() by using usecs_to_jiffies()Jiasheng Jiang1-1/+1
[ Upstream commit acde891c243c1ed85b19d4d5042bdf00914f5739 ] Directly using _usecs_to_jiffies() might be unsafe, so it's better to use usecs_to_jiffies() instead. Because we can see that the result of _usecs_to_jiffies() could be larger than MAX_JIFFY_OFFSET values without the check of the input. Fixes: c410bf01933e ("Fix the excessive initial retransmission timeout") Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net: net_namespace: Fix undefined member in key_remove_domain()Yajun Deng1-0/+4
[ Upstream commit aed0826b0cf2e488900ab92193893e803d65c070 ] The key_domain member in struct net only exists if we define CONFIG_KEYS. So we should add the define when we used key_domain. Fixes: 9b242610514f ("keys: Network namespace domain tag") Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18Bluetooth: fix init and cleanup of sco_conn.timeout_workDesmond Cheong Zhi Xi1-5/+4
[ Upstream commit 49d8a5606428ca0962d09050a5af81461ff90fbb ] Before freeing struct sco_conn, all delayed timeout work should be cancelled. Otherwise, sco_sock_timeout could potentially use the sco_conn after it has been freed. Additionally, sco_conn.timeout_work should be initialized when the connection is allocated, not when the channel is added. This is because an sco_conn can create channels with multiple sockets over its lifetime, which happens if sockets are released but the connection isn't deleted. Fixes: ba316be1b6a0 ("Bluetooth: schedule SCO timeouts with delayed_work") Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18netfilter: conntrack: set on IPS_ASSURED if flows enters internal stream statePablo Neira Ayuso1-2/+5
[ Upstream commit b7b1d02fc43925a4d569ec221715db2dfa1ce4f5 ] The internal stream state sets the timeout to 120 seconds 2 seconds after the creation of the flow, attach this internal stream state to the IPS_ASSURED flag for consistent event reporting. Before this patch: [NEW] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 [UNREPLIED] src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] [DESTROY] udp 17 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] Note IPS_ASSURED for the flow not yet in the internal stream state. after this update: [NEW] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 [UNREPLIED] src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 120 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] [DESTROY] udp 17 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] Before this patch, short-lived UDP flows never entered IPS_ASSURED, so they were already candidate flow to be deleted by early_drop under stress. Before this patch, IPS_ASSURED is set on regardless the internal stream state, attach this internal stream state to IPS_ASSURED. packet #1 (original direction) enters NEW state packet #2 (reply direction) enters ESTABLISHED state, sets on IPS_SEEN_REPLY paclet #3 (any direction) sets on IPS_ASSURED (if 2 seconds since the creation has passed by). Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18inet: remove races in inet{6}_getname()Eric Dumazet2-17/+20
[ Upstream commit 9dfc685e0262d4c5e44e13302f89841fa75173ca ] syzbot reported data-races in inet_getname() multiple times, it is time we fix this instead of pretending applications should not trigger them. getsockname() and getpeername() are not really considered fast path. v2: added the missing BPF_CGROUP_RUN_SA_PROG() declaration needed when CONFIG_CGROUP_BPF=n, as reported by kernel test robot <lkp@intel.com> syzbot typical report: BUG: KCSAN: data-race in __inet_hash_connect / inet_getname write to 0xffff888136d66cf8 of 2 bytes by task 14374 on cpu 1: __inet_hash_connect+0x7ec/0x950 net/ipv4/inet_hashtables.c:831 inet_hash_connect+0x85/0x90 net/ipv4/inet_hashtables.c:853 tcp_v4_connect+0x782/0xbb0 net/ipv4/tcp_ipv4.c:275 __inet_stream_connect+0x156/0x6e0 net/ipv4/af_inet.c:664 inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:728 __sys_connect_file net/socket.c:1896 [inline] __sys_connect+0x254/0x290 net/socket.c:1913 __do_sys_connect net/socket.c:1923 [inline] __se_sys_connect net/socket.c:1920 [inline] __x64_sys_connect+0x3d/0x50 net/socket.c:1920 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888136d66cf8 of 2 bytes by task 14408 on cpu 0: inet_getname+0x11f/0x170 net/ipv4/af_inet.c:790 __sys_getsockname+0x11d/0x1b0 net/socket.c:1946 __do_sys_getsockname net/socket.c:1961 [inline] __se_sys_getsockname net/socket.c:1958 [inline] __x64_sys_getsockname+0x3e/0x50 net/socket.c:1958 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000 -> 0xdee0 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 14408 Comm: syz-executor.3 Not tainted 5.15.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211026213014.3026708-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18gre/sit: Don't generate link-local addr if addr_gen_mode is ↵Stephen Suryaputra1-0/+3
IN6_ADDR_GEN_MODE_NONE [ Upstream commit 61e18ce7348bfefb5688a8bcd4b4d6b37c0f9b2a ] When addr_gen_mode is set to IN6_ADDR_GEN_MODE_NONE, the link-local addr should not be generated. But it isn't the case for GRE (as well as GRE6) and SIT tunnels. Make it so that tunnels consider the addr_gen_mode, especially for IN6_ADDR_GEN_MODE_NONE. Do this in add_v4_addrs() to cover both GRE and SIT only if the addr scope is link. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Acked-by: Antonio Quartulli <a@unstable.cc> Link: https://lore.kernel.org/r/20211020200618.467342-1-ssuryaextr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net-sysfs: try not to restart the syscall if it will fail eventuallyAntoine Tenart1-0/+55
[ Upstream commit 146e5e733310379f51924111068f08a3af0db830 ] Due to deadlocks in the networking subsystem spotted 12 years ago[1], a workaround was put in place[2] to avoid taking the rtnl lock when it was not available and restarting the syscall (back to VFS, letting userspace spin). The following construction is found a lot in the net sysfs and sysctl code: if (!rtnl_trylock()) return restart_syscall(); This can be problematic when multiple userspace threads use such interfaces in a short period, making them to spin a lot. This happens for example when adding and moving virtual interfaces: userspace programs listening on events, such as systemd-udevd and NetworkManager, do trigger actions reading files in sysfs. It gets worse when a lot of virtual interfaces are created concurrently, say when creating containers at boot time. Returning early without hitting the above pattern when the syscall will fail eventually does make things better. While it is not a fix for the issue, it does ease things. [1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/ https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/ and https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/ [2] Rightfully, those deadlocks are *hard* to solve. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net: sched: update default qdisc visibility after Tx queue cnt changesJakub Kicinski4-0/+58
[ Upstream commit 1e080f17750d1083e8a32f7b350584ae1cd7ff20 ] mq / mqprio make the default child qdiscs visible. They only do so for the qdiscs which are within real_num_tx_queues when the device is registered. Depending on order of calls in the driver, or if user space changes config via ethtool -L the number of qdiscs visible under tc qdisc show will differ from the number of queues. This is confusing to users and potentially to system configuration scripts which try to make sure qdiscs have the right parameters. Add a new Qdisc_ops callback and make relevant qdiscs TTRT. Note that this uncovers the "shortcut" created by commit 1f27cde313d7 ("net: sched: use pfifo_fast for non real queues") The default child qdiscs beyond initial real_num_tx are always pfifo_fast, no matter what the sysfs setting is. Fixing this gets a little tricky because we'd need to keep a reference on whatever the default qdisc was at the time of creation. In practice this is likely an non-issue the qdiscs likely have to be configured to non-default settings, so whatever user space is doing such configuration can replace the pfifos... now that it will see them. Reported-by: Matthew Massey <matthewmassey@fb.com> Reviewed-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18Bluetooth: call sock_hold earlier in sco_conn_delDesmond Cheong Zhi Xi1-1/+2
[ Upstream commit f4712fa993f688d0a48e0c28728fcdeb88c1ea58 ] In sco_conn_del, conn->sk is read while holding on to the sco_conn.lock to avoid races with a socket that could be released concurrently. However, in between unlocking sco_conn.lock and calling sock_hold, it's possible for the socket to be freed, which would cause a use-after-free write when sock_hold is finally called. To fix this, the reference count of the socket should be increased while the sco_conn.lock is still held. Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18Bluetooth: fix use-after-free error in lock_sock_nested()Wang ShaoBo1-1/+9
[ Upstream commit 1bff51ea59a9afb67d2dd78518ab0582a54a472c ] use-after-free error in lock_sock_nested is reported: [ 179.140137][ T3731] ===================================================== [ 179.142675][ T3731] BUG: KMSAN: use-after-free in lock_sock_nested+0x280/0x2c0 [ 179.145494][ T3731] CPU: 4 PID: 3731 Comm: kworker/4:2 Not tainted 5.12.0-rc6+ #54 [ 179.148432][ T3731] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 179.151806][ T3731] Workqueue: events l2cap_chan_timeout [ 179.152730][ T3731] Call Trace: [ 179.153301][ T3731] dump_stack+0x24c/0x2e0 [ 179.154063][ T3731] kmsan_report+0xfb/0x1e0 [ 179.154855][ T3731] __msan_warning+0x5c/0xa0 [ 179.155579][ T3731] lock_sock_nested+0x280/0x2c0 [ 179.156436][ T3731] ? kmsan_get_metadata+0x116/0x180 [ 179.157257][ T3731] l2cap_sock_teardown_cb+0xb8/0x890 [ 179.158154][ T3731] ? __msan_metadata_ptr_for_load_8+0x10/0x20 [ 179.159141][ T3731] ? kmsan_get_metadata+0x116/0x180 [ 179.159994][ T3731] ? kmsan_get_shadow_origin_ptr+0x84/0xb0 [ 179.160959][ T3731] ? l2cap_sock_recv_cb+0x420/0x420 [ 179.161834][ T3731] l2cap_chan_del+0x3e1/0x1d50 [ 179.162608][ T3731] ? kmsan_get_metadata+0x116/0x180 [ 179.163435][ T3731] ? kmsan_get_shadow_origin_ptr+0x84/0xb0 [ 179.164406][ T3731] l2cap_chan_close+0xeea/0x1050 [ 179.165189][ T3731] ? kmsan_internal_unpoison_shadow+0x42/0x70 [ 179.166180][ T3731] l2cap_chan_timeout+0x1da/0x590 [ 179.167066][ T3731] ? __msan_metadata_ptr_for_load_8+0x10/0x20 [ 179.168023][ T3731] ? l2cap_chan_create+0x560/0x560 [ 179.168818][ T3731] process_one_work+0x121d/0x1ff0 [ 179.169598][ T3731] worker_thread+0x121b/0x2370 [ 179.170346][ T3731] kthread+0x4ef/0x610 [ 179.171010][ T3731] ? process_one_work+0x1ff0/0x1ff0 [ 179.171828][ T3731] ? kthread_blkcg+0x110/0x110 [ 179.172587][ T3731] ret_from_fork+0x1f/0x30 [ 179.173348][ T3731] [ 179.173752][ T3731] Uninit was created at: [ 179.174409][ T3731] kmsan_internal_poison_shadow+0x5c/0xf0 [ 179.175373][ T3731] kmsan_slab_free+0x76/0xc0 [ 179.176060][ T3731] kfree+0x3a5/0x1180 [ 179.176664][ T3731] __sk_destruct+0x8af/0xb80 [ 179.177375][ T3731] __sk_free+0x812/0x8c0 [ 179.178032][ T3731] sk_free+0x97/0x130 [ 179.178686][ T3731] l2cap_sock_release+0x3d5/0x4d0 [ 179.179457][ T3731] sock_close+0x150/0x450 [ 179.180117][ T3731] __fput+0x6bd/0xf00 [ 179.180787][ T3731] ____fput+0x37/0x40 [ 179.181481][ T3731] task_work_run+0x140/0x280 [ 179.182219][ T3731] do_exit+0xe51/0x3e60 [ 179.182930][ T3731] do_group_exit+0x20e/0x450 [ 179.183656][ T3731] get_signal+0x2dfb/0x38f0 [ 179.184344][ T3731] arch_do_signal_or_restart+0xaa/0xe10 [ 179.185266][ T3731] exit_to_user_mode_prepare+0x2d2/0x560 [ 179.186136][ T3731] syscall_exit_to_user_mode+0x35/0x60 [ 179.186984][ T3731] do_syscall_64+0xc5/0x140 [ 179.187681][ T3731] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 179.188604][ T3731] ===================================================== In our case, there are two Thread A and B: Context: Thread A: Context: Thread B: l2cap_chan_timeout() __se_sys_shutdown() l2cap_chan_close() l2cap_sock_shutdown() l2cap_chan_del() l2cap_chan_close() l2cap_sock_teardown_cb() l2cap_sock_teardown_cb() Once l2cap_sock_teardown_cb() excuted, this sock will be marked as SOCK_ZAPPED, and can be treated as killable in l2cap_sock_kill() if sock_orphan() has excuted, at this time we close sock through sock_close() which end to call l2cap_sock_kill() like Thread C: Context: Thread C: sock_close() l2cap_sock_release() sock_orphan() l2cap_sock_kill() #free sock if refcnt is 1 If C completed, Once A or B reaches l2cap_sock_teardown_cb() again, use-after-free happened. We should set chan->data to NULL if sock is destructed, for telling teardown operation is not allowed in l2cap_sock_teardown_cb(), and also we should avoid killing an already killed socket in l2cap_sock_close_cb(). Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18Bluetooth: sco: Fix lock_sock() blockage by memcpy_from_msg()Takashi Iwai1-8/+16
[ Upstream commit 99c23da0eed4fd20cae8243f2b51e10e66aa0951 ] The sco_send_frame() also takes lock_sock() during memcpy_from_msg() call that may be endlessly blocked by a task with userfaultd technique, and this will result in a hung task watchdog trigger. Just like the similar fix for hci_sock_sendmsg() in commit 92c685dc5de0 ("Bluetooth: reorganize functions..."), this patch moves the memcpy_from_msg() out of lock_sock() for addressing the hang. This should be the last piece for fixing CVE-2021-3640 after a few already queued fixes. Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18can: j1939: j1939_tp_cmd_recv(): check the dst address of TP.CM_BAMZhang Changzhong1-0/+5
commit 164051a6ab5445bd97f719f50b16db8b32174269 upstream. The TP.CM_BAM message must be sent to the global address [1], so add a check to drop TP.CM_BAM sent to a non-global address. Without this patch, the receiver will treat the following packets as normal RTS/CTS transport: 18EC0102#20090002FF002301 18EB0102#0100000000000000 18EB0102#020000FFFFFFFFFF [1] SAE-J1939-82 2015 A.3.3 Row 1. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Link: https://lore.kernel.org/all/1635431907-15617-4-git-send-email-zhangchangzhong@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18can: j1939: j1939_can_recv(): ignore messages with invalid source addressZhang Changzhong1-0/+7
commit a79305e156db3d24fcd8eb649cdb3c3b2350e5c2 upstream. According to SAE-J1939-82 2015 (A.3.6 Row 2), a receiver should never send TP.CM_CTS to the global address, so we can add a check in j1939_can_recv() to drop messages with invalid source address. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Link: https://lore.kernel.org/all/1635431907-15617-3-git-send-email-zhangchangzhong@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18can: j1939: j1939_tp_cmd_recv(): ignore abort message in the BAM transportZhang Changzhong1-0/+6
commit c0f49d98006f2db3333b917caac65bce2af9865c upstream. This patch prevents BAM transport from being closed by receiving abort message, as specified in SAE-J1939-82 2015 (A.3.3 Row 4). Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Link: https://lore.kernel.org/all/1635431907-15617-2-git-send-email-zhangchangzhong@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18mctp: handle the struct sockaddr_mctp padding fieldsEugene Syromiatnikov1-0/+13
commit 1e4b50f06d970d8da3474d2a0354450416710bda upstream. In order to have the padding fields actually usable in the future, there have to be checks that user space doesn't supply non-zero garbage there. It is also worth setting these padding fields to zero, unless it is known that they have been already zeroed. Cc: stable@vger.kernel.org # v5.15 Fixes: 5a20dd46b8b84593 ("mctp: Be explicit about struct sockaddr_mctp padding") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>