From f54eeae970f4dd4400d8ef3157788fbd3e2dd0e3 Mon Sep 17 00:00:00 2001 From: Lorenz Bauer Date: Tue, 22 Feb 2022 10:39:24 +0000 Subject: bpf: Remove Lorenz Bauer from L7 BPF maintainers I'm leaving my position at Cloudflare and therefore won't have the necessary time and insight to maintain the sockmap code. It's in more capable hands with Jakub anyways. Signed-off-by: Lorenz Bauer Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20220222103925.25802-2-lmb@cloudflare.com --- MAINTAINERS | 1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 2524b75763cd..766c27ca62ad 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10764,7 +10764,6 @@ L7 BPF FRAMEWORK M: John Fastabend M: Daniel Borkmann M: Jakub Sitnicki -M: Lorenz Bauer L: netdev@vger.kernel.org L: bpf@vger.kernel.org S: Maintained -- cgit v1.2.3 From 18b1ab7aa76bde181bdb1ab19a87fa9523c32f21 Mon Sep 17 00:00:00 2001 From: Magnus Karlsson Date: Mon, 28 Feb 2022 10:45:52 +0100 Subject: xsk: Fix race at socket teardown MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fix a race in the xsk socket teardown code that can lead to a NULL pointer dereference splat. The current xsk unbind code in xsk_unbind_dev() starts by setting xs->state to XSK_UNBOUND, sets xs->dev to NULL and then waits for any NAPI processing to terminate using synchronize_net(). After that, the release code starts to tear down the socket state and free allocated memory. BUG: kernel NULL pointer dereference, address: 00000000000000c0 PGD 8000000932469067 P4D 8000000932469067 PUD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 25 PID: 69132 Comm: grpcpp_sync_ser Tainted: G I 5.16.0+ #2 Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.2.10 03/09/2015 RIP: 0010:__xsk_sendmsg+0x2c/0x690 [...] RSP: 0018:ffffa2348bd13d50 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8d5fc632d258 RDX: 0000000000400000 RSI: ffffa2348bd13e10 RDI: ffff8d5fc5489800 RBP: ffffa2348bd13db0 R08: 0000000000000000 R09: 00007ffffffff000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d5fc5489800 R13: ffff8d5fcb0f5140 R14: ffff8d5fcb0f5140 R15: 0000000000000000 FS: 00007f991cff9400(0000) GS:ffff8d6f1f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c0 CR3: 0000000114888005 CR4: 00000000001706e0 Call Trace: ? aa_sk_perm+0x43/0x1b0 xsk_sendmsg+0xf0/0x110 sock_sendmsg+0x65/0x70 __sys_sendto+0x113/0x190 ? debug_smp_processor_id+0x17/0x20 ? fpregs_assert_state_consistent+0x23/0x50 ? exit_to_user_mode_prepare+0xa5/0x1d0 __x64_sys_sendto+0x29/0x30 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae There are two problems with the current code. First, setting xs->dev to NULL before waiting for all users to stop using the socket is not correct. The entry to the data plane functions xsk_poll(), xsk_sendmsg(), and xsk_recvmsg() are all guarded by a test that xs->state is in the state XSK_BOUND and if not, it returns right away. But one process might have passed this test but still have not gotten to the point in which it uses xs->dev in the code. In this interim, a second process executing xsk_unbind_dev() might have set xs->dev to NULL which will lead to a crash for the first process. The solution here is just to get rid of this NULL assignment since it is not used anymore. Before commit 42fddcc7c64b ("xsk: use state member for socket synchronization"), xs->dev was the gatekeeper to admit processes into the data plane functions, but it was replaced with the state variable xs->state in the aforementioned commit. The second problem is that synchronize_net() does not wait for any process in xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() to complete, which means that the state they rely on might be cleaned up prematurely. This can happen when the notifier gets called (at driver unload for example) as it uses xsk_unbind_dev(). Solve this by extending the RCU critical region from just the ndo_xsk_wakeup to the whole functions mentioned above, so that both the test of xs->state == XSK_BOUND and the last use of any member of xs is covered by the RCU critical section. This will guarantee that when synchronize_net() completes, there will be no processes left executing xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() and state can be cleaned up safely. Note that we need to drop the RCU lock for the skb xmit path as it uses functions that might sleep. Due to this, we have to retest the xs->state after we grab the mutex that protects the skb xmit code from, among a number of things, an xsk_unbind_dev() being executed from the notifier at the same time. Fixes: 42fddcc7c64b ("xsk: use state member for socket synchronization") Reported-by: Elza Mathew Signed-off-by: Magnus Karlsson Signed-off-by: Daniel Borkmann Acked-by: Björn Töpel Link: https://lore.kernel.org/bpf/20220228094552.10134-1-magnus.karlsson@gmail.com --- net/xdp/xsk.c | 69 +++++++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 50 insertions(+), 19 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 28ef3f4465ae..ac343cd8ff3f 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -403,18 +403,8 @@ EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch); static int xsk_wakeup(struct xdp_sock *xs, u8 flags) { struct net_device *dev = xs->dev; - int err; - - rcu_read_lock(); - err = dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id, flags); - rcu_read_unlock(); - - return err; -} -static int xsk_zc_xmit(struct xdp_sock *xs) -{ - return xsk_wakeup(xs, XDP_WAKEUP_TX); + return dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id, flags); } static void xsk_destruct_skb(struct sk_buff *skb) @@ -533,6 +523,12 @@ static int xsk_generic_xmit(struct sock *sk) mutex_lock(&xs->mutex); + /* Since we dropped the RCU read lock, the socket state might have changed. */ + if (unlikely(!xsk_is_bound(xs))) { + err = -ENXIO; + goto out; + } + if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; @@ -596,16 +592,26 @@ out: return err; } -static int __xsk_sendmsg(struct sock *sk) +static int xsk_xmit(struct sock *sk) { struct xdp_sock *xs = xdp_sk(sk); + int ret; if (unlikely(!(xs->dev->flags & IFF_UP))) return -ENETDOWN; if (unlikely(!xs->tx)) return -ENOBUFS; - return xs->zc ? xsk_zc_xmit(xs) : xsk_generic_xmit(sk); + if (xs->zc) + return xsk_wakeup(xs, XDP_WAKEUP_TX); + + /* Drop the RCU lock since the SKB path might sleep. */ + rcu_read_unlock(); + ret = xsk_generic_xmit(sk); + /* Reaquire RCU lock before going into common code. */ + rcu_read_lock(); + + return ret; } static bool xsk_no_wakeup(struct sock *sk) @@ -619,7 +625,7 @@ static bool xsk_no_wakeup(struct sock *sk) #endif } -static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len) +static int __xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len) { bool need_wait = !(m->msg_flags & MSG_DONTWAIT); struct sock *sk = sock->sk; @@ -639,11 +645,22 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len) pool = xs->pool; if (pool->cached_need_wakeup & XDP_WAKEUP_TX) - return __xsk_sendmsg(sk); + return xsk_xmit(sk); return 0; } -static int xsk_recvmsg(struct socket *sock, struct msghdr *m, size_t len, int flags) +static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len) +{ + int ret; + + rcu_read_lock(); + ret = __xsk_sendmsg(sock, m, total_len); + rcu_read_unlock(); + + return ret; +} + +static int __xsk_recvmsg(struct socket *sock, struct msghdr *m, size_t len, int flags) { bool need_wait = !(flags & MSG_DONTWAIT); struct sock *sk = sock->sk; @@ -669,6 +686,17 @@ static int xsk_recvmsg(struct socket *sock, struct msghdr *m, size_t len, int fl return 0; } +static int xsk_recvmsg(struct socket *sock, struct msghdr *m, size_t len, int flags) +{ + int ret; + + rcu_read_lock(); + ret = __xsk_recvmsg(sock, m, len, flags); + rcu_read_unlock(); + + return ret; +} + static __poll_t xsk_poll(struct file *file, struct socket *sock, struct poll_table_struct *wait) { @@ -679,8 +707,11 @@ static __poll_t xsk_poll(struct file *file, struct socket *sock, sock_poll_wait(file, sock, wait); - if (unlikely(!xsk_is_bound(xs))) + rcu_read_lock(); + if (unlikely(!xsk_is_bound(xs))) { + rcu_read_unlock(); return mask; + } pool = xs->pool; @@ -689,7 +720,7 @@ static __poll_t xsk_poll(struct file *file, struct socket *sock, xsk_wakeup(xs, pool->cached_need_wakeup); else /* Poll needs to drive Tx also in copy mode */ - __xsk_sendmsg(sk); + xsk_xmit(sk); } if (xs->rx && !xskq_prod_is_empty(xs->rx)) @@ -697,6 +728,7 @@ static __poll_t xsk_poll(struct file *file, struct socket *sock, if (xs->tx && xsk_tx_writeable(xs)) mask |= EPOLLOUT | EPOLLWRNORM; + rcu_read_unlock(); return mask; } @@ -728,7 +760,6 @@ static void xsk_unbind_dev(struct xdp_sock *xs) /* Wait for driver to stop using the xdp socket. */ xp_del_xsk(xs->pool, xs); - xs->dev = NULL; synchronize_net(); dev_put(dev); } -- cgit v1.2.3 From 0492d857636e1c52cd71594a723c4b26a7b31978 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Wed, 16 Mar 2022 11:19:43 +0100 Subject: netfilter: flowtable: Fix QinQ and pppoe support for inet table nf_flow_offload_inet_hook() does not check for 802.1q and PPPoE. Fetch inner ethertype from these encapsulation protocols. Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support") Fixes: 4cd91f7c290f ("netfilter: flowtable: add vlan support") Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_flow_table.h | 18 ++++++++++++++++++ net/netfilter/nf_flow_table_inet.c | 17 +++++++++++++++++ net/netfilter/nf_flow_table_ip.c | 18 ------------------ 3 files changed, 35 insertions(+), 18 deletions(-) diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h index bd59e950f4d6..64daafd1fc41 100644 --- a/include/net/netfilter/nf_flow_table.h +++ b/include/net/netfilter/nf_flow_table.h @@ -10,6 +10,8 @@ #include #include #include +#include +#include struct nf_flowtable; struct nf_flow_rule; @@ -317,4 +319,20 @@ int nf_flow_rule_route_ipv6(struct net *net, const struct flow_offload *flow, int nf_flow_table_offload_init(void); void nf_flow_table_offload_exit(void); +static inline __be16 nf_flow_pppoe_proto(const struct sk_buff *skb) +{ + __be16 proto; + + proto = *((__be16 *)(skb_mac_header(skb) + ETH_HLEN + + sizeof(struct pppoe_hdr))); + switch (proto) { + case htons(PPP_IP): + return htons(ETH_P_IP); + case htons(PPP_IPV6): + return htons(ETH_P_IPV6); + } + + return 0; +} + #endif /* _NF_FLOW_TABLE_H */ diff --git a/net/netfilter/nf_flow_table_inet.c b/net/netfilter/nf_flow_table_inet.c index 5c57ade6bd05..0ccabf3fa6aa 100644 --- a/net/netfilter/nf_flow_table_inet.c +++ b/net/netfilter/nf_flow_table_inet.c @@ -6,12 +6,29 @@ #include #include #include +#include static unsigned int nf_flow_offload_inet_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state) { + struct vlan_ethhdr *veth; + __be16 proto; + switch (skb->protocol) { + case htons(ETH_P_8021Q): + veth = (struct vlan_ethhdr *)skb_mac_header(skb); + proto = veth->h_vlan_encapsulated_proto; + break; + case htons(ETH_P_PPP_SES): + proto = nf_flow_pppoe_proto(skb); + break; + default: + proto = skb->protocol; + break; + } + + switch (proto) { case htons(ETH_P_IP): return nf_flow_offload_ip_hook(priv, skb, state); case htons(ETH_P_IPV6): diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c index 889cf88d3dba..6257d87c3a56 100644 --- a/net/netfilter/nf_flow_table_ip.c +++ b/net/netfilter/nf_flow_table_ip.c @@ -8,8 +8,6 @@ #include #include #include -#include -#include #include #include #include @@ -239,22 +237,6 @@ static unsigned int nf_flow_xmit_xfrm(struct sk_buff *skb, return NF_STOLEN; } -static inline __be16 nf_flow_pppoe_proto(const struct sk_buff *skb) -{ - __be16 proto; - - proto = *((__be16 *)(skb_mac_header(skb) + ETH_HLEN + - sizeof(struct pppoe_hdr))); - switch (proto) { - case htons(PPP_IP): - return htons(ETH_P_IP); - case htons(PPP_IPV6): - return htons(ETH_P_IPV6); - } - - return 0; -} - static bool nf_flow_skb_encap_protocol(const struct sk_buff *skb, __be16 proto, u32 *offset) { -- cgit v1.2.3 From 6e1acfa387b9ff82cfc7db8cc3b6959221a95851 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Thu, 17 Mar 2022 11:59:26 +0100 Subject: netfilter: nf_tables: validate registers coming from userspace. Bail out in case userspace uses unsupported registers. Fixes: 49499c3e6e18 ("netfilter: nf_tables: switch registers to 32 bit addressing") Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_tables_api.c | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index d71a33ae39b3..1f5a0eece0d1 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -9275,17 +9275,23 @@ int nft_parse_u32_check(const struct nlattr *attr, int max, u32 *dest) } EXPORT_SYMBOL_GPL(nft_parse_u32_check); -static unsigned int nft_parse_register(const struct nlattr *attr) +static unsigned int nft_parse_register(const struct nlattr *attr, u32 *preg) { unsigned int reg; reg = ntohl(nla_get_be32(attr)); switch (reg) { case NFT_REG_VERDICT...NFT_REG_4: - return reg * NFT_REG_SIZE / NFT_REG32_SIZE; + *preg = reg * NFT_REG_SIZE / NFT_REG32_SIZE; + break; + case NFT_REG32_00...NFT_REG32_15: + *preg = reg + NFT_REG_SIZE / NFT_REG32_SIZE - NFT_REG32_00; + break; default: - return reg + NFT_REG_SIZE / NFT_REG32_SIZE - NFT_REG32_00; + return -ERANGE; } + + return 0; } /** @@ -9327,7 +9333,10 @@ int nft_parse_register_load(const struct nlattr *attr, u8 *sreg, u32 len) u32 reg; int err; - reg = nft_parse_register(attr); + err = nft_parse_register(attr, ®); + if (err < 0) + return err; + err = nft_validate_register_load(reg, len); if (err < 0) return err; @@ -9382,7 +9391,10 @@ int nft_parse_register_store(const struct nft_ctx *ctx, int err; u32 reg; - reg = nft_parse_register(attr); + err = nft_parse_register(attr, ®); + if (err < 0) + return err; + err = nft_validate_register_store(ctx, reg, data, type, len); if (err < 0) return err; -- cgit v1.2.3 From 4c905f6740a365464e91467aa50916555b28213d Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Thu, 17 Mar 2022 12:04:42 +0100 Subject: netfilter: nf_tables: initialize registers in nft_do_chain() Initialize registers to avoid stack leak into userspace. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_tables_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c index 36e73f9828c5..8af98239655d 100644 --- a/net/netfilter/nf_tables_core.c +++ b/net/netfilter/nf_tables_core.c @@ -201,7 +201,7 @@ nft_do_chain(struct nft_pktinfo *pkt, void *priv) const struct nft_rule_dp *rule, *last_rule; const struct net *net = nft_net(pkt); const struct nft_expr *expr, *last; - struct nft_regs regs; + struct nft_regs regs = {}; unsigned int stackptr = 0; struct nft_jumpstack jumpstack[NFT_JUMP_STACK_SIZE]; bool genbit = READ_ONCE(net->nft.gencursor); -- cgit v1.2.3 From 4219196d1f662cb10a462eb9e076633a3fc31a15 Mon Sep 17 00:00:00 2001 From: Sukadev Bhattiprolu Date: Wed, 16 Mar 2022 18:12:31 -0700 Subject: ibmvnic: fix race between xmit and reset There is a race between reset and the transmit paths that can lead to ibmvnic_xmit() accessing an scrq after it has been freed in the reset path. It can result in a crash like: Kernel attempted to read user page (0) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on read at 0x00000000 Faulting instruction address: 0xc0080000016189f8 Oops: Kernel access of bad area, sig: 11 [#1] ... NIP [c0080000016189f8] ibmvnic_xmit+0x60/0xb60 [ibmvnic] LR [c000000000c0046c] dev_hard_start_xmit+0x11c/0x280 Call Trace: [c008000001618f08] ibmvnic_xmit+0x570/0xb60 [ibmvnic] (unreliable) [c000000000c0046c] dev_hard_start_xmit+0x11c/0x280 [c000000000c9cfcc] sch_direct_xmit+0xec/0x330 [c000000000bfe640] __dev_xmit_skb+0x3a0/0x9d0 [c000000000c00ad4] __dev_queue_xmit+0x394/0x730 [c008000002db813c] __bond_start_xmit+0x254/0x450 [bonding] [c008000002db8378] bond_start_xmit+0x40/0xc0 [bonding] [c000000000c0046c] dev_hard_start_xmit+0x11c/0x280 [c000000000c00ca4] __dev_queue_xmit+0x564/0x730 [c000000000cf97e0] neigh_hh_output+0xd0/0x180 [c000000000cfa69c] ip_finish_output2+0x31c/0x5c0 [c000000000cfd244] __ip_queue_xmit+0x194/0x4f0 [c000000000d2a3c4] __tcp_transmit_skb+0x434/0x9b0 [c000000000d2d1e0] __tcp_retransmit_skb+0x1d0/0x6a0 [c000000000d2d984] tcp_retransmit_skb+0x34/0x130 [c000000000d310e8] tcp_retransmit_timer+0x388/0x6d0 [c000000000d315ec] tcp_write_timer_handler+0x1bc/0x330 [c000000000d317bc] tcp_write_timer+0x5c/0x200 [c000000000243270] call_timer_fn+0x50/0x1c0 [c000000000243704] __run_timers.part.0+0x324/0x460 [c000000000243894] run_timer_softirq+0x54/0xa0 [c000000000ea713c] __do_softirq+0x15c/0x3e0 [c000000000166258] __irq_exit_rcu+0x158/0x190 [c000000000166420] irq_exit+0x20/0x40 [c00000000002853c] timer_interrupt+0x14c/0x2b0 [c000000000009a00] decrementer_common_virt+0x210/0x220 --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c The immediate cause of the crash is the access of tx_scrq in the following snippet during a reset, where the tx_scrq can be either NULL or an address that will soon be invalid: ibmvnic_xmit() { ... tx_scrq = adapter->tx_scrq[queue_num]; txq = netdev_get_tx_queue(netdev, queue_num); ind_bufp = &tx_scrq->ind_buf; if (test_bit(0, &adapter->resetting)) { ... } But beyond that, the call to ibmvnic_xmit() itself is not safe during a reset and the reset path attempts to avoid this by stopping the queue in ibmvnic_cleanup(). However just after the queue was stopped, an in-flight ibmvnic_complete_tx() could have restarted the queue even as the reset is progressing. Since the queue was restarted we could get a call to ibmvnic_xmit() which can then access the bad tx_scrq (or other fields). We cannot however simply have ibmvnic_complete_tx() check the ->resetting bit and skip starting the queue. This can race at the "back-end" of a good reset which just restarted the queue but has not cleared the ->resetting bit yet. If we skip restarting the queue due to ->resetting being true, the queue would remain stopped indefinitely potentially leading to transmit timeouts. IOW ->resetting is too broad for this purpose. Instead use a new flag that indicates whether or not the queues are active. Only the open/ reset paths control when the queues are active. ibmvnic_complete_tx() and others wake up the queue only if the queue is marked active. So we will have: A. reset/open thread in ibmvnic_cleanup() and __ibmvnic_open() ->resetting = true ->tx_queues_active = false disable tx queues ... ->tx_queues_active = true start tx queues B. Tx interrupt in ibmvnic_complete_tx(): if (->tx_queues_active) netif_wake_subqueue(); To ensure that ->tx_queues_active and state of the queues are consistent, we need a lock which: - must also be taken in the interrupt path (ibmvnic_complete_tx()) - shared across the multiple queues in the adapter (so they don't become serialized) Use rcu_read_lock() and have the reset thread synchronize_rcu() after updating the ->tx_queues_active state. While here, consolidate a few boolean fields in ibmvnic_adapter for better alignment. Based on discussions with Brian King and Dany Madden. Fixes: 7ed5b31f4a66 ("net/ibmvnic: prevent more than one thread from running in reset") Reported-by: Vaishnavi Bhat Signed-off-by: Sukadev Bhattiprolu Signed-off-by: David S. Miller --- drivers/net/ethernet/ibm/ibmvnic.c | 63 ++++++++++++++++++++++++++++++-------- drivers/net/ethernet/ibm/ibmvnic.h | 7 +++-- 2 files changed, 55 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index b423e94956f1..b4804ce63151 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -1429,6 +1429,15 @@ static int __ibmvnic_open(struct net_device *netdev) return rc; } + adapter->tx_queues_active = true; + + /* Since queues were stopped until now, there shouldn't be any + * one in ibmvnic_complete_tx() or ibmvnic_xmit() so maybe we + * don't need the synchronize_rcu()? Leaving it for consistency + * with setting ->tx_queues_active = false. + */ + synchronize_rcu(); + netif_tx_start_all_queues(netdev); if (prev_state == VNIC_CLOSED) { @@ -1603,6 +1612,14 @@ static void ibmvnic_cleanup(struct net_device *netdev) struct ibmvnic_adapter *adapter = netdev_priv(netdev); /* ensure that transmissions are stopped if called by do_reset */ + + adapter->tx_queues_active = false; + + /* Ensure complete_tx() and ibmvnic_xmit() see ->tx_queues_active + * update so they don't restart a queue after we stop it below. + */ + synchronize_rcu(); + if (test_bit(0, &adapter->resetting)) netif_tx_disable(netdev); else @@ -1842,14 +1859,21 @@ static void ibmvnic_tx_scrq_clean_buffer(struct ibmvnic_adapter *adapter, tx_buff->skb = NULL; adapter->netdev->stats.tx_dropped++; } + ind_bufp->index = 0; + if (atomic_sub_return(entries, &tx_scrq->used) <= (adapter->req_tx_entries_per_subcrq / 2) && - __netif_subqueue_stopped(adapter->netdev, queue_num) && - !test_bit(0, &adapter->resetting)) { - netif_wake_subqueue(adapter->netdev, queue_num); - netdev_dbg(adapter->netdev, "Started queue %d\n", - queue_num); + __netif_subqueue_stopped(adapter->netdev, queue_num)) { + rcu_read_lock(); + + if (adapter->tx_queues_active) { + netif_wake_subqueue(adapter->netdev, queue_num); + netdev_dbg(adapter->netdev, "Started queue %d\n", + queue_num); + } + + rcu_read_unlock(); } } @@ -1904,11 +1928,12 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev) int index = 0; u8 proto = 0; - tx_scrq = adapter->tx_scrq[queue_num]; - txq = netdev_get_tx_queue(netdev, queue_num); - ind_bufp = &tx_scrq->ind_buf; - - if (test_bit(0, &adapter->resetting)) { + /* If a reset is in progress, drop the packet since + * the scrqs may get torn down. Otherwise use the + * rcu to ensure reset waits for us to complete. + */ + rcu_read_lock(); + if (!adapter->tx_queues_active) { dev_kfree_skb_any(skb); tx_send_failed++; @@ -1917,6 +1942,10 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev) goto out; } + tx_scrq = adapter->tx_scrq[queue_num]; + txq = netdev_get_tx_queue(netdev, queue_num); + ind_bufp = &tx_scrq->ind_buf; + if (ibmvnic_xmit_workarounds(skb, netdev)) { tx_dropped++; tx_send_failed++; @@ -1924,6 +1953,7 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev) ibmvnic_tx_scrq_flush(adapter, tx_scrq); goto out; } + if (skb_is_gso(skb)) tx_pool = &adapter->tso_pool[queue_num]; else @@ -2078,6 +2108,7 @@ tx_err: netif_carrier_off(netdev); } out: + rcu_read_unlock(); netdev->stats.tx_dropped += tx_dropped; netdev->stats.tx_bytes += tx_bytes; netdev->stats.tx_packets += tx_packets; @@ -3732,9 +3763,15 @@ restart_loop: (adapter->req_tx_entries_per_subcrq / 2) && __netif_subqueue_stopped(adapter->netdev, scrq->pool_index)) { - netif_wake_subqueue(adapter->netdev, scrq->pool_index); - netdev_dbg(adapter->netdev, "Started queue %d\n", - scrq->pool_index); + rcu_read_lock(); + if (adapter->tx_queues_active) { + netif_wake_subqueue(adapter->netdev, + scrq->pool_index); + netdev_dbg(adapter->netdev, + "Started queue %d\n", + scrq->pool_index); + } + rcu_read_unlock(); } } diff --git a/drivers/net/ethernet/ibm/ibmvnic.h b/drivers/net/ethernet/ibm/ibmvnic.h index fa2d607a7b1b..8f5cefb932dd 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.h +++ b/drivers/net/ethernet/ibm/ibmvnic.h @@ -1006,11 +1006,14 @@ struct ibmvnic_adapter { struct work_struct ibmvnic_reset; struct delayed_work ibmvnic_delayed_reset; unsigned long resetting; - bool napi_enabled, from_passive_init; - bool login_pending; /* last device reset time */ unsigned long last_reset_time; + bool napi_enabled; + bool from_passive_init; + bool login_pending; + /* protected by rcu */ + bool tx_queues_active; bool failover_pending; bool force_reset_recovery; -- cgit v1.2.3 From e82025c623e2bf04d162bafceb66a59115814479 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Thu, 17 Mar 2022 12:08:08 +0900 Subject: af_unix: Fix some data-races around unix_sk(sk)->oob_skb. Out-of-band data automatically places a "mark" showing wherein the sequence the out-of-band data would have been. If the out-of-band data implies cancelling everything sent so far, the "mark" is helpful to flush them. When the socket's read pointer reaches the "mark", the ioctl() below sets a non zero value to the arg `atmark`: The out-of-band data is queued in sk->sk_receive_queue as well as ordinary data and also saved in unix_sk(sk)->oob_skb. It can be used to test if the head of the receive queue is the out-of-band data meaning the socket is at the "mark". While testing that, unix_ioctl() reads unix_sk(sk)->oob_skb locklessly. Thus, all accesses to oob_skb need some basic protection to avoid load/store tearing which KCSAN detects when these are called concurrently: - ioctl(fd_a, SIOCATMARK, &atmark, sizeof(atmark)) - send(fd_b_connected_to_a, buf, sizeof(buf), MSG_OOB) BUG: KCSAN: data-race in unix_ioctl / unix_stream_sendmsg write to 0xffff888003d9cff0 of 8 bytes by task 175 on cpu 1: unix_stream_sendmsg (net/unix/af_unix.c:2087 net/unix/af_unix.c:2191) sock_sendmsg (net/socket.c:705 net/socket.c:725) __sys_sendto (net/socket.c:2040) __x64_sys_sendto (net/socket.c:2048) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) read to 0xffff888003d9cff0 of 8 bytes by task 176 on cpu 0: unix_ioctl (net/unix/af_unix.c:3101 (discriminator 1)) sock_do_ioctl (net/socket.c:1128) sock_ioctl (net/socket.c:1242) __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) value changed: 0xffff888003da0c00 -> 0xffff888003da0d00 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 176 Comm: unix_race_oob_i Not tainted 5.17.0-rc5-59529-g83dc4c2af682 #12 Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014 Fixes: 314001f0bf92 ("af_unix: Add OOB support") Signed-off-by: Kuniyuki Iwashima Signed-off-by: David S. Miller --- net/unix/af_unix.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index c19569819866..0c37e5595aae 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -2084,7 +2084,7 @@ static int queue_oob(struct socket *sock, struct msghdr *msg, struct sock *other if (ousk->oob_skb) consume_skb(ousk->oob_skb); - ousk->oob_skb = skb; + WRITE_ONCE(ousk->oob_skb, skb); scm_stat_add(other, skb); skb_queue_tail(&other->sk_receive_queue, skb); @@ -2602,9 +2602,8 @@ static int unix_stream_recv_urg(struct unix_stream_read_state *state) oob_skb = u->oob_skb; - if (!(state->flags & MSG_PEEK)) { - u->oob_skb = NULL; - } + if (!(state->flags & MSG_PEEK)) + WRITE_ONCE(u->oob_skb, NULL); unix_state_unlock(sk); @@ -2639,7 +2638,7 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk, skb = NULL; } else if (sock_flag(sk, SOCK_URGINLINE)) { if (!(flags & MSG_PEEK)) { - u->oob_skb = NULL; + WRITE_ONCE(u->oob_skb, NULL); consume_skb(skb); } } else if (!(flags & MSG_PEEK)) { @@ -3094,11 +3093,10 @@ static int unix_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) case SIOCATMARK: { struct sk_buff *skb; - struct unix_sock *u = unix_sk(sk); int answ = 0; skb = skb_peek(&sk->sk_receive_queue); - if (skb && skb == u->oob_skb) + if (skb && skb == READ_ONCE(unix_sk(sk)->oob_skb)) answ = 1; err = put_user(answ, (int __user *)arg); } -- cgit v1.2.3 From d9a232d435dcc966738b0f414a86f7edf4f4c8c4 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Thu, 17 Mar 2022 12:08:09 +0900 Subject: af_unix: Support POLLPRI for OOB. The commit 314001f0bf92 ("af_unix: Add OOB support") introduced OOB for AF_UNIX, but it lacks some changes for POLLPRI. Let's add the missing piece. In the selftest, normal datagrams are sent followed by OOB data, so this commit replaces `POLLIN | POLLPRI` with just `POLLPRI` in the first test case. Fixes: 314001f0bf92 ("af_unix: Add OOB support") Signed-off-by: Kuniyuki Iwashima Signed-off-by: David S. Miller --- net/unix/af_unix.c | 4 ++++ tools/testing/selftests/net/af_unix/test_unix_oob.c | 6 +++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 0c37e5595aae..1e7ed5829ed5 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -3137,6 +3137,10 @@ static __poll_t unix_poll(struct file *file, struct socket *sock, poll_table *wa mask |= EPOLLIN | EPOLLRDNORM; if (sk_is_readable(sk)) mask |= EPOLLIN | EPOLLRDNORM; +#if IS_ENABLED(CONFIG_AF_UNIX_OOB) + if (READ_ONCE(unix_sk(sk)->oob_skb)) + mask |= EPOLLPRI; +#endif /* Connection-based need to check for termination and startup */ if ((sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) && diff --git a/tools/testing/selftests/net/af_unix/test_unix_oob.c b/tools/testing/selftests/net/af_unix/test_unix_oob.c index 3dece8b29253..b57e91e1c3f2 100644 --- a/tools/testing/selftests/net/af_unix/test_unix_oob.c +++ b/tools/testing/selftests/net/af_unix/test_unix_oob.c @@ -218,10 +218,10 @@ main(int argc, char **argv) /* Test 1: * veriyf that SIGURG is - * delivered and 63 bytes are - * read and oob is '@' + * delivered, 63 bytes are + * read, oob is '@', and POLLPRI works. */ - wait_for_data(pfd, POLLIN | POLLPRI); + wait_for_data(pfd, POLLPRI); read_oob(pfd, &oob); len = read_data(pfd, buf, 1024); if (!signal_recvd || len != 63 || oob != '@') { -- cgit v1.2.3 From 544b4dd568e3b09c1ab38a759d3187e7abda11a0 Mon Sep 17 00:00:00 2001 From: Guillaume Nault Date: Thu, 17 Mar 2022 13:45:09 +0100 Subject: ipv4: Fix route lookups when handling ICMP redirects and PMTU updates The PMTU update and ICMP redirect helper functions initialise their fl4 variable with either __build_flow_key() or build_sk_flow_key(). These initialisation functions always set ->flowi4_scope with RT_SCOPE_UNIVERSE and might set the ECN bits of ->flowi4_tos. This is not a problem when the route lookup is later done via ip_route_output_key_hash(), which properly clears the ECN bits from ->flowi4_tos and initialises ->flowi4_scope based on the RTO_ONLINK flag. However, some helpers call fib_lookup() directly, without sanitising the tos and scope fields, so the route lookup can fail and, as a result, the ICMP redirect or PMTU update aren't taken into account. Fix this by extracting the ->flowi4_tos and ->flowi4_scope sanitisation code into ip_rt_fix_tos(), then use this function in handlers that call fib_lookup() directly. Note 1: We can't sanitise ->flowi4_tos and ->flowi4_scope in a central place (like __build_flow_key() or flowi4_init_output()), because ip_route_output_key_hash() expects non-sanitised values. When called with sanitised values, it can erroneously overwrite RT_SCOPE_LINK with RT_SCOPE_UNIVERSE in ->flowi4_scope. Therefore we have to be careful to sanitise the values only for those paths that don't call ip_route_output_key_hash(). Note 2: The problem is mostly about sanitising ->flowi4_tos. Having ->flowi4_scope initialised with RT_SCOPE_UNIVERSE instead of RT_SCOPE_LINK probably wasn't really a problem: sockets with the SOCK_LOCALROUTE flag set (those that'd result in RTO_ONLINK being set) normally shouldn't receive ICMP redirects or PMTU updates. Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: Guillaume Nault Reviewed-by: David Ahern Signed-off-by: Jakub Kicinski --- net/ipv4/route.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index f33ad1f383b6..d5d058de3664 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -499,6 +499,15 @@ void __ip_select_ident(struct net *net, struct iphdr *iph, int segs) } EXPORT_SYMBOL(__ip_select_ident); +static void ip_rt_fix_tos(struct flowi4 *fl4) +{ + __u8 tos = RT_FL_TOS(fl4); + + fl4->flowi4_tos = tos & IPTOS_RT_MASK; + fl4->flowi4_scope = tos & RTO_ONLINK ? + RT_SCOPE_LINK : RT_SCOPE_UNIVERSE; +} + static void __build_flow_key(const struct net *net, struct flowi4 *fl4, const struct sock *sk, const struct iphdr *iph, @@ -824,6 +833,7 @@ static void ip_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_buf rt = (struct rtable *) dst; __build_flow_key(net, &fl4, sk, iph, oif, tos, prot, mark, 0); + ip_rt_fix_tos(&fl4); __ip_do_redirect(rt, skb, &fl4, true); } @@ -1048,6 +1058,7 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk, struct flowi4 fl4; ip_rt_build_flow_key(&fl4, sk, skb); + ip_rt_fix_tos(&fl4); /* Don't make lookup fail for bridged encapsulations */ if (skb && netif_is_any_bridge_port(skb->dev)) @@ -1122,6 +1133,8 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu) goto out; new = true; + } else { + ip_rt_fix_tos(&fl4); } __ip_rt_update_pmtu((struct rtable *)xfrm_dst_path(&rt->dst), &fl4, mtu); @@ -2603,7 +2616,6 @@ add: struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, const struct sk_buff *skb) { - __u8 tos = RT_FL_TOS(fl4); struct fib_result res = { .type = RTN_UNSPEC, .fi = NULL, @@ -2613,9 +2625,7 @@ struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, struct rtable *rth; fl4->flowi4_iif = LOOPBACK_IFINDEX; - fl4->flowi4_tos = tos & IPTOS_RT_MASK; - fl4->flowi4_scope = ((tos & RTO_ONLINK) ? - RT_SCOPE_LINK : RT_SCOPE_UNIVERSE); + ip_rt_fix_tos(fl4); rcu_read_lock(); rth = ip_route_output_key_hash_rcu(net, fl4, &res, skb); -- cgit v1.2.3 From ec730c3e1f0e3a80612a9be2beb00e2b4f93fe70 Mon Sep 17 00:00:00 2001 From: Guillaume Nault Date: Thu, 17 Mar 2022 13:45:11 +0100 Subject: selftest: net: Test IPv4 PMTU exceptions with DSCP and ECN Add two tests to pmtu.sh, for verifying that PMTU exceptions get properly created for routes that don't belong to the main table. A fib-rule based on the packet's DSCP field is used to jump to the correct table. ECN shouldn't interfere with this process, so each test has two components: one that only sets DSCP and one that sets both DSCP and ECN. One of the test triggers PMTU exceptions using ICMP Echo Requests, the other using UDP packets (to test different handlers in the kernel). A few adjustments are necessary in the rest of the script to allow policy routing scenarios: * Add global variable rt_table that allows setup_routing_*() to add routes to a specific routing table. By default rt_table is set to "main", so existing tests don't need to be modified. * Another global variable, policy_mark, is used to define which dsfield value is used for policy routing. This variable has no effect on tests that don't use policy routing. * The UDP version of the test uses socat. So cleanup() now also need to kill socat PIDs. * route_get_dst_pmtu_from_exception() and route_get_dst_exception() now take an optional third argument specifying the dsfield. If not specified, 0 is used, so existing users don't need to be modified. Signed-off-by: Guillaume Nault Reviewed-by: David Ahern Signed-off-by: Jakub Kicinski --- tools/testing/selftests/net/pmtu.sh | 141 +++++++++++++++++++++++++++++++++++- 1 file changed, 137 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh index 694732e4b344..736e358dc549 100755 --- a/tools/testing/selftests/net/pmtu.sh +++ b/tools/testing/selftests/net/pmtu.sh @@ -26,6 +26,15 @@ # - pmtu_ipv6 # Same as pmtu_ipv4, except for locked PMTU tests, using IPv6 # +# - pmtu_ipv4_dscp_icmp_exception +# Set up the same network topology as pmtu_ipv4, but use non-default +# routing table in A. A fib-rule is used to jump to this routing table +# based on DSCP. Send ICMPv4 packets with the expected DSCP value and +# verify that ECN doesn't interfere with the creation of PMTU exceptions. +# +# - pmtu_ipv4_dscp_udp_exception +# Same as pmtu_ipv4_dscp_icmp_exception, but use UDP instead of ICMP. +# # - pmtu_ipv4_vxlan4_exception # Set up the same network topology as pmtu_ipv4, create a VXLAN tunnel # over IPv4 between A and B, routed via R1. On the link between R1 and B, @@ -203,6 +212,8 @@ which ping6 > /dev/null 2>&1 && ping6=$(which ping6) || ping6=$(which ping) tests=" pmtu_ipv4_exception ipv4: PMTU exceptions 1 pmtu_ipv6_exception ipv6: PMTU exceptions 1 + pmtu_ipv4_dscp_icmp_exception ICMPv4 with DSCP and ECN: PMTU exceptions 1 + pmtu_ipv4_dscp_udp_exception UDPv4 with DSCP and ECN: PMTU exceptions 1 pmtu_ipv4_vxlan4_exception IPv4 over vxlan4: PMTU exceptions 1 pmtu_ipv6_vxlan4_exception IPv6 over vxlan4: PMTU exceptions 1 pmtu_ipv4_vxlan6_exception IPv4 over vxlan6: PMTU exceptions 1 @@ -323,6 +334,9 @@ routes_nh=" B 6 default 61 " +policy_mark=0x04 +rt_table=main + veth4_a_addr="192.168.1.1" veth4_b_addr="192.168.1.2" veth4_c_addr="192.168.2.10" @@ -346,6 +360,7 @@ dummy6_mask="64" err_buf= tcpdump_pids= nettest_pids= +socat_pids= err() { err_buf="${err_buf}${1} @@ -723,7 +738,7 @@ setup_routing_old() { ns_name="$(nsname ${ns})" - ip -n ${ns_name} route add ${addr} via ${gw} + ip -n "${ns_name}" route add "${addr}" table "${rt_table}" via "${gw}" ns=""; addr=""; gw="" done @@ -753,7 +768,7 @@ setup_routing_new() { ns_name="$(nsname ${ns})" - ip -n ${ns_name} -${fam} route add ${addr} nhid ${nhid} + ip -n "${ns_name}" -"${fam}" route add "${addr}" table "${rt_table}" nhid "${nhid}" ns=""; fam=""; addr=""; nhid="" done @@ -798,6 +813,24 @@ setup_routing() { return 0 } +setup_policy_routing() { + setup_routing + + ip -netns "${NS_A}" -4 rule add dsfield "${policy_mark}" \ + table "${rt_table}" + + # Set the IPv4 Don't Fragment bit with tc, since socat doesn't seem to + # have an option do to it. + tc -netns "${NS_A}" qdisc replace dev veth_A-R1 root prio + tc -netns "${NS_A}" qdisc replace dev veth_A-R2 root prio + tc -netns "${NS_A}" filter add dev veth_A-R1 \ + protocol ipv4 flower ip_proto udp \ + action pedit ex munge ip df set 0x40 pipe csum ip and udp + tc -netns "${NS_A}" filter add dev veth_A-R2 \ + protocol ipv4 flower ip_proto udp \ + action pedit ex munge ip df set 0x40 pipe csum ip and udp +} + setup_bridge() { run_cmd ${ns_a} ip link add br0 type bridge || return $ksft_skip run_cmd ${ns_a} ip link set br0 up @@ -903,6 +936,11 @@ cleanup() { done nettest_pids= + for pid in ${socat_pids}; do + kill "${pid}" + done + socat_pids= + for n in ${NS_A} ${NS_B} ${NS_C} ${NS_R1} ${NS_R2}; do ip netns del ${n} 2> /dev/null done @@ -950,15 +988,21 @@ link_get_mtu() { route_get_dst_exception() { ns_cmd="${1}" dst="${2}" + dsfield="${3}" - ${ns_cmd} ip route get "${dst}" + if [ -z "${dsfield}" ]; then + dsfield=0 + fi + + ${ns_cmd} ip route get "${dst}" dsfield "${dsfield}" } route_get_dst_pmtu_from_exception() { ns_cmd="${1}" dst="${2}" + dsfield="${3}" - mtu_parse "$(route_get_dst_exception "${ns_cmd}" ${dst})" + mtu_parse "$(route_get_dst_exception "${ns_cmd}" "${dst}" "${dsfield}")" } check_pmtu_value() { @@ -1068,6 +1112,95 @@ test_pmtu_ipv6_exception() { test_pmtu_ipvX 6 } +test_pmtu_ipv4_dscp_icmp_exception() { + rt_table=100 + + setup namespaces policy_routing || return $ksft_skip + trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \ + "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \ + "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \ + "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2 + + # Set up initial MTU values + mtu "${ns_a}" veth_A-R1 2000 + mtu "${ns_r1}" veth_R1-A 2000 + mtu "${ns_r1}" veth_R1-B 1400 + mtu "${ns_b}" veth_B-R1 1400 + + mtu "${ns_a}" veth_A-R2 2000 + mtu "${ns_r2}" veth_R2-A 2000 + mtu "${ns_r2}" veth_R2-B 1500 + mtu "${ns_b}" veth_B-R2 1500 + + len=$((2000 - 20 - 8)) # Fills MTU of veth_A-R1 + + dst1="${prefix4}.${b_r1}.1" + dst2="${prefix4}.${b_r2}.1" + + # Create route exceptions + dsfield=${policy_mark} # No ECN bit set (Not-ECT) + run_cmd "${ns_a}" ping -q -M want -Q "${dsfield}" -c 1 -w 1 -s "${len}" "${dst1}" + + dsfield=$(printf "%#x" $((policy_mark + 0x02))) # ECN=2 (ECT(0)) + run_cmd "${ns_a}" ping -q -M want -Q "${dsfield}" -c 1 -w 1 -s "${len}" "${dst2}" + + # Check that exceptions have been created with the correct PMTU + pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" "${policy_mark}")" + check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1 + + pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" "${policy_mark}")" + check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1 +} + +test_pmtu_ipv4_dscp_udp_exception() { + rt_table=100 + + if ! which socat > /dev/null 2>&1; then + echo "'socat' command not found; skipping tests" + return $ksft_skip + fi + + setup namespaces policy_routing || return $ksft_skip + trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \ + "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \ + "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \ + "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2 + + # Set up initial MTU values + mtu "${ns_a}" veth_A-R1 2000 + mtu "${ns_r1}" veth_R1-A 2000 + mtu "${ns_r1}" veth_R1-B 1400 + mtu "${ns_b}" veth_B-R1 1400 + + mtu "${ns_a}" veth_A-R2 2000 + mtu "${ns_r2}" veth_R2-A 2000 + mtu "${ns_r2}" veth_R2-B 1500 + mtu "${ns_b}" veth_B-R2 1500 + + len=$((2000 - 20 - 8)) # Fills MTU of veth_A-R1 + + dst1="${prefix4}.${b_r1}.1" + dst2="${prefix4}.${b_r2}.1" + + # Create route exceptions + run_cmd_bg "${ns_b}" socat UDP-LISTEN:50000 OPEN:/dev/null,wronly=1 + socat_pids="${socat_pids} $!" + + dsfield=${policy_mark} # No ECN bit set (Not-ECT) + run_cmd "${ns_a}" socat OPEN:/dev/zero,rdonly=1,readbytes="${len}" \ + UDP:"${dst1}":50000,tos="${dsfield}" + + dsfield=$(printf "%#x" $((policy_mark + 0x02))) # ECN=2 (ECT(0)) + run_cmd "${ns_a}" socat OPEN:/dev/zero,rdonly=1,readbytes="${len}" \ + UDP:"${dst2}":50000,tos="${dsfield}" + + # Check that exceptions have been created with the correct PMTU + pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" "${policy_mark}")" + check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1 + pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" "${policy_mark}")" + check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1 +} + test_pmtu_ipvX_over_vxlanY_or_geneveY_exception() { type=${1} family=${2} -- cgit v1.2.3 From 3ef3905aa3b5b3e222ee6eb0210bfd999417a8cc Mon Sep 17 00:00:00 2001 From: Yonglong Li Date: Thu, 17 Mar 2022 15:09:53 -0700 Subject: mptcp: Fix crash due to tcp_tsorted_anchor was initialized before release skb Got crash when doing pressure test of mptcp: =========================================================================== dst_release: dst:ffffa06ce6e5c058 refcnt:-1 kernel tried to execute NX-protected page - exploit attempt? (uid: 0) BUG: unable to handle kernel paging request at ffffa06ce6e5c058 PGD 190a01067 P4D 190a01067 PUD 43fffb067 PMD 22e403063 PTE 8000000226e5c063 Oops: 0011 [#1] SMP PTI CPU: 7 PID: 7823 Comm: kworker/7:0 Kdump: loaded Tainted: G E Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.2.1 04/01/2014 Call Trace: ? skb_release_head_state+0x68/0x100 ? skb_release_all+0xe/0x30 ? kfree_skb+0x32/0xa0 ? mptcp_sendmsg_frag+0x57e/0x750 ? __mptcp_retrans+0x21b/0x3c0 ? __switch_to_asm+0x35/0x70 ? mptcp_worker+0x25e/0x320 ? process_one_work+0x1a7/0x360 ? worker_thread+0x30/0x390 ? create_worker+0x1a0/0x1a0 ? kthread+0x112/0x130 ? kthread_flush_work_fn+0x10/0x10 ? ret_from_fork+0x35/0x40 =========================================================================== In __mptcp_alloc_tx_skb skb was allocated and skb->tcp_tsorted_anchor will be initialized, in under memory pressure situation sk_wmem_schedule will return false and then kfree_skb. In this case skb->_skb_refdst is not null because_skb_refdst and tcp_tsorted_anchor are stored in the same mem, and kfree_skb will try to release dst and cause crash. Fixes: f70cad1085d1 ("mptcp: stop relying on tcp_tx_skb_cache") Reviewed-by: Paolo Abeni Signed-off-by: Yonglong Li Signed-off-by: Mat Martineau Link: https://lore.kernel.org/r/20220317220953.426024-1-mathew.j.martineau@linux.intel.com Signed-off-by: Jakub Kicinski --- net/mptcp/protocol.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 1c72f25f083e..014c9d88f947 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -1196,6 +1196,7 @@ static struct sk_buff *__mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, g tcp_skb_entail(ssk, skb); return skb; } + tcp_skb_tsorted_anchor_cleanup(skb); kfree_skb(skb); return NULL; } -- cgit v1.2.3 From 0caf6d9922192dd1afa8dc2131abfb4df1443b9f Mon Sep 17 00:00:00 2001 From: Petr Machata Date: Thu, 17 Mar 2022 15:53:06 +0100 Subject: af_netlink: Fix shift out of bounds in group mask calculation When a netlink message is received, netlink_recvmsg() fills in the address of the sender. One of the fields is the 32-bit bitfield nl_groups, which carries the multicast group on which the message was received. The least significant bit corresponds to group 1, and therefore the highest group that the field can represent is 32. Above that, the UB sanitizer flags the out-of-bounds shift attempts. Which bits end up being set in such case is implementation defined, but it's either going to be a wrong non-zero value, or zero, which is at least not misleading. Make the latter choice deterministic by always setting to 0 for higher-numbered multicast groups. To get information about membership in groups >= 32, userspace is expected to use nl_pktinfo control messages[0], which are enabled by NETLINK_PKTINFO socket option. [0] https://lwn.net/Articles/147608/ The way to trigger this issue is e.g. through monitoring the BRVLAN group: # bridge monitor vlan & # ip link add name br type bridge Which produces the following citation: UBSAN: shift-out-of-bounds in net/netlink/af_netlink.c:162:19 shift exponent 32 is too large for 32-bit type 'int' Fixes: f7fa9b10edbb ("[NETLINK]: Support dynamic number of multicast groups per netlink family") Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Link: https://lore.kernel.org/r/2bef6aabf201d1fc16cca139a744700cff9dcb04.1647527635.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski --- net/netlink/af_netlink.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index 7b344035bfe3..47a876ccd288 100644 --- a/net/netlink/af_netlink.c +++ b/net/netlink/af_netlink.c @@ -159,6 +159,8 @@ EXPORT_SYMBOL(do_trace_netlink_extack); static inline u32 netlink_group_mask(u32 group) { + if (group > 32) + return 0; return group ? 1 << (group - 1) : 0; } -- cgit v1.2.3 From 9fd75b66b8f68498454d685dc4ba13192ae069b0 Mon Sep 17 00:00:00 2001 From: Duoming Zhou Date: Fri, 18 Mar 2022 08:54:04 +0800 Subject: ax25: Fix refcount leaks caused by ax25_cb_del() The previous commit d01ffb9eee4a ("ax25: add refcount in ax25_dev to avoid UAF bugs") and commit feef318c855a ("ax25: fix UAF bugs of net_device caused by rebinding operation") increase the refcounts of ax25_dev and net_device in ax25_bind() and decrease the matching refcounts in ax25_kill_by_device() in order to prevent UAF bugs, but there are reference count leaks. The root cause of refcount leaks is shown below: (Thread 1) | (Thread 2) ax25_bind() | ... | ax25_addr_ax25dev() | ax25_dev_hold() //(1) | ... | dev_hold_track() //(2) | ... | ax25_destroy_socket() | ax25_cb_del() | ... | hlist_del_init() //(3) | | (Thread 3) | ax25_kill_by_device() | ... | ax25_for_each(s, &ax25_list) { | if (s->ax25_dev == ax25_dev) //(4) | ... | Firstly, we use ax25_bind() to increase the refcount of ax25_dev in position (1) and increase the refcount of net_device in position (2). Then, we use ax25_cb_del() invoked by ax25_destroy_socket() to delete ax25_cb in hlist in position (3) before calling ax25_kill_by_device(). Finally, the decrements of refcounts in ax25_kill_by_device() will not be executed, because no s->ax25_dev equals to ax25_dev in position (4). This patch adds decrements of refcounts in ax25_release() and use lock_sock() to do synchronization. If refcounts decrease in ax25_release(), the decrements of refcounts in ax25_kill_by_device() will not be executed and vice versa. Fixes: d01ffb9eee4a ("ax25: add refcount in ax25_dev to avoid UAF bugs") Fixes: 87563a043cef ("ax25: fix reference count leaks of ax25_dev") Fixes: feef318c855a ("ax25: fix UAF bugs of net_device caused by rebinding operation") Reported-by: Thomas Osterried Signed-off-by: Duoming Zhou Signed-off-by: David S. Miller --- net/ax25/af_ax25.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c index 6bd097180772..cf8847cfc664 100644 --- a/net/ax25/af_ax25.c +++ b/net/ax25/af_ax25.c @@ -98,8 +98,10 @@ again: spin_unlock_bh(&ax25_list_lock); lock_sock(sk); s->ax25_dev = NULL; - dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker); - ax25_dev_put(ax25_dev); + if (sk->sk_socket) { + dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker); + ax25_dev_put(ax25_dev); + } ax25_disconnect(s, ENETUNREACH); release_sock(sk); spin_lock_bh(&ax25_list_lock); @@ -979,14 +981,20 @@ static int ax25_release(struct socket *sock) { struct sock *sk = sock->sk; ax25_cb *ax25; + ax25_dev *ax25_dev; if (sk == NULL) return 0; sock_hold(sk); - sock_orphan(sk); lock_sock(sk); + sock_orphan(sk); ax25 = sk_to_ax25(sk); + ax25_dev = ax25->ax25_dev; + if (ax25_dev) { + dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker); + ax25_dev_put(ax25_dev); + } if (sk->sk_type == SOCK_SEQPACKET) { switch (ax25->state) { -- cgit v1.2.3 From fc6d01ff9ef03b66d4a3a23b46fc3c3d8cf92009 Mon Sep 17 00:00:00 2001 From: Duoming Zhou Date: Fri, 18 Mar 2022 08:54:05 +0800 Subject: ax25: Fix NULL pointer dereferences in ax25 timers The previous commit 7ec02f5ac8a5 ("ax25: fix NPD bug in ax25_disconnect") move ax25_disconnect into lock_sock() in order to prevent NPD bugs. But there are race conditions that may lead to null pointer dereferences in ax25_heartbeat_expiry(), ax25_t1timer_expiry(), ax25_t2timer_expiry(), ax25_t3timer_expiry() and ax25_idletimer_expiry(), when we use ax25_kill_by_device() to detach the ax25 device. One of the race conditions that cause null pointer dereferences can be shown as below: (Thread 1) | (Thread 2) ax25_connect() | ax25_std_establish_data_link() | ax25_start_t1timer() | mod_timer(&ax25->t1timer,..) | | ax25_kill_by_device() (wait a time) | ... | s->ax25_dev = NULL; //(1) ax25_t1timer_expiry() | ax25->ax25_dev->values[..] //(2)| ... ... | We set null to ax25_cb->ax25_dev in position (1) and dereference the null pointer in position (2). The corresponding fail log is shown below: =============================================================== BUG: kernel NULL pointer dereference, address: 0000000000000050 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.17.0-rc6-00794-g45690b7d0 RIP: 0010:ax25_t1timer_expiry+0x12/0x40 ... Call Trace: call_timer_fn+0x21/0x120 __run_timers.part.0+0x1ca/0x250 run_timer_softirq+0x2c/0x60 __do_softirq+0xef/0x2f3 irq_exit_rcu+0xb6/0x100 sysvec_apic_timer_interrupt+0xa2/0xd0 ... This patch moves ax25_disconnect() before s->ax25_dev = NULL and uses del_timer_sync() to delete timers in ax25_disconnect(). If ax25_disconnect() is called by ax25_kill_by_device() or ax25->ax25_dev is NULL, the reason in ax25_disconnect() will be equal to ENETUNREACH, it will wait all timers to stop before we set null to s->ax25_dev in ax25_kill_by_device(). Fixes: 7ec02f5ac8a5 ("ax25: fix NPD bug in ax25_disconnect") Signed-off-by: Duoming Zhou Signed-off-by: David S. Miller --- net/ax25/af_ax25.c | 4 ++-- net/ax25/ax25_subr.c | 20 ++++++++++++++------ 2 files changed, 16 insertions(+), 8 deletions(-) diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c index cf8847cfc664..992b6e5d85d7 100644 --- a/net/ax25/af_ax25.c +++ b/net/ax25/af_ax25.c @@ -89,20 +89,20 @@ again: sk = s->sk; if (!sk) { spin_unlock_bh(&ax25_list_lock); - s->ax25_dev = NULL; ax25_disconnect(s, ENETUNREACH); + s->ax25_dev = NULL; spin_lock_bh(&ax25_list_lock); goto again; } sock_hold(sk); spin_unlock_bh(&ax25_list_lock); lock_sock(sk); + ax25_disconnect(s, ENETUNREACH); s->ax25_dev = NULL; if (sk->sk_socket) { dev_put_track(ax25_dev->dev, &ax25_dev->dev_tracker); ax25_dev_put(ax25_dev); } - ax25_disconnect(s, ENETUNREACH); release_sock(sk); spin_lock_bh(&ax25_list_lock); sock_put(sk); diff --git a/net/ax25/ax25_subr.c b/net/ax25/ax25_subr.c index 15ab812c4fe4..3a476e4f6cd0 100644 --- a/net/ax25/ax25_subr.c +++ b/net/ax25/ax25_subr.c @@ -261,12 +261,20 @@ void ax25_disconnect(ax25_cb *ax25, int reason) { ax25_clear_queues(ax25); - if (!ax25->sk || !sock_flag(ax25->sk, SOCK_DESTROY)) - ax25_stop_heartbeat(ax25); - ax25_stop_t1timer(ax25); - ax25_stop_t2timer(ax25); - ax25_stop_t3timer(ax25); - ax25_stop_idletimer(ax25); + if (reason == ENETUNREACH) { + del_timer_sync(&ax25->timer); + del_timer_sync(&ax25->t1timer); + del_timer_sync(&ax25->t2timer); + del_timer_sync(&ax25->t3timer); + del_timer_sync(&ax25->idletimer); + } else { + if (!ax25->sk || !sock_flag(ax25->sk, SOCK_DESTROY)) + ax25_stop_heartbeat(ax25); + ax25_stop_t1timer(ax25); + ax25_stop_t2timer(ax25); + ax25_stop_t3timer(ax25); + ax25_stop_idletimer(ax25); + } ax25->state = AX25_STATE_0; -- cgit v1.2.3 From 8d3ea3d402db94b61075617e71b67459a714a502 Mon Sep 17 00:00:00 2001 From: Jeremy Linton Date: Wed, 9 Mar 2022 22:53:58 -0600 Subject: net: bcmgenet: Use stronger register read/writes to assure ordering GCC12 appears to be much smarter about its dependency tracking and is aware that the relaxed variants are just normal loads and stores and this is causing problems like: [ 210.074549] ------------[ cut here ]------------ [ 210.079223] NETDEV WATCHDOG: enabcm6e4ei0 (bcmgenet): transmit queue 1 timed out [ 210.086717] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:529 dev_watchdog+0x234/0x240 [ 210.095044] Modules linked in: genet(E) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat] [ 210.146561] ACPI CPPC: PCC check channel failed for ss: 0. ret=-110 [ 210.146927] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G E 5.17.0-rc7G12+ #58 [ 210.153226] CPPC Cpufreq:cppc_scale_freq_workfn: failed to read perf counters [ 210.161349] Hardware name: Raspberry Pi Foundation Raspberry Pi 4 Model B/Raspberry Pi 4 Model B, BIOS EDK2-DEV 02/08/2022 [ 210.161353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 210.161358] pc : dev_watchdog+0x234/0x240 [ 210.161364] lr : dev_watchdog+0x234/0x240 [ 210.161368] sp : ffff8000080a3a40 [ 210.161370] x29: ffff8000080a3a40 x28: ffffcd425af87000 x27: ffff8000080a3b20 [ 210.205150] x26: ffffcd425aa00000 x25: 0000000000000001 x24: ffffcd425af8ec08 [ 210.212321] x23: 0000000000000100 x22: ffffcd425af87000 x21: ffff55b142688000 [ 210.219491] x20: 0000000000000001 x19: ffff55b1426884c8 x18: ffffffffffffffff [ 210.226661] x17: 64656d6974203120 x16: 0000000000000001 x15: 6d736e617274203a [ 210.233831] x14: 2974656e65676d63 x13: ffffcd4259c300d8 x12: ffffcd425b07d5f0 [ 210.241001] x11: 00000000ffffffff x10: ffffcd425b07d5f0 x9 : ffffcd4258bdad9c [ 210.248171] x8 : 00000000ffffdfff x7 : 000000000000003f x6 : 0000000000000000 [ 210.255341] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000001000 [ 210.262511] x2 : 0000000000001000 x1 : 0000000000000005 x0 : 0000000000000044 [ 210.269682] Call trace: [ 210.272133] dev_watchdog+0x234/0x240 [ 210.275811] call_timer_fn+0x3c/0x15c [ 210.279489] __run_timers.part.0+0x288/0x310 [ 210.283777] run_timer_softirq+0x48/0x80 [ 210.287716] __do_softirq+0x128/0x360 [ 210.291392] __irq_exit_rcu+0x138/0x140 [ 210.295243] irq_exit_rcu+0x1c/0x30 [ 210.298745] el1_interrupt+0x38/0x54 [ 210.302334] el1h_64_irq_handler+0x18/0x24 [ 210.306445] el1h_64_irq+0x7c/0x80 [ 210.309857] arch_cpu_idle+0x18/0x2c [ 210.313445] default_idle_call+0x4c/0x140 [ 210.317470] cpuidle_idle_call+0x14c/0x1a0 [ 210.321584] do_idle+0xb0/0x100 [ 210.324737] cpu_startup_entry+0x30/0x8c [ 210.328675] secondary_start_kernel+0xe4/0x110 [ 210.333138] __secondary_switched+0x94/0x98 The assumption when these were relaxed seems to be that device memory would be mapped non reordering, and that other constructs (spinlocks/etc) would provide the barriers to assure that packet data and in memory rings/queues were ordered with respect to device register reads/writes. This itself seems a bit sketchy, but the real problem with GCC12 is that it is moving the actual reads/writes around at will as though they were independent operations when in truth they are not, but the compiler can't know that. When looking at the assembly dumps for many of these routines its possible to see very clean, but not strictly in program order operations occurring as the compiler would be free to do if these weren't actually register reads/write operations. Its possible to suppress the timeout with a liberal bit of dma_mb()'s sprinkled around but the device still seems unable to reliably send/receive data. A better plan is to use the safer readl/writel everywhere. Since this partially reverts an older commit, which notes the use of the relaxed variants for performance reasons. I would suggest that any performance problems with this commit are targeted at relaxing only the performance critical code paths after assuring proper barriers. Fixes: 69d2ea9c79898 ("net: bcmgenet: Use correct I/O accessors") Reported-by: Peter Robinson Signed-off-by: Jeremy Linton Acked-by: Peter Robinson Tested-by: Peter Robinson Acked-by: Florian Fainelli Link: https://lore.kernel.org/r/20220310045358.224350-1-jeremy.linton@arm.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/broadcom/genet/bcmgenet.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c index 2da804f84b48..bd5998012a87 100644 --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c @@ -76,7 +76,7 @@ static inline void bcmgenet_writel(u32 value, void __iomem *offset) if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)) __raw_writel(value, offset); else - writel_relaxed(value, offset); + writel(value, offset); } static inline u32 bcmgenet_readl(void __iomem *offset) @@ -84,7 +84,7 @@ static inline u32 bcmgenet_readl(void __iomem *offset) if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)) return __raw_readl(offset); else - return readl_relaxed(offset); + return readl(offset); } static inline void dmadesc_set_length_status(struct bcmgenet_priv *priv, -- cgit v1.2.3 From ed0c99dc0f499ff8b6e75b5ae6092ab42be1ad39 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Mon, 21 Mar 2022 09:59:57 -0700 Subject: tcp: ensure PMTU updates are processed during fastopen tp->rx_opt.mss_clamp is not populated, yet, during TFO send so we rise it to the local MSS. tp->mss_cache is not updated, however: tcp_v6_connect(): tp->rx_opt.mss_clamp = IPV6_MIN_MTU - headers; tcp_connect(): tcp_connect_init(): tp->mss_cache = min(mtu, tp->rx_opt.mss_clamp) tcp_send_syn_data(): tp->rx_opt.mss_clamp = tp->advmss After recent fixes to ICMPv6 PTB handling we started dropping PMTU updates higher than tp->mss_cache. Because of the stale tp->mss_cache value PMTU updates during TFO are always dropped. Thanks to Wei for helping zero in on the problem and the fix! Fixes: c7bb4b89033b ("ipv6: tcp: drop silly ICMPv6 packet too big messages") Reported-by: Andre Nash Reported-by: Neil Spring Reviewed-by: Wei Wang Acked-by: Yuchung Cheng Acked-by: Martin KaFai Lau Signed-off-by: Jakub Kicinski Reviewed-by: Eric Dumazet Link: https://lore.kernel.org/r/20220321165957.1769954-1-kuba@kernel.org Signed-off-by: Jakub Kicinski --- net/ipv4/tcp_output.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 5079832af5c1..257780f93305 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3719,6 +3719,7 @@ static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb) */ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn) { + struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); struct tcp_fastopen_request *fo = tp->fastopen_req; int space, err = 0; @@ -3733,8 +3734,10 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn) * private TCP options. The cost is reduced data space in SYN :( */ tp->rx_opt.mss_clamp = tcp_mss_clamp(tp, tp->rx_opt.mss_clamp); + /* Sync mss_cache after updating the mss_clamp */ + tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); - space = __tcp_mtu_to_mss(sk, inet_csk(sk)->icsk_pmtu_cookie) - + space = __tcp_mtu_to_mss(sk, icsk->icsk_pmtu_cookie) - MAX_TCP_OPTION_SPACE; space = min_t(size_t, space, fo->size); -- cgit v1.2.3 From 60b44ca6bd7518dd38fa2719bc9240378b6172c3 Mon Sep 17 00:00:00 2001 From: Aaron Conole Date: Fri, 18 Mar 2022 08:43:19 -0400 Subject: openvswitch: always update flow key after nat During NAT, a tuple collision may occur. When this happens, openvswitch will make a second pass through NAT which will perform additional packet modification. This will update the skb data, but not the flow key that OVS uses. This means that future flow lookups, and packet matches will have incorrect data. This has been supported since 5d50aa83e2c8 ("openvswitch: support asymmetric conntrack"). That commit failed to properly update the sw_flow_key attributes, since it only called the ovs_ct_nat_update_key once, rather than each time ovs_ct_nat_execute was called. As these two operations are linked, the ovs_ct_nat_execute() function should always make sure that the sw_flow_key is updated after a successful call through NAT infrastructure. Fixes: 5d50aa83e2c8 ("openvswitch: support asymmetric conntrack") Cc: Dumitru Ceara Cc: Numan Siddique Signed-off-by: Aaron Conole Acked-by: Eelco Chaudron Link: https://lore.kernel.org/r/20220318124319.3056455-1-aconole@redhat.com Signed-off-by: Jakub Kicinski --- net/openvswitch/conntrack.c | 118 ++++++++++++++++++++++---------------------- 1 file changed, 59 insertions(+), 59 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index c07afff57dd3..4a947c13c813 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -734,6 +734,57 @@ static bool skb_nfct_cached(struct net *net, } #if IS_ENABLED(CONFIG_NF_NAT) +static void ovs_nat_update_key(struct sw_flow_key *key, + const struct sk_buff *skb, + enum nf_nat_manip_type maniptype) +{ + if (maniptype == NF_NAT_MANIP_SRC) { + __be16 src; + + key->ct_state |= OVS_CS_F_SRC_NAT; + if (key->eth.type == htons(ETH_P_IP)) + key->ipv4.addr.src = ip_hdr(skb)->saddr; + else if (key->eth.type == htons(ETH_P_IPV6)) + memcpy(&key->ipv6.addr.src, &ipv6_hdr(skb)->saddr, + sizeof(key->ipv6.addr.src)); + else + return; + + if (key->ip.proto == IPPROTO_UDP) + src = udp_hdr(skb)->source; + else if (key->ip.proto == IPPROTO_TCP) + src = tcp_hdr(skb)->source; + else if (key->ip.proto == IPPROTO_SCTP) + src = sctp_hdr(skb)->source; + else + return; + + key->tp.src = src; + } else { + __be16 dst; + + key->ct_state |= OVS_CS_F_DST_NAT; + if (key->eth.type == htons(ETH_P_IP)) + key->ipv4.addr.dst = ip_hdr(skb)->daddr; + else if (key->eth.type == htons(ETH_P_IPV6)) + memcpy(&key->ipv6.addr.dst, &ipv6_hdr(skb)->daddr, + sizeof(key->ipv6.addr.dst)); + else + return; + + if (key->ip.proto == IPPROTO_UDP) + dst = udp_hdr(skb)->dest; + else if (key->ip.proto == IPPROTO_TCP) + dst = tcp_hdr(skb)->dest; + else if (key->ip.proto == IPPROTO_SCTP) + dst = sctp_hdr(skb)->dest; + else + return; + + key->tp.dst = dst; + } +} + /* Modelled after nf_nat_ipv[46]_fn(). * range is only used for new, uninitialized NAT state. * Returns either NF_ACCEPT or NF_DROP. @@ -741,7 +792,7 @@ static bool skb_nfct_cached(struct net *net, static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct, enum ip_conntrack_info ctinfo, const struct nf_nat_range2 *range, - enum nf_nat_manip_type maniptype) + enum nf_nat_manip_type maniptype, struct sw_flow_key *key) { int hooknum, nh_off, err = NF_ACCEPT; @@ -813,58 +864,11 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct, push: skb_push_rcsum(skb, nh_off); - return err; -} - -static void ovs_nat_update_key(struct sw_flow_key *key, - const struct sk_buff *skb, - enum nf_nat_manip_type maniptype) -{ - if (maniptype == NF_NAT_MANIP_SRC) { - __be16 src; - - key->ct_state |= OVS_CS_F_SRC_NAT; - if (key->eth.type == htons(ETH_P_IP)) - key->ipv4.addr.src = ip_hdr(skb)->saddr; - else if (key->eth.type == htons(ETH_P_IPV6)) - memcpy(&key->ipv6.addr.src, &ipv6_hdr(skb)->saddr, - sizeof(key->ipv6.addr.src)); - else - return; - - if (key->ip.proto == IPPROTO_UDP) - src = udp_hdr(skb)->source; - else if (key->ip.proto == IPPROTO_TCP) - src = tcp_hdr(skb)->source; - else if (key->ip.proto == IPPROTO_SCTP) - src = sctp_hdr(skb)->source; - else - return; - - key->tp.src = src; - } else { - __be16 dst; - - key->ct_state |= OVS_CS_F_DST_NAT; - if (key->eth.type == htons(ETH_P_IP)) - key->ipv4.addr.dst = ip_hdr(skb)->daddr; - else if (key->eth.type == htons(ETH_P_IPV6)) - memcpy(&key->ipv6.addr.dst, &ipv6_hdr(skb)->daddr, - sizeof(key->ipv6.addr.dst)); - else - return; - - if (key->ip.proto == IPPROTO_UDP) - dst = udp_hdr(skb)->dest; - else if (key->ip.proto == IPPROTO_TCP) - dst = tcp_hdr(skb)->dest; - else if (key->ip.proto == IPPROTO_SCTP) - dst = sctp_hdr(skb)->dest; - else - return; + /* Update the flow key if NAT successful. */ + if (err == NF_ACCEPT) + ovs_nat_update_key(key, skb, maniptype); - key->tp.dst = dst; - } + return err; } /* Returns NF_DROP if the packet should be dropped, NF_ACCEPT otherwise. */ @@ -906,7 +910,7 @@ static int ovs_ct_nat(struct net *net, struct sw_flow_key *key, } else { return NF_ACCEPT; /* Connection is not NATed. */ } - err = ovs_ct_nat_execute(skb, ct, ctinfo, &info->range, maniptype); + err = ovs_ct_nat_execute(skb, ct, ctinfo, &info->range, maniptype, key); if (err == NF_ACCEPT && ct->status & IPS_DST_NAT) { if (ct->status & IPS_SRC_NAT) { @@ -916,17 +920,13 @@ static int ovs_ct_nat(struct net *net, struct sw_flow_key *key, maniptype = NF_NAT_MANIP_SRC; err = ovs_ct_nat_execute(skb, ct, ctinfo, &info->range, - maniptype); + maniptype, key); } else if (CTINFO2DIR(ctinfo) == IP_CT_DIR_ORIGINAL) { err = ovs_ct_nat_execute(skb, ct, ctinfo, NULL, - NF_NAT_MANIP_SRC); + NF_NAT_MANIP_SRC, key); } } - /* Mark NAT done if successful and update the flow key. */ - if (err == NF_ACCEPT) - ovs_nat_update_key(key, skb, maniptype); - return err; } #else /* !CONFIG_NF_NAT */ -- cgit v1.2.3 From 8fd36358ce82382519b50b05f437493e1e00c4a9 Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Fri, 18 Mar 2022 21:54:43 +0200 Subject: net: dsa: fix panic on shutdown if multi-chip tree failed to probe DSA probing is atypical because a tree of devices must probe all at once, so out of N switches which call dsa_tree_setup_routing_table() during probe, for (N - 1) of them, "complete" will return false and they will exit probing early. The Nth switch will set up the whole tree on their behalf. The implication is that for (N - 1) switches, the driver binds to the device successfully, without doing anything. When the driver is bound, the ->shutdown() method may run. But if the Nth switch has failed to initialize the tree, there is nothing to do for the (N - 1) driver instances, since the slave devices have not been created, etc. Moreover, dsa_switch_shutdown() expects that the calling @ds has been in fact initialized, so it jumps at dereferencing the various data structures, which is incorrect. Avoid the ensuing NULL pointer dereferences by simply checking whether the Nth switch has previously set "ds->setup = true" for the switch which is currently shutting down. The entire setup is serialized under dsa2_mutex which we already hold. Fixes: 0650bf52b31f ("net: dsa: be compatible with masters which unregister on shutdown") Signed-off-by: Vladimir Oltean Link: https://lore.kernel.org/r/20220318195443.275026-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski --- net/dsa/dsa2.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c index 88e2808019b4..a39bbed77f87 100644 --- a/net/dsa/dsa2.c +++ b/net/dsa/dsa2.c @@ -1722,6 +1722,10 @@ void dsa_switch_shutdown(struct dsa_switch *ds) struct dsa_port *dp; mutex_lock(&dsa2_mutex); + + if (!ds->setup) + goto out; + rtnl_lock(); dsa_switch_for_each_user_port(dp, ds) { @@ -1738,6 +1742,7 @@ void dsa_switch_shutdown(struct dsa_switch *ds) dp->master->dsa_ptr = NULL; rtnl_unlock(); +out: mutex_unlock(&dsa2_mutex); } EXPORT_SYMBOL_GPL(dsa_switch_shutdown); -- cgit v1.2.3 From 6b3c74550224c3be24c4cf6ab8c333602b458bff Mon Sep 17 00:00:00 2001 From: Yang Yingliang Date: Sat, 19 Mar 2022 11:24:50 +0800 Subject: net: wwan: qcom_bam_dmux: fix wrong pointer passed to IS_ERR() It should check dmux->tx after calling dma_request_chan(). Fixes: 21a0ffd9b38c ("net: wwan: Add Qualcomm BAM-DMUX WWAN network driver") Signed-off-by: Yang Yingliang Reviewed-by: Stephan Gerhold Link: https://lore.kernel.org/r/20220319032450.3288224-1-yangyingliang@huawei.com Signed-off-by: Paolo Abeni --- drivers/net/wwan/qcom_bam_dmux.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/wwan/qcom_bam_dmux.c b/drivers/net/wwan/qcom_bam_dmux.c index 5dfa2eba6014..17d46f4d2913 100644 --- a/drivers/net/wwan/qcom_bam_dmux.c +++ b/drivers/net/wwan/qcom_bam_dmux.c @@ -755,7 +755,7 @@ static int __maybe_unused bam_dmux_runtime_resume(struct device *dev) return 0; dmux->tx = dma_request_chan(dev, "tx"); - if (IS_ERR(dmux->rx)) { + if (IS_ERR(dmux->tx)) { dev_err(dev, "Failed to request TX DMA channel: %pe\n", dmux->tx); dmux->tx = NULL; bam_dmux_runtime_suspend(dev); -- cgit v1.2.3 From 6a7d8cff4a3301087dd139293e9bddcf63827282 Mon Sep 17 00:00:00 2001 From: Hoang Le Date: Mon, 21 Mar 2022 11:22:29 +0700 Subject: tipc: fix the timer expires after interval 100ms In the timer callback function tipc_sk_timeout(), we're trying to reschedule another timeout to retransmit a setup request if destination link is congested. But we use the incorrect timeout value (msecs_to_jiffies(100)) instead of (jiffies + msecs_to_jiffies(100)), so that the timer expires immediately, it's irrelevant for original description. In this commit we correct the timeout value in sk_reset_timer() Fixes: 6787927475e5 ("tipc: buffer overflow handling in listener socket") Acked-by: Ying Xue Signed-off-by: Hoang Le Link: https://lore.kernel.org/r/20220321042229.314288-1-hoang.h.le@dektech.com.au Signed-off-by: Paolo Abeni --- net/tipc/socket.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 7545321c3440..17f8c523e33b 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -2852,7 +2852,8 @@ static void tipc_sk_retry_connect(struct sock *sk, struct sk_buff_head *list) /* Try again later if dest link is congested */ if (tsk->cong_link_cnt) { - sk_reset_timer(sk, &sk->sk_timer, msecs_to_jiffies(100)); + sk_reset_timer(sk, &sk->sk_timer, + jiffies + msecs_to_jiffies(100)); return; } /* Prepare SYN for retransmit */ -- cgit v1.2.3 From 32d53c0aa3a7b727243473949bad2a830b908edc Mon Sep 17 00:00:00 2001 From: Alexander Lobakin Date: Wed, 23 Mar 2022 13:43:52 +0100 Subject: ice: fix 'scheduling while atomic' on aux critical err interrupt There's a kernel BUG splat on processing aux critical error interrupts in ice_misc_intr(): [ 2100.917085] BUG: scheduling while atomic: swapper/15/0/0x00010000 ... [ 2101.060770] Call Trace: [ 2101.063229] [ 2101.065252] dump_stack+0x41/0x60 [ 2101.068587] __schedule_bug.cold.100+0x4c/0x58 [ 2101.073060] __schedule+0x6a4/0x830 [ 2101.076570] schedule+0x35/0xa0 [ 2101.079727] schedule_preempt_disabled+0xa/0x10 [ 2101.084284] __mutex_lock.isra.7+0x310/0x420 [ 2101.088580] ? ice_misc_intr+0x201/0x2e0 [ice] [ 2101.093078] ice_send_event_to_aux+0x25/0x70 [ice] [ 2101.097921] ice_misc_intr+0x220/0x2e0 [ice] [ 2101.102232] __handle_irq_event_percpu+0x40/0x180 [ 2101.106965] handle_irq_event_percpu+0x30/0x80 [ 2101.111434] handle_irq_event+0x36/0x53 [ 2101.115292] handle_edge_irq+0x82/0x190 [ 2101.119148] handle_irq+0x1c/0x30 [ 2101.122480] do_IRQ+0x49/0xd0 [ 2101.125465] common_interrupt+0xf/0xf [ 2101.129146] ... As Andrew correctly mentioned previously[0], the following call ladder happens: ice_misc_intr() <- hardirq ice_send_event_to_aux() device_lock() mutex_lock() might_sleep() might_resched() <- oops Add a new PF state bit which indicates that an aux critical error occurred and serve it in ice_service_task() in process context. The new ice_pf::oicr_err_reg is read-write in both hardirq and process contexts, but only 3 bits of non-critical data probably aren't worth explicit synchronizing (and they're even in the same byte [31:24]). [0] https://lore.kernel.org/all/YeSRUVmrdmlUXHDn@lunn.ch Fixes: 348048e724a0e ("ice: Implement iidc operations") Signed-off-by: Alexander Lobakin Tested-by: Michal Kubiak Acked-by: Tony Nguyen Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/ice/ice.h | 2 ++ drivers/net/ethernet/intel/ice/ice_main.c | 25 +++++++++++++++---------- 2 files changed, 17 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h index bea1d1e39fa2..2ca887076dd4 100644 --- a/drivers/net/ethernet/intel/ice/ice.h +++ b/drivers/net/ethernet/intel/ice/ice.h @@ -290,6 +290,7 @@ enum ice_pf_state { ICE_LINK_DEFAULT_OVERRIDE_PENDING, ICE_PHY_INIT_COMPLETE, ICE_FD_VF_FLUSH_CTX, /* set at FD Rx IRQ or timeout */ + ICE_AUX_ERR_PENDING, ICE_STATE_NBITS /* must be last */ }; @@ -559,6 +560,7 @@ struct ice_pf { wait_queue_head_t reset_wait_queue; u32 hw_csum_rx_error; + u32 oicr_err_reg; u16 oicr_idx; /* Other interrupt cause MSIX vector index */ u16 num_avail_sw_msix; /* remaining MSIX SW vectors left unclaimed */ u16 max_pf_txqs; /* Total Tx queues PF wide */ diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index b7e8744b0c0a..296f9d5f7408 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -2255,6 +2255,19 @@ static void ice_service_task(struct work_struct *work) return; } + if (test_and_clear_bit(ICE_AUX_ERR_PENDING, pf->state)) { + struct iidc_event *event; + + event = kzalloc(sizeof(*event), GFP_KERNEL); + if (event) { + set_bit(IIDC_EVENT_CRIT_ERR, event->type); + /* report the entire OICR value to AUX driver */ + swap(event->reg, pf->oicr_err_reg); + ice_send_event_to_aux(pf, event); + kfree(event); + } + } + if (test_bit(ICE_FLAG_PLUG_AUX_DEV, pf->flags)) { /* Plug aux device per request */ ice_plug_aux_dev(pf); @@ -3041,17 +3054,9 @@ static irqreturn_t ice_misc_intr(int __always_unused irq, void *data) #define ICE_AUX_CRIT_ERR (PFINT_OICR_PE_CRITERR_M | PFINT_OICR_HMC_ERR_M | PFINT_OICR_PE_PUSH_M) if (oicr & ICE_AUX_CRIT_ERR) { - struct iidc_event *event; - + pf->oicr_err_reg |= oicr; + set_bit(ICE_AUX_ERR_PENDING, pf->state); ena_mask &= ~ICE_AUX_CRIT_ERR; - event = kzalloc(sizeof(*event), GFP_ATOMIC); - if (event) { - set_bit(IIDC_EVENT_CRIT_ERR, event->type); - /* report the entire OICR value to AUX driver */ - event->reg = oicr; - ice_send_event_to_aux(pf, event); - kfree(event); - } } /* Report any remaining unexpected interrupts */ -- cgit v1.2.3 From 5a3156932da06f09953764de113419f254086faf Mon Sep 17 00:00:00 2001 From: Alexander Lobakin Date: Wed, 23 Mar 2022 13:43:53 +0100 Subject: ice: don't allow to run ice_send_event_to_aux() in atomic ctx ice_send_event_to_aux() eventually descends to mutex_lock() (-> might_sched()), so it must not be called under non-task context. However, at least two fixes have happened already for the bug splats occurred due to this function being called from atomic context. To make the emergency landings softer, bail out early when executed in non-task context emitting a warn splat only once. This way we trade some events being potentially lost for system stability and avoid any related hangs and crashes. Fixes: 348048e724a0e ("ice: Implement iidc operations") Signed-off-by: Alexander Lobakin Tested-by: Michal Kubiak Reviewed-by: Maciej Fijalkowski Acked-by: Tony Nguyen Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/ice/ice_idc.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/ethernet/intel/ice/ice_idc.c b/drivers/net/ethernet/intel/ice/ice_idc.c index fc3580167e7b..5559230eff8b 100644 --- a/drivers/net/ethernet/intel/ice/ice_idc.c +++ b/drivers/net/ethernet/intel/ice/ice_idc.c @@ -34,6 +34,9 @@ void ice_send_event_to_aux(struct ice_pf *pf, struct iidc_event *event) { struct iidc_auxiliary_drv *iadrv; + if (WARN_ON_ONCE(!in_task())) + return; + if (!pf->adev) return; -- cgit v1.2.3