summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2020-11-03sctp: Fix COMM_LOST/CANT_STR_ASSOC err reporting on big-endian platformsPetr Malat1-2/+2
Commit 978aa0474115 ("sctp: fix some type cast warnings introduced since very beginning")' broke err reading from sctp_arg, because it reads the value as 32-bit integer, although the value is stored as 16-bit integer. Later this value is passed to the userspace in 16-bit variable, thus the user always gets 0 on big-endian platforms. Fix it by reading the __u16 field of sctp_arg union, as reading err field would produce a sparse warning. Fixes: 978aa0474115 ("sctp: fix some type cast warnings introduced since very beginning") Signed-off-by: Petr Malat <oss@malat.biz> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://lore.kernel.org/r/20201030132633.7045-1-oss@malat.biz Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-02Merge tag 'mac80211-for-net-2020-10-30' of ↵Jakub Kicinski8-48/+93
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== A couple of fixes, for * HE on 2.4 GHz * a few issues syzbot found, but we have many more reports :-( * a regression in nl80211-transported EAPOL frames which had affected a number of users, from Mathy * kernel-doc markings in mac80211, from Mauro * a format argument in reg.c, from Ye Bin ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski13-30/+52
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Incorrect netlink report logic in flowtable and genID. 2) Add a selftest to check that wireguard passes the right sk to ip_route_me_harder, from Jason A. Donenfeld. 3) Pass the actual sk to ip_route_me_harder(), also from Jason. 4) Missing expression validation of updates via nft --check. 5) Update byte and packet counters regardless of whether they match, from Stefano Brivio. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-01ip_tunnel: fix over-mtu packet send fail without TUNNEL_DONT_FRAGMENT flagswenxu1-3/+0
The tunnel device such as vxlan, bareudp and geneve in the lwt mode set the outer df only based TUNNEL_DONT_FRAGMENT. And this was also the behavior for gre device before switching to use ip_md_tunnel_xmit in commit 962924fa2b7a ("ip_gre: Refactor collect metatdata mode tunnel xmit to ip_md_tunnel_xmit") When the ip_gre in lwt mode xmit with ip_md_tunnel_xmi changed the rule and make the discrepancy between handling of DF by different tunnels. So in the ip_md_tunnel_xmit should follow the same rule like other tunnels. Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel") Signed-off-by: wenxu <wenxu@ucloud.cn> Link: https://lore.kernel.org/r/1604028728-31100-1-git-send-email-wenxu@ucloud.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-01Merge tag 'flexible-array-conversions-5.10-rc2' of ↵Linus Torvalds2-3/+4
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull more flexible-array member conversions from Gustavo A. R. Silva: "Replace zero-length arrays with flexible-array members" * tag 'flexible-array-conversions-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: printk: ringbuffer: Replace zero-length array with flexible-array member net/smc: Replace zero-length array with flexible-array member net/mlx5: Replace zero-length array with flexible-array member mei: hw: Replace zero-length array with flexible-array member gve: Replace zero-length array with flexible-array member Bluetooth: btintel: Replace zero-length array with flexible-array member scsi: target: tcmu: Replace zero-length array with flexible-array member ima: Replace zero-length array with flexible-array member enetc: Replace zero-length array with flexible-array member fs: Replace zero-length array with flexible-array member Bluetooth: Replace zero-length array with flexible-array member params: Replace zero-length array with flexible-array member tracepoint: Replace zero-length array with flexible-array member platform/chrome: cros_ec_proto: Replace zero-length array with flexible-array member platform/chrome: cros_ec_commands: Replace zero-length array with flexible-array member mailbox: zynqmp-ipi-message: Replace zero-length array with flexible-array member dmaengine: ti-cppi5: Replace zero-length array with flexible-array member
2020-10-31IPv6: reply ICMP error if the first fragment don't include all headersHangbin Liu2-2/+39
Based on RFC 8200, Section 4.5 Fragment Header: - If the first fragment does not include all headers through an Upper-Layer header, then that fragment should be discarded and an ICMP Parameter Problem, Code 3, message should be sent to the source of the fragment, with the Pointer field set to zero. Checking each packet header in IPv6 fast path will have performance impact, so I put the checking in ipv6_frag_rcv(). As the packet may be any kind of L4 protocol, I only checked some common protocols' header length and handle others by (offset + 1) > skb->len. Also use !(frag_off & htons(IP6_OFFSET)) to catch atomic fragments (fragmented packet with only one fragment). When send ICMP error message, if the 1st truncated fragment is ICMP message, icmp6_send() will break as is_ineligible() return true. So I added a check in is_ineligible() to let fragment packet with nexthdr ICMP but no ICMP header return false. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31net: atm: fix update of position index in lec_seq_nextColin Ian King1-3/+2
The position index in leq_seq_next is not updated when the next entry is fetched an no more entries are available. This causes seq_file to report the following error: "seq_file: buggy .next function lec_seq_next [lec] did not update position index" Fix this by always updating the position index. [ Note: this is an ancient 2002 bug, the sha is from the tglx/history repo ] Fixes 4aea2cbff417 ("[ATM]: Move lan seq_file ops to lec.c [1/3]") Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20201027114925.21843-1-colin.king@canonical.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31netfilter: ipset: Update byte and packet counters regardless of whether they ↵Stefano Brivio1-1/+2
match In ip_set_match_extensions(), for sets with counters, we take care of updating counters themselves by calling ip_set_update_counter(), and of checking if the given comparison and values match, by calling ip_set_match_counter() if needed. However, if a given comparison on counters doesn't match the configured values, that doesn't mean the set entry itself isn't matching. This fix restores the behaviour we had before commit 4750005a85f7 ("netfilter: ipset: Fix "don't update counters" mode when counters used at the matching"), without reintroducing the issue fixed there: back then, mtype_data_match() first updated counters in any case, and then took care of matching on counters. Now, if the IPSET_FLAG_SKIP_COUNTER_UPDATE flag is set, ip_set_update_counter() will anyway skip counter updates if desired. The issue observed is illustrated by this reproducer: ipset create c hash:ip counters ipset add c 192.0.2.1 iptables -I INPUT -m set --match-set c src --bytes-gt 800 -j DROP if we now send packets from 192.0.2.1, bytes and packets counters for the entry as shown by 'ipset list' are always zero, and, no matter how many bytes we send, the rule will never match, because counters themselves are not updated. Reported-by: Mithil Mhatre <mmhatre@redhat.com> Fixes: 4750005a85f7 ("netfilter: ipset: Fix "don't update counters" mode when counters used at the matching") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-31net/smc: Replace zero-length array with flexible-array memberGustavo A. R. Silva1-2/+2
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.9/process/deprecated.html#zero-length-and-one-element-arrays Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-30ip6_tunnel: set inner ipproto before ip6_tnl_encapAlexander Ovechkin1-2/+2
ip6_tnl_encap assigns to proto transport protocol which encapsulates inner packet, but we must pass to set_inner_ipproto protocol of that inner packet. Calling set_inner_ipproto after ip6_tnl_encap might break gso. For example, in case of encapsulating ipv6 packet in fou6 packet, inner_ipproto would be set to IPPROTO_UDP instead of IPPROTO_IPV6. This would lead to incorrect calling sequence of gso functions: ipv6_gso_segment -> udp6_ufo_fragment -> skb_udp_tunnel_segment -> udp6_ufo_fragment instead of: ipv6_gso_segment -> udp6_ufo_fragment -> skb_udp_tunnel_segment -> ip6ip6_gso_segment Fixes: 6c11fbf97e69 ("ip6_tunnel: add MPLS transmit support") Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru> Link: https://lore.kernel.org/r/20201029171012.20904-1-ovov@yandex-team.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30netfilter: nf_tables: missing validation from the abort pathPablo Neira Ayuso2-9/+28
If userspace does not include the trailing end of batch message, then nfnetlink aborts the transaction. This allows to check that ruleset updates trigger no errors. After this patch, invoking this command from the prerouting chain: # nft -c add rule x y fib saddr . oif type local fails since oif is not supported there. This patch fixes the lack of rule validation from the abort/check path to catch configuration errors such as the one above. Fixes: a654de8fdc18 ("netfilter: nf_tables: fix chain dependency validation") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-30netfilter: use actual socket sk rather than skb sk when routing harderJason A. Donenfeld10-18/+20
If netfilter changes the packet mark when mangling, the packet is rerouted using the route_me_harder set of functions. Prior to this commit, there's one big difference between route_me_harder and the ordinary initial routing functions, described in the comment above __ip_queue_xmit(): /* Note: skb->sk can be different from sk, in case of tunnels */ int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl, That function goes on to correctly make use of sk->sk_bound_dev_if, rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a tunnel will receive a packet in ndo_start_xmit with an initial skb->sk. It will make some transformations to that packet, and then it will send the encapsulated packet out of a *new* socket. That new socket will basically always have a different sk_bound_dev_if (otherwise there'd be a routing loop). So for the purposes of routing the encapsulated packet, the routing information as it pertains to the socket should come from that socket's sk, rather than the packet's original skb->sk. For that reason __ip_queue_xmit() and related functions all do the right thing. One might argue that all tunnels should just call skb_orphan(skb) before transmitting the encapsulated packet into the new socket. But tunnels do *not* do this -- and this is wisely avoided in skb_scrub_packet() too -- because features like TSQ rely on skb->destructor() being called when that buffer space is truely available again. Calling skb_orphan(skb) too early would result in buffers filling up unnecessarily and accounting info being all wrong. Instead, additional routing must take into account the new sk, just as __ip_queue_xmit() notes. So, this commit addresses the problem by fishing the correct sk out of state->sk -- it's already set properly in the call to nf_hook() in __ip_local_out(), which receives the sk as part of its normal functionality. So we make sure to plumb state->sk through the various route_me_harder functions, and then make correct use of it following the example of __ip_queue_xmit(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-30netfilter: nftables: fix netlink report logic in flowtable and genidPablo Neira Ayuso1-2/+2
The netlink report should be sent regardless the available listeners. Fixes: 84d7fce69388 ("netfilter: nf_tables: export rule-set generation ID") Fixes: 3b49e2e94e6e ("netfilter: nf_tables: add flow table netlink frontend") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-30mac80211: don't require VHT elements for HE on 2.4 GHzJohannes Berg1-1/+2
After the previous similar bugfix there was another bug here, if no VHT elements were found we also disabled HE. Fix this to disable HE only on the 5 GHz band; on 6 GHz it was already not disabled, and on 2.4 GHz there need (should) not be any VHT. Fixes: 57fa5e85d53c ("mac80211: determine chandef from HE 6 GHz operation") Link: https://lore.kernel.org/r/20201013140156.535a2fc6192f.Id6e5e525a60ac18d245d86f4015f1b271fce6ee6@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30cfg80211: regulatory: Fix inconsistent format argumentYe Bin1-1/+1
Fix follow warning: [net/wireless/reg.c:3619]: (warning) %d in format string (no. 2) requires 'int' but the argument type is 'unsigned int'. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20201009070215.63695-1-yebin10@huawei.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30mac80211: fix kernel-doc markupsMauro Carvalho Chehab1-1/+8
Some identifiers have different names between their prototypes and the kernel-doc markup. Others need to be fixed, as kernel-doc markups should use this format: identifier - description In the specific case of __sta_info_flush(), add a documentation for sta_info_flush(), as this one is the one used outside sta_info.c. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Link: https://lore.kernel.org/r/978d35eef2dc76e21c81931804e4eaefbd6d635e.1603469755.git.mchehab+huawei@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30mac80211: always wind down STA stateJohannes Berg1-0/+18
When (for example) an IBSS station is pre-moved to AUTHORIZED before it's inserted, and then the insertion fails, we don't clean up the fast RX/TX states that might already have been created, since we don't go through all the state transitions again on the way down. Do that, if it hasn't been done already, when the station is freed. I considered only freeing the fast TX/RX state there, but we might add more state so it's more robust to wind down the state properly. Note that we warn if the station was ever inserted, it should have been properly cleaned up in that case, and the driver will probably not like things happening out of order. Reported-by: syzbot+2e293dbd67de2836ba42@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20201009141710.7223b322a955.I95bd08b9ad0e039c034927cce0b75beea38e059b@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30cfg80211: initialize wdev data earlierJohannes Berg3-29/+36
There's a race condition in the netdev registration in that NETDEV_REGISTER actually happens after the netdev is available, and so if we initialize things only there, we might get called with an uninitialized wdev through nl80211 - not using a wdev but using a netdev interface index. I found this while looking into a syzbot report, but it doesn't really seem to be related, and unfortunately there's no repro for it (yet). I can't (yet) explain how it managed to get into cfg80211_release_pmsr() from nl80211_netlink_notify() without the wdev having been initialized, as the latter only iterates the wdevs that are linked into the rdev, which even without the change here happened after init. However, looking at this, it seems fairly clear that the init needs to be done earlier, otherwise we might even re-init on a netns move, when data might still be pending. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20201009135821.fdcbba3aad65.Ie9201d91dbcb7da32318812effdc1561aeaf4cdc@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30mac80211: fix use of skb payload instead of headerJohannes Berg1-13/+24
When ieee80211_skb_resize() is called from ieee80211_build_hdr() the skb has no 802.11 header yet, in fact it consist only of the payload as the ethernet frame is removed. As such, we're using the payload data for ieee80211_is_mgmt(), which is of course completely wrong. This didn't really hurt us because these are always data frames, so we could only have added more tailroom than we needed if we determined it was a management frame and sdata->crypto_tx_tailroom_needed_cnt was false. However, syzbot found that of course there need not be any payload, so we're using at best uninitialized memory for the check. Fix this to pass explicitly the kind of frame that we have instead of checking there, by replacing the "bool may_encrypt" argument with an argument that can carry the three possible states - it's not going to be encrypted, it's a management frame, or it's a data frame (and then we check sdata->crypto_tx_tailroom_needed_cnt). Reported-by: syzbot+32fd1a1bfe355e93f1e2@syzkaller.appspotmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20201009132538.e1fd7f802947.I799b288466ea2815f9d4c84349fae697dca2f189@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30mac80211: fix regression where EAPOL frames were sent in plaintextMathy Vanhoef1-3/+4
When sending EAPOL frames via NL80211 they are treated as injected frames in mac80211. Due to commit 1df2bdba528b ("mac80211: never drop injected frames even if normally not allowed") these injected frames were not assigned a sta context in the function ieee80211_tx_dequeue, causing certain wireless network cards to always send EAPOL frames in plaintext. This may cause compatibility issues with some clients or APs, which for instance can cause the group key handshake to fail and in turn would cause the station to get disconnected. This commit fixes this regression by assigning a sta context in ieee80211_tx_dequeue to injected frames as well. Note that sending EAPOL frames in plaintext is not a security issue since they contain their own encryption and authentication protection. Cc: stable@vger.kernel.org Fixes: 1df2bdba528b ("mac80211: never drop injected frames even if normally not allowed") Reported-by: Thomas Deutschmann <whissi@gentoo.org> Tested-by: Christian Hesse <list@eworm.de> Tested-by: Thomas Deutschmann <whissi@gentoo.org> Signed-off-by: Mathy Vanhoef <Mathy.Vanhoef@kuleuven.be> Link: https://lore.kernel.org/r/20201019160113.350912-1-Mathy.Vanhoef@kuleuven.be Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30Bluetooth: Replace zero-length array with flexible-array memberGustavo A. R. Silva1-1/+2
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-29Merge tag 'net-5.10-rc2' of ↵Linus Torvalds11-24/+56
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Current release regressions: - r8169: fix forced threading conflicting with other shared interrupts; we tried to fix the use of raise_softirq_irqoff from an IRQ handler on RT by forcing hard irqs, but this driver shares legacy PCI IRQs so drop the _irqoff() instead - tipc: fix memory leak caused by a recent syzbot report fix to tipc_buf_append() Current release - bugs in new features: - devlink: Unlock on error in dumpit() and fix some error codes - net/smc: fix null pointer dereference in smc_listen_decline() Previous release - regressions: - tcp: Prevent low rmem stalls with SO_RCVLOWAT. - net: protect tcf_block_unbind with block lock - ibmveth: Fix use of ibmveth in a bridge; the self-imposed filtering to only send legal frames to the hypervisor was too strict - net: hns3: Clear the CMDQ registers before unmapping BAR region; incorrect cleanup order was leading to a crash - bnxt_en - handful of fixes to fixes: - Send HWRM_FUNC_RESET fw command unconditionally, even if there are PCIe errors being reported - Check abort error state in bnxt_open_nic(). - Invoke cancel_delayed_work_sync() for PFs also. - Fix regression in workqueue cleanup logic in bnxt_remove_one(). - mlxsw: Only advertise link modes supported by both driver and device, after removal of 56G support from the driver 56G was not cleared from advertised modes - net/smc: fix suppressed return code Previous release - always broken: - netem: fix zero division in tabledist, caused by integer overflow - bnxt_en: Re-write PCI BARs after PCI fatal error. - cxgb4: set up filter action after rewrites - net: ipa: command payloads already mapped Misc: - s390/ism: fix incorrect system EID, it's okay to change since it was added in current release - vsock: use ns_capable_noaudit() on socket create to suppress false positive audit messages" * tag 'net-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits) r8169: fix issue with forced threading in combination with shared interrupts netem: fix zero division in tabledist ibmvnic: fix ibmvnic_set_mac mptcp: add missing memory scheduling in the rx path tipc: fix memory leak caused by tipc_buf_append() gtp: fix an use-before-init in gtp_newlink() net: protect tcf_block_unbind with block lock ibmveth: Fix use of ibmveth in a bridge. net/sched: act_mpls: Add softdep on mpls_gso.ko ravb: Fix bit fields checking in ravb_hwtstamp_get() devlink: Unlock on error in dumpit() devlink: Fix some error codes chelsio/chtls: fix memory leaks in CPL handlers chelsio/chtls: fix deadlock issue net: hns3: Clear the CMDQ registers before unmapping BAR region bnxt_en: Send HWRM_FUNC_RESET fw command unconditionally. bnxt_en: Check abort error state in bnxt_open_nic(). bnxt_en: Re-write PCI BARs after PCI fatal error. bnxt_en: Invoke cancel_delayed_work_sync() for PFs also. bnxt_en: Fix regression in workqueue cleanup logic in bnxt_remove_one(). ...
2020-10-29netem: fix zero division in tabledistAleksandr Nogikh1-1/+8
Currently it is possible to craft a special netlink RTM_NEWQDISC command that can result in jitter being equal to 0x80000000. It is enough to set the 32 bit jitter to 0x02000000 (it will later be multiplied by 2^6) or just set the 64 bit jitter via TCA_NETEM_JITTER64. This causes an overflow during the generation of uniformly distributed numbers in tabledist(), which in turn leads to division by zero (sigma != 0, but sigma * 2 is 0). The related fragment of code needs 32-bit division - see commit 9b0ed89 ("netem: remove unnecessary 64 bit modulus"), so switching to 64 bit is not an option. Fix the issue by keeping the value of jitter within the range that can be adequately handled by tabledist() - [0;INT_MAX]. As negative std deviation makes no sense, take the absolute value of the passed value and cap it at INT_MAX. Inside tabledist(), switch to unsigned 32 bit arithmetic in order to prevent overflows. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Aleksandr Nogikh <nogikh@google.com> Reported-by: syzbot+ec762a6342ad0d3c0d8f@syzkaller.appspotmail.com Acked-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/20201028170731.1383332-1-aleksandrnogikh@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29mptcp: add missing memory scheduling in the rx pathPaolo Abeni1-0/+10
When moving the skbs from the subflow into the msk receive queue, we must schedule there the required amount of memory. Try to borrow the required memory from the subflow, if needed, so that we leverage the existing TCP heuristic. Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/f6143a6193a083574f11b00dbf7b5ad151bc4ff4.1603810630.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29tipc: fix memory leak caused by tipc_buf_append()Tung Nguyen1-3/+2
Commit ed42989eab57 ("tipc: fix the skb_unshare() in tipc_buf_append()") replaced skb_unshare() with skb_copy() to not reduce the data reference counter of the original skb intentionally. This is not the correct way to handle the cloned skb because it causes memory leak in 2 following cases: 1/ Sending multicast messages via broadcast link The original skb list is cloned to the local skb list for local destination. After that, the data reference counter of each skb in the original list has the value of 2. This causes each skb not to be freed after receiving ACK: tipc_link_advance_transmq() { ... /* release skb */ __skb_unlink(skb, &l->transmq); kfree_skb(skb); <-- memory exists after being freed } 2/ Sending multicast messages via replicast link Similar to the above case, each skb cannot be freed after purging the skb list: tipc_mcast_xmit() { ... __skb_queue_purge(pkts); <-- memory exists after being freed } This commit fixes this issue by using skb_unshare() instead. Besides, to avoid use-after-free error reported by KASAN, the pointer to the fragment is set to NULL before calling skb_unshare() to make sure that the original skb is not freed after freeing the fragment 2 times in case skb_unshare() returns NULL. Fixes: ed42989eab57 ("tipc: fix the skb_unshare() in tipc_buf_append()") Acked-by: Jon Maloy <jmaloy@redhat.com> Reported-by: Thang Hoang Ngo <thang.h.ngo@dektech.com.au> Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://lore.kernel.org/r/20201027032403.1823-1-tung.q.nguyen@dektech.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29xsk: Fix possible memory leak at socket closeMagnus Karlsson2-3/+7
Fix a possible memory leak at xsk socket close that is caused by the refcounting of the umem object being wrong. The reference count of the umem was decremented only after the pool had been freed. Note that if the buffer pool is destroyed, it is important that the umem is destroyed after the pool, otherwise the umem would disappear while the driver is still running. And as the buffer pool needs to be destroyed in a work queue, the umem is also (if its refcount reaches zero) destroyed after the buffer pool in that same work queue. What was missing is that the refcount also needs to be decremented when the pool is not freed and when the pool has not even been created. The first case happens when the refcount of the pool is higher than 1, i.e. it is still being used by some other socket using the same device and queue id. In this case, it is safe to decrement the refcount of the umem outside of the work queue as the umem will never be freed because the refcount of the umem is always greater than or equal to the refcount of the buffer pool. The second case is if the buffer pool has not been created yet, i.e. the socket was closed before it was bound but after the umem was created. In this case, it is safe to destroy the umem outside of the work queue, since there is no pool that can use it by definition. Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem") Reported-by: syzbot+eb71df123dc2be2c1456@syzkaller.appspotmail.com Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/1603801921-2712-1-git-send-email-magnus.karlsson@gmail.com
2020-10-28RDMA: Add rdma_connect_locked()Jason Gunthorpe1-2/+3
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED, either the handler triggers a completion and another thread does rdma_connect() or the handler directly calls rdma_connect(). In all cases rdma_connect() needs to hold the handler_mutex, but when handler's are invoked this is already held by the core code. This causes ULPs using the 2nd method to deadlock. Provide a rdma_connect_locked() and have all ULPs call it from their handlers. Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com Reported-and-tested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Fixes: 2a7cec538169 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state") Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-28net: protect tcf_block_unbind with block lockLeon Romanovsky1-2/+2
The tcf_block_unbind() expects that the caller will take block->cb_lock before calling it, however the code took RTNL lock and dropped cb_lock instead. This causes to the following kernel panic. WARNING: CPU: 1 PID: 13524 at net/sched/cls_api.c:1488 tcf_block_unbind+0x2db/0x420 Modules linked in: mlx5_ib mlx5_core mlxfw ptp pps_core act_mirred act_tunnel_key cls_flower vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad ib_ipoib rdma_cm iw_cm ib_cm ib_uverbs ib_core overlay [last unloaded: mlxfw] CPU: 1 PID: 13524 Comm: test-ecmp-add-v Tainted: G W 5.9.0+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:tcf_block_unbind+0x2db/0x420 Code: ff 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8d bc 24 30 01 00 00 be ff ff ff ff e8 7d 7f 70 00 85 c0 0f 85 7b fd ff ff <0f> 0b e9 74 fd ff ff 48 c7 c7 dc 6a 24 84 e8 02 ec fe fe e9 55 fd RSP: 0018:ffff888117d17968 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88812f713c00 RCX: 1ffffffff0848d5b RDX: 0000000000000001 RSI: ffff88814fbc8130 RDI: ffff888107f2b878 RBP: 1ffff11022fa2f3f R08: 0000000000000000 R09: ffffffff84115a87 R10: fffffbfff0822b50 R11: ffff888107f2b898 R12: ffff88814fbc8000 R13: ffff88812f713c10 R14: ffff888117d17a38 R15: ffff88814fbc80c0 FS: 00007f6593d36740(0000) GS:ffff8882a4f00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005607a00758f8 CR3: 0000000131aea006 CR4: 0000000000170ea0 Call Trace: tc_block_indr_cleanup+0x3e0/0x5a0 ? tcf_block_unbind+0x420/0x420 ? __mutex_unlock_slowpath+0xe7/0x610 flow_indr_dev_unregister+0x5e2/0x930 ? mlx5e_restore_tunnel+0xdf0/0xdf0 [mlx5_core] ? mlx5e_restore_tunnel+0xdf0/0xdf0 [mlx5_core] ? flow_indr_block_cb_alloc+0x3c0/0x3c0 ? mlx5_db_free+0x37c/0x4b0 [mlx5_core] mlx5e_cleanup_rep_tx+0x8b/0xc0 [mlx5_core] mlx5e_detach_netdev+0xe5/0x120 [mlx5_core] mlx5e_vport_rep_unload+0x155/0x260 [mlx5_core] esw_offloads_disable+0x227/0x2b0 [mlx5_core] mlx5_eswitch_disable_locked.cold+0x38e/0x699 [mlx5_core] mlx5_eswitch_disable+0x94/0xf0 [mlx5_core] mlx5_device_disable_sriov+0x183/0x1f0 [mlx5_core] mlx5_core_sriov_configure+0xfd/0x230 [mlx5_core] sriov_numvfs_store+0x261/0x2f0 ? sriov_drivers_autoprobe_store+0x110/0x110 ? sysfs_file_ops+0x170/0x170 ? sysfs_file_ops+0x117/0x170 ? sysfs_file_ops+0x170/0x170 kernfs_fop_write+0x1ff/0x3f0 ? rcu_read_lock_any_held+0x6e/0x90 vfs_write+0x1f3/0x620 ksys_write+0xf9/0x1d0 ? __x64_sys_read+0xb0/0xb0 ? lockdep_hardirqs_on_prepare+0x273/0x3f0 ? syscall_enter_from_user_mode+0x1d/0x50 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 <...> ---[ end trace bfdd028ada702879 ]--- Fixes: 0fdcf78d5973 ("net: use flow_indr_dev_setup_offload()") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20201026123327.1141066-1-leon@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-28net/sched: act_mpls: Add softdep on mpls_gso.koGuillaume Nault1-0/+1
TCA_MPLS_ACT_PUSH and TCA_MPLS_ACT_MAC_PUSH might be used on gso packets. Such packets will thus require mpls_gso.ko for segmentation. v2: Drop dependency on CONFIG_NET_MPLS_GSO in Kconfig (from Jakub and David). Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC") Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/1f6cab15bbd15666795061c55563aaf6a386e90e.1603708007.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-28devlink: Unlock on error in dumpit()Dan Carpenter1-2/+4
This needs to unlock before returning. Fixes: 544e7c33ec2f ("net: devlink: Add support for port regions") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20201026080127.GB1628785@mwanda Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-28devlink: Fix some error codesDan Carpenter1-9/+15
These paths don't set the error codes. It's especially important in devlink_nl_region_notify_build() where it leads to a NULL dereference in the caller. Fixes: 544e7c33ec2f ("net: devlink: Add support for port regions") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20201026080059.GA1628785@mwanda Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-27net/smc: fix suppressed return codeKarsten Graul1-2/+5
The patch that repaired the invalid return code in smcd_new_buf_create() missed to take care of errno ENOSPC which has a special meaning that no more DMBEs can be registered on the device. Fix that by keeping this errno value during the translation of the return code. Fixes: 6b1bbf94ab36 ("net/smc: fix invalid return code in smcd_new_buf_create()") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-27net/smc: fix null pointer dereference in smc_listen_decline()Karsten Graul1-3/+4
smc_listen_work() calls smc_listen_decline() on label out_decl, providing the ini pointer variable. But this pointer can still be null when the label out_decl is reached. Fix this by checking the ini variable in smc_listen_work() and call smc_listen_decline() with the result directly. Fixes: a7c9c5f4af7f ("net/smc: CLC accept / confirm V2") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-27vsock: use ns_capable_noaudit() on socket createJeff Vander Stoep1-1/+1
During __vsock_create() CAP_NET_ADMIN is used to determine if the vsock_sock->trusted should be set to true. This value is used later for determing if a remote connection should be allowed to connect to a restricted VM. Unfortunately, if the caller doesn't have CAP_NET_ADMIN, an audit message such as an selinux denial is generated even if the caller does not want a trusted socket. Logging errors on success is confusing. To avoid this, switch the capable(CAP_NET_ADMIN) check to the noaudit version. Reported-by: Roman Kiryanov <rkir@google.com> https://android-review.googlesource.com/c/device/generic/goldfish/+/1468545/ Signed-off-by: Jeff Vander Stoep <jeffv@google.com> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/r/20201023143757.377574-1-jeffv@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-25mm: remove kzfree() compatibility definitionEric Biggers1-2/+2
Commit 453431a54934 ("mm, treewide: rename kzfree() to kfree_sensitive()") renamed kzfree() to kfree_sensitive(), but it left a compatibility definition of kzfree() to avoid being too disruptive. Since then a few more instances of kzfree() have slipped in. Just get rid of them and remove the compatibility definition once and for all. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-24random32: add noise from network and scheduling activityWilly Tarreau1-0/+4
With the removal of the interrupt perturbations in previous random32 change (random32: make prandom_u32() output unpredictable), the PRNG has become 100% deterministic again. While SipHash is expected to be way more robust against brute force than the previous Tausworthe LFSR, there's still the risk that whoever has even one temporary access to the PRNG's internal state is able to predict all subsequent draws till the next reseed (roughly every minute). This may happen through a side channel attack or any data leak. This patch restores the spirit of commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") in that it will perturb the internal PRNG's statee using externally collected noise, except that it will not pick that noise from the random pool's bits nor upon interrupt, but will rather combine a few elements along the Tx path that are collectively hard to predict, such as dev, skb and txq pointers, packet length and jiffies values. These ones are combined using a single round of SipHash into a single long variable that is mixed with the net_rand_state upon each invocation. The operation was inlined because it produces very small and efficient code, typically 3 xor, 2 add and 2 rol. The performance was measured to be the same (even very slightly better) than before the switch to SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC (i40e), the connection rate dropped from 556k/s to 555k/s while the SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps. Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ Cc: George Spelvin <lkml@sdf.org> Cc: Amit Klein <aksecurity@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tytso@mit.edu Cc: Florian Westphal <fw@strlen.de> Cc: Marc Plumb <lkml.mplumb@gmail.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-24tcp: Prevent low rmem stalls with SO_RCVLOWAT.Arjun Roy2-1/+4
With SO_RCVLOWAT, under memory pressure, it is possible to enter a state where: 1. We have not received enough bytes to satisfy SO_RCVLOWAT. 2. We have not entered buffer pressure (see tcp_rmem_pressure()). 3. But, we do not have enough buffer space to accept more packets. In this case, we advertise 0 rwnd (due to #3) but the application does not drain the receive queue (no wakeup because of #1 and #2) so the flow stalls. Modify the heuristic for SO_RCVLOWAT so that, if we are advertising rwnd<=rcv_mss, force a wakeup to prevent a stall. Without this patch, setting tcp_rmem to 6143 and disabling TCP autotune causes a stalled flow. With this patch, no stall occurs. This is with RPC-style traffic with large messages. Fixes: 03f45c883c6f ("tcp: avoid extra wakeups for SO_RCVLOWAT users") Signed-off-by: Arjun Roy <arjunroy@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-23Merge tag 'net-5.10-rc1' of ↵Linus Torvalds28-127/+208
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Cross-tree/merge window issues: - rtl8150: don't incorrectly assign random MAC addresses; fix late in the 5.9 cycle started depending on a return code from a function which changed with the 5.10 PR from the usb subsystem Current release regressions: - Revert "virtio-net: ethtool configurable RXCSUM", it was causing crashes at probe when control vq was not negotiated/available Previous release regressions: - ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO bus, only first device would be probed correctly - nexthop: Fix performance regression in nexthop deletion by effectively switching from recently added synchronize_rcu() to synchronize_rcu_expedited() - netsec: ignore 'phy-mode' device property on ACPI systems; the property is not populated correctly by the firmware, but firmware configures the PHY so just keep boot settings Previous releases - always broken: - tcp: fix to update snd_wl1 in bulk receiver fast path, addressing bulk transfers getting "stuck" - icmp: randomize the global rate limiter to prevent attackers from getting useful signal - r8169: fix operation under forced interrupt threading, make the driver always use hard irqs, even on RT, given the handler is light and only wants to schedule napi (and do so through a _irqoff() variant, preferably) - bpf: Enforce pointer id generation for all may-be-null register type to avoid pointers erroneously getting marked as null-checked - tipc: re-configure queue limit for broadcast link - net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels - fix various issues in chelsio inline tls driver Misc: - bpf: improve just-added bpf_redirect_neigh() helper api to support supplying nexthop by the caller - in case BPF program has already done a lookup we can avoid doing another one - remove unnecessary break statements - make MCTCP not select IPV6, but rather depend on it" * tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits) tcp: fix to update snd_wl1 in bulk receiver fast path net: Properly typecast int values to set sk_max_pacing_rate netfilter: nf_fwd_netdev: clear timestamp in forwarding path ibmvnic: save changed mac address to adapter->mac_addr selftests: mptcp: depends on built-in IPv6 Revert "virtio-net: ethtool configurable RXCSUM" rtnetlink: fix data overflow in rtnl_calcit() net: ethernet: mtk-star-emac: select REGMAP_MMIO net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling ether_setup net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device bpf, libbpf: Guard bpf inline asm from bpf_tail_call_static bpf, selftests: Extend test_tc_redirect to use modified bpf_redirect_neigh() bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop mptcp: depends on IPV6 but not as a module sfc: move initialisation of efx->filter_sem to efx_init_struct() mpls: load mpls_gso after mpls_iptunnel net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action() net: dsa: bcm_sf2: make const array static, makes object smaller mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it ...
2020-10-23net: xfrm: fix a race condition during allocing spizhuoliang zhang1-3/+5
we found that the following race condition exists in xfrm_alloc_userspi flow: user thread state_hash_work thread ---- ---- xfrm_alloc_userspi() __find_acq_core() /*alloc new xfrm_state:x*/ xfrm_state_alloc() /*schedule state_hash_work thread*/ xfrm_hash_grow_check() xfrm_hash_resize() xfrm_alloc_spi /*hold lock*/ x->id.spi = htonl(spi) spin_lock_bh(&net->xfrm.xfrm_state_lock) /*waiting lock release*/ xfrm_hash_transfer() spin_lock_bh(&net->xfrm.xfrm_state_lock) /*add x into hlist:net->xfrm.state_byspi*/ hlist_add_head_rcu(&x->byspi) spin_unlock_bh(&net->xfrm.xfrm_state_lock) /*add x into hlist:net->xfrm.state_byspi 2 times*/ hlist_add_head_rcu(&x->byspi) 1. a new state x is alloced in xfrm_state_alloc() and added into the bydst hlist in __find_acq_core() on the LHS; 2. on the RHS, state_hash_work thread travels the old bydst and tranfers every xfrm_state (include x) into the new bydst hlist and new byspi hlist; 3. user thread on the LHS gets the lock and adds x into the new byspi hlist again. So the same xfrm_state (x) is added into the same list_hash (net->xfrm.state_byspi) 2 times that makes the list_hash become an inifite loop. To fix the race, x->id.spi = htonl(spi) in the xfrm_alloc_spi() is moved to the back of spin_lock_bh, sothat state_hash_work thread no longer add x which id.spi is zero into the hash_list. Fixes: f034b5d4efdf ("[XFRM]: Dynamic xfrm_state hash table sizing.") Signed-off-by: zhuoliang zhang <zhuoliang.zhang@mediatek.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-10-22tcp: fix to update snd_wl1 in bulk receiver fast pathNeal Cardwell1-0/+2
In the header prediction fast path for a bulk data receiver, if no data is newly acknowledged then we do not call tcp_ack() and do not call tcp_ack_update_window(). This means that a bulk receiver that receives large amounts of data can have the incoming sequence numbers wrap, so that the check in tcp_may_update_window fails: after(ack_seq, tp->snd_wl1) If the incoming receive windows are zero in this state, and then the connection that was a bulk data receiver later wants to send data, that connection can find itself persistently rejecting the window updates in incoming ACKs. This means the connection can persistently fail to discover that the receive window has opened, which in turn means that the connection is unable to send anything, and the connection's sending process can get permanently "stuck". The fix is to update snd_wl1 in the header prediction fast path for a bulk data receiver, so that it keeps up and does not see wrapping problems. This fix is based on a very nice and thorough analysis and diagnosis by Apollon Oikonomopoulos (see link below). This is a stable candidate but there is no Fixes tag here since the bug predates current git history. Just for fun: looks like the bug dates back to when header prediction was added in Linux v2.1.8 in Nov 1996. In that version tcp_rcv_established() was added, and the code only updates snd_wl1 in tcp_ack(), and in the new "Bulk data transfer: receiver" code path it does not call tcp_ack(). This fix seems to apply cleanly at least as far back as v3.2. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reported-by: Apollon Oikonomopoulos <apoikos@dmesg.gr> Tested-by: Apollon Oikonomopoulos <apoikos@dmesg.gr> Link: https://www.spinics.net/lists/netdev/msg692430.html Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201022143331.1887495-1-ncardwell.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22net: Properly typecast int values to set sk_max_pacing_rateKe Li2-2/+3
In setsockopt(SO_MAX_PACING_RATE) on 64bit systems, sk_max_pacing_rate, after extended from 'u32' to 'unsigned long', takes unintentionally hiked value whenever assigned from an 'int' value with MSB=1, due to binary sign extension in promoting s32 to u64, e.g. 0x80000000 becomes 0xFFFFFFFF80000000. Thus inflated sk_max_pacing_rate causes subsequent getsockopt to return ~0U unexpectedly. It may also result in increased pacing rate. Fix by explicitly casting the 'int' value to 'unsigned int' before assigning it to sk_max_pacing_rate, for zero extension to happen. Fixes: 76a9ebe811fb ("net: extend sk_pacing_rate to unsigned long") Signed-off-by: Ji Li <jli@akamai.com> Signed-off-by: Ke Li <keli@akamai.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201022064146.79873-1-keli@akamai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski10-18/+30
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Update debugging in IPVS tcp protocol handler to make it easier to understand, from longguang.yue 2) Update TCP tracker to deal with keepalive packet after re-registration, from Franceso Ruggeri. 3) Missing IP6SKB_FRAGMENTED from netfilter fragment reassembly, from Georg Kohmann. 4) Fix bogus packet drop in ebtables nat extensions, from Thimothee Cocault. 5) Fix typo in flowtable documentation. 6) Reset skb timestamp in nft_fwd_netdev. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski1-59/+99
Daniel Borkmann says: ==================== pull-request: bpf 2020-10-22 1) Fix enforcing NULL check in verifier for new helper return types of RET_PTR_TO_{BTF_ID,MEM_OR_BTF_ID}_OR_NULL, from Martin KaFai Lau. 2) Fix bpf_redirect_neigh() helper API before it becomes frozen by adding nexthop information as argument, from Toke Høiland-Jørgensen. 3) Guard & fix compilation of bpf_tail_call_static() when __bpf__ arch is not defined by compiler or clang too old, from Daniel Borkmann. 4) Remove misplaced break after return in attach_type_to_prog_type(), from Tom Rix. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22Merge tag 'nfsd-5.10' of git://linux-nfs.org/~bfields/linuxLinus Torvalds7-25/+85
Pull nfsd updates from Bruce Fields: "The one new feature this time, from Anna Schumaker, is READ_PLUS, which has the same arguments as READ but allows the server to return an array of data and hole extents. Otherwise it's a lot of cleanup and bugfixes" * tag 'nfsd-5.10' of git://linux-nfs.org/~bfields/linux: (43 commits) NFSv4.2: Fix NFS4ERR_STALE error when doing inter server copy SUNRPC: fix copying of multiple pages in gss_read_proxy_verf() sunrpc: raise kernel RPC channel buffer size svcrdma: fix bounce buffers for unaligned offsets and multiple pages nfsd: remove unneeded break net/sunrpc: Fix return value for sysctl sunrpc.transports NFSD: Encode a full READ_PLUS reply NFSD: Return both a hole and a data segment NFSD: Add READ_PLUS hole segment encoding NFSD: Add READ_PLUS data support NFSD: Hoist status code encoding into XDR encoder functions NFSD: Map nfserr_wrongsec outside of nfsd_dispatch NFSD: Remove the RETURN_STATUS() macro NFSD: Call NFSv2 encoders on error returns NFSD: Fix .pc_release method for NFSv2 NFSD: Remove vestigial typedefs NFSD: Refactor nfsd_dispatch() error paths NFSD: Clean up nfsd_dispatch() variables NFSD: Clean up stale comments in nfsd_dispatch() NFSD: Clean up switch statement in nfsd_dispatch() ...
2020-10-22Merge tag '9p-for-5.10-rc1' of git://github.com/martinetd/linuxLinus Torvalds2-3/+3
Pull 9p updates from Dominique Martinet: "A couple of small fixes (loff_t overflow on 32bit, syzbot uninitialized variable warning) and code cleanup (xen)" * tag '9p-for-5.10-rc1' of git://github.com/martinetd/linux: net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid 9p/xen: Fix format argument warning 9P: Cast to loff_t before multiplying
2020-10-22netfilter: nf_fwd_netdev: clear timestamp in forwarding pathPablo Neira Ayuso2-0/+2
Similar to 7980d2eabde8 ("ipvs: clear skb->tstamp in forwarding path"). fq qdisc requires tstamp to be cleared in forwarding path. Fixes: 8203e2d844d3 ("net: clear skb->tstamp in forwarding paths") Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC") Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-22rtnetlink: fix data overflow in rtnl_calcit()Di Zhu1-7/+6
"ip addr show" command execute error when we have a physical network card with a large number of VFs The return value of if_nlmsg_size() in rtnl_calcit() will exceed range of u16 data type when any network cards has a larger number of VFs. rtnl_vfinfo_size() will significant increase needed dump size when the value of num_vfs is larger. Eventually we get a wrong value of min_ifinfo_dump_size because of overflow which decides the memory size needed by netlink dump and netlink_dump() will return -EMSGSIZE because of not enough memory was allocated. So fix it by promoting min_dump_alloc data type to u32 to avoid whole netlink message size overflow and it's also align with the data type of struct netlink_callback{}.min_dump_alloc which is assigned by return value of rtnl_calcit() Signed-off-by: Di Zhu <zhudi21@huawei.com> Link: https://lore.kernel.org/r/20201021020053.1401-1-zhudi21@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22bpf: Fix bpf_redirect_neigh helper api to support supplying nexthopToke Høiland-Jørgensen1-59/+99
Based on the discussion in [0], update the bpf_redirect_neigh() helper to accept an optional parameter specifying the nexthop information. This makes it possible to combine bpf_fib_lookup() and bpf_redirect_neigh() without incurring a duplicate FIB lookup - since the FIB lookup helper will return the nexthop information even if no neighbour is present, this can simply be passed on to bpf_redirect_neigh() if bpf_fib_lookup() returns BPF_FIB_LKUP_RET_NO_NEIGH. Thus fix & extend it before helper API is frozen. [0] https://lore.kernel.org/bpf/393e17fc-d187-3a8d-2f0d-a627c7c63fca@iogearbox.net/ Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/bpf/160322915615.32199.1187570224032024535.stgit@toke.dk
2020-10-21Merge tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds3-35/+213
Pull ceph updates from Ilya Dryomov: - a patch that removes crush_workspace_mutex (myself). CRUSH computations are no longer serialized and can run in parallel. - a couple new filesystem client metrics for "ceph fs top" command (Xiubo Li) - a fix for a very old messenger bug that affected the filesystem, marked for stable (myself) - assorted fixups and cleanups throughout the codebase from Jeff and others. * tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client: (27 commits) libceph: clear con->out_msg on Policy::stateful_server faults libceph: format ceph_entity_addr nonces as unsigned libceph: fix ENTITY_NAME format suggestion libceph: move a dout in queue_con_delay() ceph: comment cleanups and clarifications ceph: break up send_cap_msg ceph: drop separate mdsc argument from __send_cap ceph: promote to unsigned long long before shifting ceph: don't SetPageError on readpage errors ceph: mark ceph_fmt_xattr() as printf-like for better type checking ceph: fold ceph_update_writeable_page into ceph_write_begin ceph: fold ceph_sync_writepages into writepage_nounlock ceph: fold ceph_sync_readpages into ceph_readpage ceph: don't call ceph_update_writeable_page from page_mkwrite ceph: break out writeback of incompatible snap context to separate function ceph: add a note explaining session reject error string libceph: switch to the new "osd blocklist add" command libceph, rbd, ceph: "blacklist" -> "blocklist" ceph: have ceph_writepages_start call pagevec_lookup_range_tag ceph: use kill_anon_super helper ...
2020-10-21mptcp: depends on IPV6 but not as a moduleMatthieu Baerts1-1/+1
Like TCP, MPTCP cannot be compiled as a module. Obviously, MPTCP IPv6' support also depends on CONFIG_IPV6. But not all functions from IPv6 code are exported. To simplify the code and reduce modifications outside MPTCP, it was decided from the beginning to support MPTCP with IPv6 only if CONFIG_IPV6 was built inlined. That's also why CONFIG_MPTCP_IPV6 was created. More modifications are needed to support CONFIG_IPV6=m. Even if it was not explicit, until recently, we were forcing CONFIG_IPV6 to be built-in because we had "select IPV6" in Kconfig. Now that we have "depends on IPV6", we have to explicitly set "IPV6=y" to force CONFIG_IPV6 not to be built as a module. In other words, we can now only have CONFIG_MPTCP_IPV6=y if CONFIG_IPV6=y. Note that the new dependency might hide the fact IPv6 is not supported in MPTCP even if we have CONFIG_IPV6=m. But selecting IPV6 like we did before was forcing it to be built-in while it was maybe not what the user wants. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 010b430d5df5 ("mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it") Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20201021105154.628257-1-matthieu.baerts@tessares.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>