summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2022-10-13io_uring/af_unix: defer registered files gc to io_uring releasePavel Begunkov1-0/+20
Instead of putting io_uring's registered files in unix_gc() we want it to be done by io_uring itself. The trick here is to consider io_uring registered files for cycle detection but not actually putting them down. Because io_uring can't register other ring instances, this will remove all refs to the ring file triggering the ->release path and clean up with io_ring_ctx_free(). Cc: stable@vger.kernel.org Fixes: 6b06314c47e1 ("io_uring: add file set registration") Reported-and-tested-by: David Bouman <dbouman03@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> [axboe: add kerneldoc comment to skb, fold in skb leak fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-07Merge tag 'pull-d_path' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs d_path updates from Al Viro. * tag 'pull-d_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: d_path.c: typo fix... dynamic_dname(): drop unused dentry argument
2022-10-04Merge tag 'net-next-6.1' of ↵Linus Torvalds358-16880/+10083
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Introduce and use a single page frag cache for allocating small skb heads, clawing back the 10-20% performance regression in UDP flood test from previous fixes. - Run packets which already went thru HW coalescing thru SW GRO. This significantly improves TCP segment coalescing and simplifies deployments as different workloads benefit from HW or SW GRO. - Shrink the size of the base zero-copy send structure. - Move TCP init under a new slow / sleepable version of DO_ONCE(). BPF: - Add BPF-specific, any-context-safe memory allocator. - Add helpers/kfuncs for PKCS#7 signature verification from BPF programs. - Define a new map type and related helpers for user space -> kernel communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF). - Allow targeting BPF iterators to loop through resources of one task/thread. - Add ability to call selected destructive functions. Expose crash_kexec() to allow BPF to trigger a kernel dump. Use CAP_SYS_BOOT check on the loading process to judge permissions. - Enable BPF to collect custom hierarchical cgroup stats efficiently by integrating with the rstat framework. - Support struct arguments for trampoline based programs. Only structs with size <= 16B and x86 are supported. - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping sockets (instead of just TCP and UDP sockets). - Add a helper for accessing CLOCK_TAI for time sensitive network related programs. - Support accessing network tunnel metadata's flags. - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open. - Add support for writing to Netfilter's nf_conn:mark. Protocols: - WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation (MLO) work (802.11be, WiFi 7). - vsock: improve support for SO_RCVLOWAT. - SMC: support SO_REUSEPORT. - Netlink: define and document how to use netlink in a "modern" way. Support reporting missing attributes via extended ACK. - IPSec: support collect metadata mode for xfrm interfaces. - TCPv6: send consistent autoflowlabel in SYN_RECV state and RST packets. - TCP: introduce optional per-netns connection hash table to allow better isolation between namespaces (opt-in, at the cost of memory and cache pressure). - MPTCP: support TCP_FASTOPEN_CONNECT. - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior. - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets. - Open vSwitch: - Allow specifying ifindex of new interfaces. - Allow conntrack and metering in non-initial user namespace. - TLS: support the Korean ARIA-GCM crypto algorithm. - Remove DECnet support. Driver API: - Allow selecting the conduit interface used by each port in DSA switches, at runtime. - Ethernet Power Sourcing Equipment and Power Device support. - Add tc-taprio support for queueMaxSDU parameter, i.e. setting per traffic class max frame size for time-based packet schedules. - Support PHY rate matching - adapting between differing host-side and link-side speeds. - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode. - Validate OF (device tree) nodes for DSA shared ports; make phylink-related properties mandatory on DSA and CPU ports. Enforcing more uniformity should allow transitioning to phylink. - Require that flash component name used during update matches one of the components for which version is reported by info_get(). - Remove "weight" argument from driver-facing NAPI API as much as possible. It's one of those magic knobs which seemed like a good idea at the time but is too indirect to use in practice. - Support offload of TLS connections with 256 bit keys. New hardware / drivers: - Ethernet: - Microchip KSZ9896 6-port Gigabit Ethernet Switch - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs - Analog Devices ADIN1110 and ADIN2111 industrial single pair Ethernet (10BASE-T1L) MAC+PHY. - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP). - Ethernet SFPs / modules: - RollBall / Hilink / Turris 10G copper SFPs - HALNy GPON module - WiFi: - CYW43439 SDIO chipset (brcmfmac) - CYW89459 PCIe chipset (brcmfmac) - BCM4378 on Apple platforms (brcmfmac) Drivers: - CAN: - gs_usb: HW timestamp support - Ethernet PHYs: - lan8814: cable diagnostics - Ethernet NICs: - Intel (100G): - implement control of FCS/CRC stripping - port splitting via devlink - L2TPv3 filtering offload - nVidia/Mellanox: - tunnel offload for sub-functions - MACSec offload, w/ Extended packet number and replay window offload - significantly restructure, and optimize the AF_XDP support, align the behavior with other vendors - Huawei: - configuring DSCP map for traffic class selection - querying standard FEC statistics - querying SerDes lane number via ethtool - Marvell/Cavium: - egress priority flow control - MACSec offload - AMD/SolarFlare: - PTP over IPv6 and raw Ethernet - small / embedded: - ax88772: convert to phylink (to support SFP cages) - altera: tse: convert to phylink - ftgmac100: support fixed link - enetc: standard Ethtool counters - macb: ZynqMP SGMII dynamic configuration support - tsnep: support multi-queue and use page pool - lan743x: Rx IP & TCP checksum offload - igc: add xdp frags support to ndo_xdp_xmit - Ethernet high-speed switches: - Marvell (prestera): - support SPAN port features (traffic mirroring) - nexthop object offloading - Microchip (sparx5): - multicast forwarding offload - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets) - Ethernet embedded switches: - Marvell (mv88e6xxx): - support RGMII cmode - NXP (felix): - standardized ethtool counters - Microchip (lan966x): - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets) - traffic policing and mirroring - link aggregation / bonding offload - QUSGMII PHY mode support - Qualcomm 802.11ax WiFi (ath11k): - cold boot calibration support on WCN6750 - support to connect to a non-transmit MBSSID AP profile - enable remain-on-channel support on WCN6750 - Wake-on-WLAN support for WCN6750 - support to provide transmit power from firmware via nl80211 - support to get power save duration for each client - spectral scan support for 160 MHz - MediaTek WiFi (mt76): - WiFi-to-Ethernet bridging offload for MT7986 chips - RealTek WiFi (rtw89): - P2P support" * tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits) eth: pse: add missing static inlines once: rename _SLOW to _SLEEPABLE net: pse-pd: add regulator based PSE driver dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller ethtool: add interface to interact with Ethernet Power Equipment net: mdiobus: search for PSE nodes by parsing PHY nodes. net: mdiobus: fwnode_mdiobus_register_phy() rework error handling net: add framework to support Ethernet PSE and PDs devices dt-bindings: net: phy: add PoDL PSE property net: marvell: prestera: Propagate nh state from hw to kernel net: marvell: prestera: Add neighbour cache accounting net: marvell: prestera: add stub handler neighbour events net: marvell: prestera: Add heplers to interact with fib_notifier_info net: marvell: prestera: Add length macros for prestera_ip_addr net: marvell: prestera: add delayed wq and flush wq on deinit net: marvell: prestera: Add strict cleanup of fib arbiter net: marvell: prestera: Add cleanup of allocated fib_nodes net: marvell: prestera: Add router nexthops ABI eth: octeon: fix build after netif_napi_add() changes net/mlx5: E-Switch, Return EBUSY if can't get mode lock ...
2022-10-04Merge tag 'nfsd-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds2-17/+41
Pull nfsd updates from Chuck Lever: "This release is mostly bug fixes, clean-ups, and optimizations. One notable set of fixes addresses a subtle buffer overflow issue that occurs if a small RPC Call message arrives in an oversized RPC record. This is only possible on a framed RPC transport such as TCP. Because NFSD shares the receive and send buffers in one set of pages, an oversized RPC record steals pages from the send buffer that will be used to construct the RPC Reply message. NFSD must not assume that a full-sized buffer is always available to it; otherwise, it will walk off the end of the send buffer while constructing its reply. In this release, we also introduce the ability for the server to wait a moment for clients to return delegations before it responds with NFS4ERR_DELAY. This saves a retransmit and a network round- trip when a delegation recall is needed. This work will be built upon in future releases. The NFS server adds another shrinker to its collection. Because courtesy clients can linger for quite some time, they might be freeable when the server host comes under memory pressure. A new shrinker has been added that releases courtesy client resources during low memory scenarios. Lastly, of note: the maximum number of operations per NFSv4 COMPOUND that NFSD can handle is increased from 16 to 50. There are NFSv4 client implementations that need more than 16 to successfully perform a mount operation that uses a pathname with many components" * tag 'nfsd-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (53 commits) nfsd: extra checks when freeing delegation stateids nfsd: make nfsd4_run_cb a bool return function nfsd: fix comments about spinlock handling with delegations nfsd: only fill out return pointer on success in nfsd4_lookup_stateid NFSD: fix use-after-free on source server when doing inter-server copy NFSD: Cap rsize_bop result based on send buffer size NFSD: Rename the fields in copy_stateid_t nfsd: use DEFINE_SHOW_ATTRIBUTE to define nfsd_file_cache_stats_fops nfsd: use DEFINE_SHOW_ATTRIBUTE to define nfsd_reply_cache_stats_fops nfsd: use DEFINE_SHOW_ATTRIBUTE to define client_info_fops nfsd: use DEFINE_SHOW_ATTRIBUTE to define export_features_fops and supported_enctypes_fops nfsd: use DEFINE_PROC_SHOW_ATTRIBUTE to define nfsd_proc_ops NFSD: Pack struct nfsd4_compoundres NFSD: Remove unused nfsd4_compoundargs::cachetype field NFSD: Remove "inline" directives on op_rsize_bop helpers NFSD: Clean up nfs4svc_encode_compoundres() SUNRPC: Fix typo in xdr_buf_subsegment's kdoc comment NFSD: Clean up WRITE arg decoders NFSD: Use xdr_inline_decode() to decode NFSv3 symlinks NFSD: Refactor common code out of dirlist helpers ...
2022-10-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski14-53/+88
Merge in the left-over fixes before the net-next pull-request. Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c ae3ed15da588 ("net: ethernet: mtk_eth_soc: fix state in __mtk_foe_entry_clear") 9d8cb4c096ab ("net: ethernet: mtk_eth_soc: add foe_entry_size to mtk_eth_soc") https://lore.kernel.org/all/6cb6893b-4921-a068-4c30-1109795110bb@tessares.net/ kernel/bpf/helpers.c 8addbfc7b308 ("bpf: Gate dynptr API behind CAP_BPF") 5679ff2f138f ("bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF") 8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types") https://lore.kernel.org/all/20221003201957.13149-1-daniel@iogearbox.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-04once: rename _SLOW to _SLEEPABLEJason A. Donenfeld1-2/+2
The _SLOW designation wasn't really descriptive of anything. This is meant to be called from process context when it's possible to sleep. So name this more aptly _SLEEPABLE, which better fits its intended use. Fixes: 62c07983bef9 ("once: add DO_ONCE_SLOW() for sleepable contexts") Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221003181413.1221968-1-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-04ethtool: add interface to interact with Ethernet Power EquipmentOleksij Rempel5-1/+209
Add interface to support Power Sourcing Equipment. At current step it provides generic way to address all variants of PSE devices as defined in IEEE 802.3-2018 but support only objects specified for IEEE 802.3-2018 104.4 PoDL Power Sourcing Equipment (PSE). Currently supported and mandatory objects are: IEEE 802.3-2018 30.15.1.1.3 aPoDLPSEPowerDetectionStatus IEEE 802.3-2018 30.15.1.1.2 aPoDLPSEAdminState IEEE 802.3-2018 30.15.1.2.1 acPoDLPSEAdminControl This is minimal interface needed to control PSE on each separate ethernet port but it provides not all mandatory objects specified in IEEE 802.3-2018. Since "PoDL PSE" and "PSE" have similar names, but some different values I decide to not merge them and keep separate naming schema. This should allow as to be as close to IEEE 802.3 spec as possible and avoid name conflicts in the future. This implementation is connected to PHYs instead of MACs because PSE auto classification can potentially interfere with PHY auto negotiation. So, may be some extra PHY related initialization will be needed. With WIP version of ethtools interaction with PSE capable link looks as following: $ ip l ... 5: t1l1@eth0: <BROADCAST,MULTICAST> .. ... $ ethtool --show-pse t1l1 PSE attributs for t1l1: PoDL PSE Admin State: disabled PoDL PSE Power Detection Status: disabled $ ethtool --set-pse t1l1 podl-pse-admin-control enable $ ethtool --show-pse t1l1 PSE attributs for t1l1: PoDL PSE Admin State: enabled PoDL PSE Power Detection Status: delivering power Signed-off-by: kernel test robot <lkp@intel.com> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-04Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski3-27/+26
Daniel Borkmann says: ==================== pull-request: bpf 2022-10-03 We've added 10 non-merge commits during the last 23 day(s) which contain a total of 14 files changed, 130 insertions(+), 69 deletions(-). The main changes are: 1) Fix dynptr helper API to gate behind CAP_BPF given it was not intended for unprivileged BPF programs, from Kumar Kartikeya Dwivedi. 2) Fix need_wakeup flag inheritance from umem buffer pool for shared xsk sockets, from Jalal Mostafa. 3) Fix truncated last_member_type_id in btf_struct_resolve() which had a wrong storage type, from Lorenz Bauer. 4) Fix xsk back-pressure mechanism on tx when amount of produced descriptors to CQ is lower than what was grabbed from xsk tx ring, from Maciej Fijalkowski. 5) Fix wrong cgroup attach flags being displayed to effective progs, from Pu Lehui. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: xsk: Inherit need_wakeup flag for shared sockets bpf: Gate dynptr API behind CAP_BPF selftests/bpf: Adapt cgroup effective query uapi change bpftool: Fix wrong cgroup attach flags being assigned to effective progs bpf, cgroup: Reject prog_attach_flags array when effective query bpf: Ensure correct locking around vulnerable function find_vpid() bpf: btf: fix truncated last_member_type_id in btf_struct_resolve selftests/xsk: Add missing close() on netns fd xsk: Fix backpressure mechanism on Tx MAINTAINERS: Add include/linux/tnum.h to BPF CORE ==================== Link: https://lore.kernel.org/r/20221003201957.13149-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-03Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski13-31/+343
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-10-03 We've added 143 non-merge commits during the last 27 day(s) which contain a total of 151 files changed, 8321 insertions(+), 1402 deletions(-). The main changes are: 1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu. 2) Add support for struct-based arguments for trampoline based BPF programs, from Yonghong Song. 3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa. 4) Batch of improvements to veristat selftest tool in particular to add CSV output, a comparison mode for CSV outputs and filtering, from Andrii Nakryiko. 5) Add preparatory changes needed for the BPF core for upcoming BPF HID support, from Benjamin Tissoires. 6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program types, from Daniel Xu. 7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler. 8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer / single-kernel-consumer semantics for BPF ring buffer, from David Vernet. 9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF hashtab's bucket lock, from Hou Tao. 10) Allow creating an iterator that loops through only the resources of one task/thread instead of all, from Kui-Feng Lee. 11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi. 12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT entry which is not yet inserted, from Lorenzo Bianconi. 13) Remove invalid recursion check for struct_ops for TCP congestion control BPF programs, from Martin KaFai Lau. 14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu. 15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa. 16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen. 17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu. 18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta. 19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits) net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c Documentation: bpf: Add implementation notes documentations to table of contents bpf, docs: Delete misformatted table. selftests/xsk: Fix double free bpftool: Fix error message of strerror libbpf: Fix overrun in netlink attribute iteration selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged" samples/bpf: Fix typo in xdp_router_ipv4 sample bpftool: Remove unused struct event_ring_info bpftool: Remove unused struct btf_attach_point bpf, docs: Add TOC and fix formatting. bpf, docs: Add Clang note about BPF_ALU bpf, docs: Move Clang notes to a separate file bpf, docs: Linux byteswap note bpf, docs: Move legacy packet instructions to a separate file selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline ... ==================== Link: https://lore.kernel.org/r/20221003194915.11847-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-03net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.cLorenzo Bianconi4-52/+87
Remove circular dependency between nf_nat module and nf_conntrack one moving bpf_ct_set_nat_info kfunc in nf_nat_bpf.c Fixes: 0fabd2aa199f ("net: netfilter: add bpf_ct_set_nat_info kfunc helper") Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Yauheni Kaliuta <ykaliuta@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/51a65513d2cda3eeb0754842e8025ab3966068d8.1664490511.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-03once: add DO_ONCE_SLOW() for sleepable contextsEric Dumazet1-2/+2
Christophe Leroy reported a ~80ms latency spike happening at first TCP connect() time. This is because __inet_hash_connect() uses get_random_once() to populate a perturbation table which became quite big after commit 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16") get_random_once() uses DO_ONCE(), which block hard irqs for the duration of the operation. This patch adds DO_ONCE_SLOW() which uses a mutex instead of a spinlock for operations where we prefer to stay in process context. Then __inet_hash_connect() can use get_random_slow_once() to populate its perturbation table. Fixes: 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16") Fixes: 190cc82489f4 ("tcp: change source port randomizarion at connect() time") Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/netdev/CANn89iLAEYBaoYajy0Y9UmGFff5GPxDUoG-ErVB2jDdRNQ5Tug@mail.gmail.com/T/#t Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03net/ieee802154: reject zero-sized raw_sendmsg()Tetsuo Handa1-0/+3
syzbot is hitting skb_assert_len() warning at raw_sendmsg() for ieee802154 socket. What commit dc633700f00f726e ("net/af_packet: check len when min_header_len equals to 0") does also applies to ieee802154 socket. Link: https://syzkaller.appspot.com/bug?extid=5ea725c25d06fb9114c4 Reported-by: syzbot <syzbot+5ea725c25d06fb9114c4@syzkaller.appspotmail.com> Fixes: fd1894224407c484 ("bpf: Don't redirect packets with invalid pkt_len") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03gro: add support of (hw)gro packets to gro stackCoco Li2-6/+29
Current GRO stack only supports incoming packets containing one frame/MSS. This patch changes GRO to accept packets that are already GRO. HW-GRO (aka RSC for some vendors) is very often limited in presence of interleaved packets. Linux SW GRO stack can complete the job and provide larger GRO packets, thus reducing rate of ACK packets and cpu overhead. This also means BIG TCP can still be used, even if HW-GRO/RSC was able to cook ~64 KB GRO packets. v2: fix logic in tcp_gro_receive() Only support TCP for the moment (Paolo) Co-Developed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03mptcp: update misleading comments.Paolo Abeni1-7/+7
The MPTCP data path is quite complex and hard to understend even without some foggy comments referring to modified code and/or completely misleading from the beginning. Update a few of them to more accurately describing the current status. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03mptcp: use fastclose on more edge scenariosPaolo Abeni1-19/+44
Daire reported a user-space application hang-up when the peer is forcibly closed before the data transfer completion. The relevant application expects the peer to either do an application-level clean shutdown or a transport-level connection reset. We can accommodate a such user by extending the fastclose usage: at fd close time, if the msk socket has some unread data, and at FIN_WAIT timeout. Note that at MPTCP close time we must ensure that the TCP subflows will reset: set the linger socket option to a suitable value. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03mptcp: propagate fastclose errorPaolo Abeni1-11/+36
When an mptcp socket is closed due to an incoming FASTCLOSE option, so specific sk_err is set and later syscall will fail usually with EPIPE. Align the current fastclose error handling with TCP reset, properly setting the socket error according to the current msk state and propagating such error. Additionally sendmsg() is currently not handling properly the sk_err, always returning EPIPE. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03af_unix: Fix memory leaks of the whole sk due to OOB skb.Kuniyuki Iwashima1-6/+7
syzbot reported a sequence of memory leaks, and one of them indicated we failed to free a whole sk: unreferenced object 0xffff8880126e0000 (size 1088): comm "syz-executor419", pid 326, jiffies 4294773607 (age 12.609s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 7d 00 00 00 00 00 00 00 ........}....... 01 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............ backtrace: [<000000006fefe750>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:1970 [<0000000074006db5>] sk_alloc+0x3b/0x800 net/core/sock.c:2029 [<00000000728cd434>] unix_create1+0xaf/0x920 net/unix/af_unix.c:928 [<00000000a279a139>] unix_create+0x113/0x1d0 net/unix/af_unix.c:997 [<0000000068259812>] __sock_create+0x2ab/0x550 net/socket.c:1516 [<00000000da1521e1>] sock_create net/socket.c:1566 [inline] [<00000000da1521e1>] __sys_socketpair+0x1a8/0x550 net/socket.c:1698 [<000000007ab259e1>] __do_sys_socketpair net/socket.c:1751 [inline] [<000000007ab259e1>] __se_sys_socketpair net/socket.c:1748 [inline] [<000000007ab259e1>] __x64_sys_socketpair+0x97/0x100 net/socket.c:1748 [<000000007dedddc1>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<000000007dedddc1>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 [<000000009456679f>] entry_SYSCALL_64_after_hwframe+0x63/0xcd We can reproduce this issue by creating two AF_UNIX SOCK_STREAM sockets, send()ing an OOB skb to each other, and close()ing them without consuming the OOB skbs. int skpair[2]; socketpair(AF_UNIX, SOCK_STREAM, 0, skpair); send(skpair[0], "x", 1, MSG_OOB); send(skpair[1], "x", 1, MSG_OOB); close(skpair[0]); close(skpair[1]); Currently, we free an OOB skb in unix_sock_destructor() which is called via __sk_free(), but it's too late because the receiver's unix_sk(sk)->oob_skb is accounted against the sender's sk->sk_wmem_alloc and __sk_free() is called only when sk->sk_wmem_alloc is 0. In the repro sequences, we do not consume the OOB skb, so both two sk's sock_put() never reach __sk_free() due to the positive sk->sk_wmem_alloc. Then, no one can consume the OOB skb nor call __sk_free(), and we finally leak the two whole sk. Thus, we must free the unconsumed OOB skb earlier when close()ing the socket. Fixes: 314001f0bf92 ("af_unix: Add OOB support") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03net: Add helper function to parse netlink msg of ip_tunnel_parmLiu Jian3-49/+34
Add ip_tunnel_netlink_parms to parse netlink msg of ip_tunnel_parm. Reduces duplicate code, no actual functional changes. Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03net: Add helper function to parse netlink msg of ip_tunnel_encapLiu Jian4-107/+41
Add ip_tunnel_netlink_encap_parms to parse netlink msg of ip_tunnel_encap. Reduces duplicate code, no actual functional changes. Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03net: rds: don't hold sock lock when cancelling work from ↵Tetsuo Handa1-1/+1
rds_tcp_reset_callbacks() syzbot is reporting lockdep warning at rds_tcp_reset_callbacks() [1], for commit ac3615e7f3cffe2a ("RDS: TCP: Reduce code duplication in rds_tcp_reset_callbacks()") added cancel_delayed_work_sync() into a section protected by lock_sock() without realizing that rds_send_xmit() might call lock_sock(). We don't need to protect cancel_delayed_work_sync() using lock_sock(), for even if rds_{send,recv}_worker() re-queued this work while __flush_work() from cancel_delayed_work_sync() was waiting for this work to complete, retried rds_{send,recv}_worker() is no-op due to the absence of RDS_CONN_UP bit. Link: https://syzkaller.appspot.com/bug?extid=78c55c7bc6f66e53dce2 [1] Reported-by: syzbot <syzbot+78c55c7bc6f66e53dce2@syzkaller.appspotmail.com> Co-developed-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: syzbot <syzbot+78c55c7bc6f66e53dce2@syzkaller.appspotmail.com> Fixes: ac3615e7f3cffe2a ("RDS: TCP: Reduce code duplication in rds_tcp_reset_callbacks()") Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03Merge branch 'master' of ↵David S. Miller18-256/+633
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== 1) Refactor selftests to use an array of structs in xfrm_fill_key(). From Gautam Menghani. 2) Drop an unused argument from xfrm_policy_match. From Hongbin Wang. 3) Support collect metadata mode for xfrm interfaces. From Eyal Birger. 4) Add netlink extack support to xfrm. From Sabrina Dubroca. Please note, there is a merge conflict in: include/net/dst_metadata.h between commit: 0a28bfd4971f ("net/macsec: Add MACsec skb_metadata_dst Tx Data path support") from the net-next tree and commit: 5182a5d48c3d ("net: allow storing xfrm interface metadata in metadata_dst") from the ipsec-next tree. Can be solved as done in linux-next. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-02net: sched: use tc_cls_bind_class() in filterZhengchao Shao9-54/+9
Use tc_cls_bind_class() in filter. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-02net: sched: ensure n arg not empty before call bind_classZhengchao Shao1-1/+1
All bind_class callbacks are directly returned when n arg is empty. Therefore, bind_class is invoked only when n arg is not empty. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-01net: dsa: remove bool devlink_port_setupVladimir Oltean1-8/+6
Since dsa_port_devlink_setup() and dsa_port_devlink_teardown() are already called from code paths which only execute once per port (due to the existing bool dp->setup), keeping another dp->devlink_port_setup is redundant, because we can already manage to balance the calls properly (and not call teardown when setup was never called, or call setup twice, or things like that). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01net: dsa: don't do devlink port setup earlyJiri Pirko1-94/+82
Commit 3122433eb533 ("net: dsa: Register devlink ports before calling DSA driver setup()") moved devlink port setup to be done early before driver setup() was called. That is no longer needed, so move the devlink port initialization back to dsa_port_setup(), as the first thing done there. Note there is no longer needed to reinit port as unused if dsa_port_setup() fails, as it unregisters the devlink port instance on the error path. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01net: dsa: don't leave dangling pointers in dp->pl when failingVladimir Oltean3-10/+19
There is a desire to simplify the dsa_port registration path with devlink, and this involves reworking a bit how user ports which fail to connect to their PHY (because it's missing) get reinitialized as UNUSED devlink ports. The desire is for the change to look something like this; basically dsa_port_setup() has failed, we just change dp->type and call dsa_port_setup() again. -/* Destroy the current devlink port, and create a new one which has the UNUSED - * flavour. - */ -static int dsa_port_reinit_as_unused(struct dsa_port *dp) +static int dsa_port_setup_as_unused(struct dsa_port *dp) { - dsa_port_devlink_teardown(dp); dp->type = DSA_PORT_TYPE_UNUSED; - return dsa_port_devlink_setup(dp); + return dsa_port_setup(dp); } For an UNUSED port, dsa_port_setup() mostly only calls dsa_port_devlink_setup() anyway, so we could get away with calling just that. But if we call the full blown dsa_port_setup(dp) (which will be needed to properly set dp->setup = true), the callee will have the tendency to go through this code block too, and call dsa_port_disable(dp): switch (dp->type) { case DSA_PORT_TYPE_UNUSED: dsa_port_disable(dp); break; That is not very good, because dsa_port_disable() has this hidden inside of it: if (dp->pl) phylink_stop(dp->pl); Fact is, we are not prepared to handle a call to dsa_port_disable() with a struct dsa_port that came from a previous (and failed) call to dsa_port_setup(). We do not clean up dp->pl, and this will make the second call to dsa_port_setup() call phylink_stop() on a dangling dp->pl pointer. Solve this by creating an API for phylink destruction which is symmetric to the phylink creation, and never leave dp->pl set to anything except NULL or a valid phylink structure. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01net: dsa: move port_setup/teardown to be called outside devlink port ↵Jiri Pirko1-42/+26
registered area Move port_setup() op to be called before devlink_port_register() and port_teardown() after devlink_port_unregister(). Note it makes sense to move this alongside the rest of the devlink port code, the reinit() function also gets much nicer, as clearly the fact that port_setup()->devlink_port_region_create() was called in dsa_port_setup did not fit the flow. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01net: devlink: add port_init/fini() helpers to allow ↵Jiri Pirko1-3/+43
pre-register/post-unregister functions Lifetime of some of the devlink objects, like regions, is currently forced to be different for devlink instance and devlink port instance (per-port regions). The reason is that for devlink ports, the internal structures initialization happens only after devlink_port_register() is called. To resolve this inconsistency, introduce new set of helpers to allow driver to initialize devlink pointer and region list before devlink_register() is called. That allows port regions to be created before devlink port registration and destroyed after devlink port unregistration. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01net: devlink: introduce a flag to indicate devlink port being registeredJiri Pirko1-2/+4
Instead of relying on devlink pointer not being initialized, introduce an extra flag to indicate if devlink port is registered. This is needed as later on devlink pointer is going to be initialized even in case devlink port is not registered yet. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01net: devlink: introduce port registered assert helper and use itJiri Pirko1-13/+19
Instead of checking devlink_port->devlink pointer for not being NULL which indicates that devlink port is registered, put this check to new pair of helpers similar to what we have for devlink and use them in other functions. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01Merge tag 'for-net-next-2022-09-30' of ↵Jakub Kicinski14-1649/+1565
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next - Add RTL8761BUV device (Edimax BT-8500) - Add a new PID/VID 13d3/3583 for MT7921 - Add Realtek RTL8852C support ID 0x13D3:0x3592 - Add VID/PID 0489/e0e0 for MediaTek MT7921 - Add a new VID/PID 0e8d/0608 for MT7921 - Add a new PID/VID 13d3/3578 for MT7921 - Add BT device 0cb8:c549 from RTW8852AE - Add support for Intel Magnetor * tag 'for-net-next-2022-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (49 commits) Bluetooth: hci_sync: Fix not indicating power state Bluetooth: L2CAP: Fix user-after-free Bluetooth: Call shutdown for HCI_USER_CHANNEL Bluetooth: Prevent double register of suspend Bluetooth: hci_core: Fix not handling link timeouts propertly Bluetooth: hci_event: Make sure ISO events don't affect non-ISO connections Bluetooth: hci_debugfs: Fix not checking conn->debugfs Bluetooth: hci_sysfs: Fix attempting to call device_add multiple times Bluetooth: MGMT: fix zalloc-simple.cocci warnings Bluetooth: hci_{ldisc,serdev}: check percpu_init_rwsem() failure Bluetooth: use hdev->workqueue when queuing hdev->{cmd,ncmd}_timer works Bluetooth: L2CAP: initialize delayed works at l2cap_chan_create() Bluetooth: RFCOMM: Fix possible deadlock on socket shutdown/release Bluetooth: hci_sync: allow advertise when scan without RPA Bluetooth: btusb: Add a new VID/PID 0e8d/0608 for MT7921 Bluetooth: btusb: Add a new PID/VID 13d3/3583 for MT7921 Bluetooth: avoid hci_dev_test_and_set_flag() in mgmt_init_hdev() Bluetooth: btintel: Mark Intel controller to support LE_STATES quirk Bluetooth: btintel: Add support for Magnetor Bluetooth: btusb: Add a new PID/VID 13d3/3578 for MT7921 ... ==================== Link: https://lore.kernel.org/r/20221001004602.297366-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01genetlink: reject use of nlmsg_flags for new commandsJakub Kicinski1-1/+31
Commit 9c5d03d36251 ("genetlink: start to validate reserved header bytes") introduced extra validation for genetlink headers. We had to gate it to only apply to new commands, to maintain bug-wards compatibility. Use this opportunity (before the new checks make it to Linus's tree) to add more conditions. Validate that Generic Netlink families do not use nlmsg_flags outside of the well-understood set. Link: https://lore.kernel.org/all/20220928073709.1b93b74a@kernel.org/ Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20220929142809.1167546-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01Bluetooth: hci_sync: Fix not indicating power stateLuiz Augusto von Dentz1-0/+1
When setting power state using legacy/non-mgmt API (e.g hcitool hci0 up) the likes of mgmt_set_powered_complete won't be called causing clients of the MGMT API to not be notified of the change of the state. Fixes: cf75ad8b41d2 ("Bluetooth: hci_sync: Convert MGMT_SET_POWERED") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Tedd Ho-Jeong An <tedd.an@intel.com>
2022-09-30Merge tag 'wireless-next-2022-09-30' of ↵Jakub Kicinski19-288/+766
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.1 Few stack changes and lots of driver changes in this round. brcmfmac has more activity as usual and it gets new hardware support. ath11k improves WCN6750 support and also other smaller features. And of course changes all over. Note: in early September wireless tree was merged to wireless-next to avoid some conflicts with mac80211 patches, this shouldn't cause any problems but wanted to mention anyway. Major changes: mac80211 - refactoring and preparation for Wi-Fi 7 Multi-Link Operation (MLO) feature continues brcmfmac - support CYW43439 SDIO chipset - support BCM4378 on Apple platforms - support CYW89459 PCIe chipset rtw89 - more work to get rtw8852c supported - P2P support - support for enabling and disabling MSDU aggregation via nl80211 mt76 - tx status reporting improvements ath11k - cold boot calibration support on WCN6750 - Target Wake Time (TWT) debugfs support for STA interface - support to connect to a non-transmit MBSSID AP profile - enable remain-on-channel support on WCN6750 - implement SRAM dump debugfs interface - enable threaded NAPI on all hardware - WoW support for WCN6750 - support to provide transmit power from firmware via nl80211 - support to get power save duration for each client - spectral scan support for 160 MHz wcn36xx - add SNR from a received frame as a source of system entropy * tag 'wireless-next-2022-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (231 commits) wifi: rtl8xxxu: Improve rtl8xxxu_queue_select wifi: rtl8xxxu: Fix AIFS written to REG_EDCA_*_PARAM wifi: rtl8xxxu: gen2: Enable 40 MHz channel width wifi: rtw89: 8852b: configure DLE mem wifi: rtw89: check DLE FIFO size with reserved size wifi: rtw89: mac: correct register of report IMR wifi: rtw89: pci: set power cut closed for 8852be wifi: rtw89: pci: add to do PCI auto calibration wifi: rtw89: 8852b: implement chip_ops::{enable,disable}_bb_rf wifi: rtw89: add DMA busy checking bits to chip info wifi: rtw89: mac: define DMA channel mask to avoid unsupported channels wifi: rtw89: pci: mask out unsupported TX channels iwlegacy: Replace zero-length arrays with DECLARE_FLEX_ARRAY() helper ipw2x00: Replace zero-length array with DECLARE_FLEX_ARRAY() helper wifi: iwlwifi: Track scan_cmd allocation size explicitly brcmfmac: Remove the call to "dtim_assoc" IOVAR brcmfmac: increase dcmd maximum buffer size brcmfmac: Support 89459 pcie brcmfmac: increase default max WOWL patterns to 16 cw1200: fix incorrect check to determine if no element is found in list ... ==================== Link: https://lore.kernel.org/r/20220930150413.A7984C433D6@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-30xsk: Expose min chunk size to driversMaxim Mikityanskiy1-2/+0
Drivers should be aware of the range of valid UMEM chunk sizes to be able to allocate their internal structures of an appropriate size. It will be used by mlx5e in the following patches. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> CC: "Björn Töpel" <bjorn@kernel.org> CC: Magnus Karlsson <magnus.karlsson@intel.com> CC: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-30ip6_vti:Remove the space before the commaHongbin Wang1-1/+1
There should be no space before the comma Signed-off-by: Hongbin Wang <wh_bin@126.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30Merge branch 'master' of ↵David S. Miller4-7/+22
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2022-09-29 1) Use the inner instead of the outer protocol for GSO on inter address family tunnels. This fixes the GSO case for address family tunnels. From Sabrina Dubroca. 2) Reset ipcomp_scratches with NULL when freed, otherwise it holds obsolete address. From Khalid Masum. 3) Reinject transport-mode packets through workqueue instead of a tasklet. The tasklet might take too long to finish. From Liu Jian. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limitedNeal Cardwell2-7/+14
This commit fixes a bug in the tracking of max_packets_out and is_cwnd_limited. This bug can cause the connection to fail to remember that is_cwnd_limited is true, causing the connection to fail to grow cwnd when it should, causing throughput to be lower than it should be. The following event sequence is an example that triggers the bug: (a) The connection is cwnd_limited, but packets_out is not at its peak due to TSO deferral deciding not to send another skb yet. In such cases the connection can advance max_packets_seq and set tp->is_cwnd_limited to true and max_packets_out to a small number. (b) Then later in the round trip the connection is pacing-limited (not cwnd-limited), and packets_out is larger. In such cases the connection would raise max_packets_out to a bigger number but (unexpectedly) flip tp->is_cwnd_limited from true to false. This commit fixes that bug. One straightforward fix would be to separately track (a) the next window after max_packets_out reaches a maximum, and (b) the next window after tp->is_cwnd_limited is set to true. But this would require consuming an extra u32 sequence number. Instead, to save space we track only the most important information. Specifically, we track the strongest available signal of the degree to which the cwnd is fully utilized: (1) If the connection is cwnd-limited then we remember that fact for the current window. (2) If the connection not cwnd-limited then we track the maximum number of outstanding packets in the current window. In particular, note that the new logic cannot trigger the buggy (a)/(b) sequence above because with the new logic a condition where tp->packets_out > tp->max_packets_out can only trigger an update of tp->is_cwnd_limited if tp->is_cwnd_limited is false. This first showed up in a testing of a BBRv2 dev branch, but this buggy behavior highlighted a general issue with the tcp_cwnd_validate() logic that can cause cwnd to fail to increase at the proper rate for any TCP congestion control, including Reno or CUBIC. Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30sctp: handle the error returned from sctp_auth_asoc_init_active_keyXin Long1-4/+14
When it returns an error from sctp_auth_asoc_init_active_key(), the active_key is actually not updated. The old sh_key will be freeed while it's still used as active key in asoc. Then an use-after-free will be triggered when sending patckets, as found by syzbot: sctp_auth_shkey_hold+0x22/0xa0 net/sctp/auth.c:112 sctp_set_owner_w net/sctp/socket.c:132 [inline] sctp_sendmsg_to_asoc+0xbd5/0x1a20 net/sctp/socket.c:1863 sctp_sendmsg+0x1053/0x1d50 net/sctp/socket.c:2025 inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 This patch is to fix it by not replacing the sh_key when it returns errors from sctp_auth_asoc_init_active_key() in sctp_auth_set_key(). For sctp_auth_set_active_key(), old active_key_id will be set back to asoc->active_key_id when the same thing happens. Fixes: 58acd1009226 ("sctp: update active_key for asoc when old key is being replaced") Reported-by: syzbot+a236dd8e9622ed8954a3@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net: bridge: assign path_cost for 2.5G and 5G link speedSteven Hsieh1-1/+10
As 2.5G, 5G ethernet ports are more common and affordable, these ports are being used in LAN bridge devices. STP port_cost() is missing path_cost assignment for these link speeds, causes highest cost 100 being used. This result in lower speed port being picked when there is loop between 5G and 1G ports. Original path_cost: 10G=2, 1G=4, 100m=19, 10m=100 Adjusted path_cost: 10G=2, 5G=3, 2.5G=4, 1G=5, 100m=19, 10m=100 speed greater than 10G = 1 Signed-off-by: Steven Hsieh <steven.hsieh@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net-next: skbuff: refactor pskb_pullRichard Gobert3-4/+4
pskb_may_pull already contains all of the checks performed by pskb_pull. Use pskb_may_pull for validation in pskb_pull, eliminating the duplication and making __pskb_pull obsolete. Replace __pskb_pull with pskb_pull where applicable. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net-sysfs: Convert to use sysfs_emit() APIsWang Yufen1-29/+29
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net/sched: taprio: allow user input of per-tc max SDUVladimir Oltean1-1/+151
IEEE 802.1Q clause 12.29.1.1 "The queueMaxSDUTable structure and data types" and 8.6.8.4 "Enhancements for scheduled traffic" talk about the existence of a per traffic class limitation of maximum frame sizes, with a fallback on the port-based MTU. As far as I am able to understand, the 802.1Q Service Data Unit (SDU) represents the MAC Service Data Unit (MSDU, i.e. L2 payload), excluding any number of prepended VLAN headers which may be otherwise present in the MSDU. Therefore, the queueMaxSDU is directly comparable to the device MTU (1500 means L2 payload sizes are accepted, or frame sizes of 1518 octets, or 1522 plus one VLAN header). Drivers which offload this are directly responsible of translating into other units of measurement. To keep the fast path checks optimized, we keep 2 arrays in the qdisc, one for max_sdu translated into frame length (so that it's comparable to skb->len), and another for offloading and for dumping back to the user. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-30net/sched: query offload capabilities through ndo_setup_tc()Vladimir Oltean1-0/+17
When adding optional new features to Qdisc offloads, existing drivers must reject the new configuration until they are coded up to act on it. Since modifying all drivers in lockstep with the changes in the Qdisc can create problems of its own, it would be nice if there existed an automatic opt-in mechanism for offloading optional features. Jakub proposes that we multiplex one more kind of call through ndo_setup_tc(): one where the driver populates a Qdisc-specific capability structure. First user will be taprio in further changes. Here we are introducing the definitions for the base functionality. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220923163310.3192733-3-vladimir.oltean@nxp.com/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-30net/tipc: Remove unused struct distr_queue_itemYuan Can1-8/+0
After commit 09b5678c778f("tipc: remove dead code in tipc_net and relatives"), struct distr_queue_item is not used any more and can be removed as well. Signed-off-by: Yuan Can <yuancan@huawei.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Link: https://lore.kernel.org/r/20220928085636.71749-1-yuancan@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-30net: skb: introduce and use a single page frag cachePaolo Abeni2-22/+103
After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") we are observing 10-20% regressions in performance tests with small packets. The perf trace points to high pressure on the slab allocator. This change tries to improve the allocation schema for small packets using an idea originally suggested by Eric: a new per CPU page frag is introduced and used in __napi_alloc_skb to cope with small allocation requests. To ensure that the above does not lead to excessive truesize underestimation, the frag size for small allocation is inflated to 1K and all the above is restricted to build with 4K page size. Note that we need to update accordingly the run-time check introduced with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper"). Alex suggested a smart page refcount schema to reduce the number of atomic operations and deal properly with pfmemalloc pages. Under small packet UDP flood, I measure a 15% peak tput increases. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Alexander H Duyck <alexanderduyck@fb.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-30net: sched: cls_u32: Avoid memcpy() false-positive warningKees Cook1-1/+5
To work around a misbehavior of the compiler's ability to see into composite flexible array structs (as detailed in the coming memcpy() hardening series[1]), use unsafe_memcpy(), as the sizing, bounds-checking, and allocation are all very tightly coupled here. This silences the false-positive reported by syzbot: memcpy: detected field-spanning write (size 80) of single field "&n->sel" at net/sched/cls_u32.c:1043 (size 16) [1] https://lore.kernel.org/linux-hardening/20220901065914.1417829-2-keescook@chromium.org Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Reported-by: syzbot+a2c4601efc75848ba321@syzkaller.appspotmail.com Link: https://lore.kernel.org/lkml/000000000000a96c0b05e97f0444@google.com/ Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20220927153700.3071688-1-keescook@chromium.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-30vhost/vsock: Use kvmalloc/kvfree for larger packets.Junichi Uekawa1-1/+1
When copying a large file over sftp over vsock, data size is usually 32kB, and kmalloc seems to fail to try to allocate 32 32kB regions. vhost-5837: page allocation failure: order:4, mode:0x24040c0 Call Trace: [<ffffffffb6a0df64>] dump_stack+0x97/0xdb [<ffffffffb68d6aed>] warn_alloc_failed+0x10f/0x138 [<ffffffffb68d868a>] ? __alloc_pages_direct_compact+0x38/0xc8 [<ffffffffb664619f>] __alloc_pages_nodemask+0x84c/0x90d [<ffffffffb6646e56>] alloc_kmem_pages+0x17/0x19 [<ffffffffb6653a26>] kmalloc_order_trace+0x2b/0xdb [<ffffffffb66682f3>] __kmalloc+0x177/0x1f7 [<ffffffffb66e0d94>] ? copy_from_iter+0x8d/0x31d [<ffffffffc0689ab7>] vhost_vsock_handle_tx_kick+0x1fa/0x301 [vhost_vsock] [<ffffffffc06828d9>] vhost_worker+0xf7/0x157 [vhost] [<ffffffffb683ddce>] kthread+0xfd/0x105 [<ffffffffc06827e2>] ? vhost_dev_set_owner+0x22e/0x22e [vhost] [<ffffffffb683dcd1>] ? flush_kthread_worker+0xf3/0xf3 [<ffffffffb6eb332e>] ret_from_fork+0x4e/0x80 [<ffffffffb683dcd1>] ? flush_kthread_worker+0xf3/0xf3 Work around by doing kvmalloc instead. Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko") Signed-off-by: Junichi Uekawa <uekawa@chromium.org> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20220928064538.667678-1-uekawa@chromium.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski11-47/+45
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29Bluetooth: L2CAP: Fix user-after-freeLuiz Augusto von Dentz1-0/+7
This uses l2cap_chan_hold_unless_zero() after calling __l2cap_get_chan_blah() to prevent the following trace: Bluetooth: l2cap_core.c:static void l2cap_chan_destroy(struct kref *kref) Bluetooth: chan 0000000023c4974d Bluetooth: parent 00000000ae861c08 ================================================================== BUG: KASAN: use-after-free in __mutex_waiter_is_first kernel/locking/mutex.c:191 [inline] BUG: KASAN: use-after-free in __mutex_lock_common kernel/locking/mutex.c:671 [inline] BUG: KASAN: use-after-free in __mutex_lock+0x278/0x400 kernel/locking/mutex.c:729 Read of size 8 at addr ffff888006a49b08 by task kworker/u3:2/389 Link: https://lore.kernel.org/lkml/20220622082716.478486-1-lee.jones@linaro.org Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sungwoo Kim <iam@sung-woo.kim>