summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-10-27net/tcp: Introduce TCP_AO setsockopt()sDmitry Safonov11-18/+952
Add 3 setsockopt()s: 1. TCP_AO_ADD_KEY to add a new Master Key Tuple (MKT) on a socket 2. TCP_AO_DEL_KEY to delete present MKT from a socket 3. TCP_AO_INFO to change flags, Current_key/RNext_key on a TCP-AO sk Userspace has to introduce keys on every socket it wants to use TCP-AO option on, similarly to TCP_MD5SIG/TCP_MD5SIG_EXT. RFC5925 prohibits definition of MKTs that would match the same peer, so do sanity checks on the data provided by userspace. Be as conservative as possible, including refusal of defining MKT on an established connection with no AO, removing the key in-use and etc. (1) and (2) are to be used by userspace key manager to add/remove keys. (3) main purpose is to set RNext_key, which (as prescribed by RFC5925) is the KeyID that will be requested in TCP-AO header from the peer to sign their segments with. At this moment the life of ao_info ends in tcp_v4_destroy_sock(). Co-developed-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Co-developed-by: Salam Noureddine <noureddine@arista.com> Signed-off-by: Salam Noureddine <noureddine@arista.com> Signed-off-by: Dmitry Safonov <dima@arista.com> Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27net/tcp: Add TCP-AO config and structuresDmitry Safonov5-8/+114
Introduce new kernel config option and common structures as well as helpers to be used by TCP-AO code. Co-developed-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Co-developed-by: Salam Noureddine <noureddine@arista.com> Signed-off-by: Salam Noureddine <noureddine@arista.com> Signed-off-by: Dmitry Safonov <dima@arista.com> Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27net/tcp: Prepare tcp_md5sig_pool for TCP-AODmitry Safonov8-211/+525
TCP-AO, similarly to TCP-MD5, needs to allocate tfms on a slow-path, which is setsockopt() and use crypto ahash requests on fast paths, which are RX/TX softirqs. Also, it needs a temporary/scratch buffer for preparing the hash. Rework tcp_md5sig_pool in order to support other hashing algorithms than MD5. It will make it possible to share pre-allocated crypto_ahash descriptors and scratch area between all TCP hash users. Internally tcp_sigpool calls crypto_clone_ahash() API over pre-allocated crypto ahash tfm. Kudos to Herbert, who provided this new crypto API. I was a little concerned over GFP_ATOMIC allocations of ahash and crypto_request in RX/TX (see tcp_sigpool_start()), so I benchmarked both "backends" with different algorithms, using patched version of iperf3[2]. On my laptop with i7-7600U @ 2.80GHz: clone-tfm per-CPU-requests TCP-MD5 2.25 Gbits/sec 2.30 Gbits/sec TCP-AO(hmac(sha1)) 2.53 Gbits/sec 2.54 Gbits/sec TCP-AO(hmac(sha512)) 1.67 Gbits/sec 1.64 Gbits/sec TCP-AO(hmac(sha384)) 1.77 Gbits/sec 1.80 Gbits/sec TCP-AO(hmac(sha224)) 1.29 Gbits/sec 1.30 Gbits/sec TCP-AO(hmac(sha3-512)) 481 Mbits/sec 480 Mbits/sec TCP-AO(hmac(md5)) 2.07 Gbits/sec 2.12 Gbits/sec TCP-AO(hmac(rmd160)) 1.01 Gbits/sec 995 Mbits/sec TCP-AO(cmac(aes128)) [not supporetd yet] 2.11 Gbits/sec So, it seems that my concerns don't have strong grounds and per-CPU crypto_request allocation can be dropped/removed from tcp_sigpool once ciphers get crypto_clone_ahash() support. [1]: https://lore.kernel.org/all/ZDefxOq6Ax0JeTRH@gondor.apana.org.au/T/#u [2]: https://github.com/0x7f454c46/iperf/tree/tcp-md5-ao Signed-off-by: Dmitry Safonov <dima@arista.com> Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com> Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27MAINTAINERS: Remove linuxwwan@intel.com mailing listBagas Sanjaya1-3/+0
Messages submitted to the ML bounce (address not found error). In fact, the ML was mistagged as person maintainer instead of mailing list. Remove the ML to keep Cc: lists a bit shorter and not to spam everyone's inbox with postmaster notifications. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231025130332.67995-2-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27Merge branch 'intel-wired-lan-driver-updates-for-2023-10-25-ice'Jakub Kicinski12-171/+650
Jacob Keller says: ==================== Intel Wired LAN Driver Updates for 2023-10-25 (ice) This series extends the ice driver with basic support for the E830 device line. It does not include support for all device features, but enables basic functionality to load and pass traffic. Alice adds the 200G speed and PHY types supported by E830 hardware. Dan extends the DDP package logic to support the E830 package segment. Paul adds the basic registers and macros used by E830 hardware, and adds support for handling variable length link status information from firmware. Pawel removes some redundant zeroing of the PCI IDs list, and extends the list to include the E830 device IDs. ==================== Link: https://lore.kernel.org/r/20231025214157.1222758-1-jacob.e.keller@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27ice: Hook up 4 E830 devices by adding their IDsPawel Chmielewski1-0/+4
As the previous patches provide support for E830 hardware, add E830 specific IDs to the PCI device ID table, so these devices can now be probed by the kernel. Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Tony Brelinski <tony.brelinski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231025214157.1222758-7-jacob.e.keller@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27ice: Remove redundant zeroing of the fields.Pawel Chmielewski1-27/+27
Remove zeroing of the fields, as all the fields are in fact initialized with zeros automatically Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Tony Brelinski <tony.brelinski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231025214157.1222758-6-jacob.e.keller@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27ice: Add support for E830 DDP package segmentDan Nowlin3-74/+382
Add support for E830 DDP package segment. For the E830 package, signature buffers will not be included inline in the configuration buffers. Instead, the signature buffers will be located in a signature segment. Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Dan Nowlin <dan.nowlin@intel.com> Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Tony Brelinski <tony.brelinski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231025214157.1222758-5-jacob.e.keller@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27ice: Add ice_get_link_status_datalenPaul Greenwalt2-6/+53
The Get Link Status data length can vary with different versions of ice_aqc_get_link_status_data. Add ice_get_link_status_datalen() to return datalen for the specific ice_aqc_get_link_status_data version. Add new link partner fields to ice_aqc_get_link_status_data; PHY type, FEC, and flow control. Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Co-developed-by: Pawel Chmielewski <pawel.chmielewski@intel.com> Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Tony Brelinski <tony.brelinski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231025214157.1222758-4-jacob.e.keller@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27ice: Add 200G speed/phy type useAlice Michael4-3/+43
Add the support for 200G phy speeds and the mapping for their advertisement in link. Add the new PHY type bits for AQ command, as needed for 200G E830 controllers. Signed-off-by: Alice Michael <alice.michael@intel.com> Co-developed-by: Pawel Chmielewski <pawel.chmielewski@intel.com> Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Tony Brelinski <tony.brelinski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231025214157.1222758-3-jacob.e.keller@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27ice: Add E830 device IDs, MAC type and registersPaul Greenwalt7-61/+141
E830 is the 200G NIC family which uses the ice driver. Add specific E830 registers. Embed macros to use proper register based on (hw)->mac_type & name those macros to [ORIGINAL]_BY_MAC(hw). Registers only available on one of the macs will need to be explicitly referred to as E800_NAME instead of just NAME. PTP is not yet supported. Co-developed-by: Milena Olech <milena.olech@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Co-developed-by: Dan Nowlin <dan.nowlin@intel.com> Signed-off-by: Dan Nowlin <dan.nowlin@intel.com> Co-developed-by: Scott Taylor <scott.w.taylor@intel.com> Signed-off-by: Scott Taylor <scott.w.taylor@intel.com> Co-developed-by: Pawel Chmielewski <pawel.chmielewski@intel.com> Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Tony Brelinski <tony.brelinski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231025214157.1222758-2-jacob.e.keller@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27Merge tag 'wireless-next-2023-10-26' of ↵Jakub Kicinski193-1726/+3560
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.7 The third, and most likely the last, features pull request for v6.7. Fixes all over and only few small new features. Major changes: iwlwifi - more Multi-Link Operation (MLO) work ath12k - QCN9274: mesh support ath11k - firmware-2.bin container file format support * tag 'wireless-next-2023-10-26' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (155 commits) wifi: ray_cs: Remove unnecessary (void*) conversions Revert "wifi: ath11k: call ath11k_mac_fils_discovery() without condition" wifi: ath12k: Introduce and use ath12k_sta_to_arsta() wifi: ath12k: fix htt mlo-offset event locking wifi: ath12k: fix dfs-radar and temperature event locking wifi: ath11k: fix gtk offload status event locking wifi: ath11k: fix htt pktlog locking wifi: ath11k: fix dfs radar event locking wifi: ath11k: fix temperature event locking wifi: ath12k: rename the sc naming convention to ab wifi: ath12k: rename the wmi_sc naming convention to wmi_ab wifi: ath11k: add firmware-2.bin support wifi: ath11k: qmi: refactor ath11k_qmi_m3_load() wifi: rtw89: cleanup firmware elements parsing wifi: rt2x00: rework MT7620 PA/LNA RF calibration wifi: rt2x00: rework MT7620 channel config function wifi: rt2x00: improve MT7620 register initialization MAINTAINERS: wifi: rt2x00: drop Helmut Schaa wifi: wlcore: main: replace deprecated strncpy with strscpy wifi: wlcore: boot: replace deprecated strncpy with strscpy ... ==================== Link: https://lore.kernel.org/r/20231026090411.B2426C433CB@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27Merge tag 'for-netdev' of ↵Jakub Kicinski75-200/+5037
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-26 We've added 51 non-merge commits during the last 10 day(s) which contain a total of 75 files changed, 5037 insertions(+), 200 deletions(-). The main changes are: 1) Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF, from Chuyi Zhou. 2) Fix BPF verifier's iterator convergence logic to use exact states comparison for convergence checks, from Eduard Zingerman, Andrii Nakryiko and Alexei Starovoitov. 3) Add BPF programmable net device where bpf_mprog defines the logic of its xmit routine. It can operate in L3 and L2 mode, from Daniel Borkmann and Nikolay Aleksandrov. 4) Batch of fixes for BPF per-CPU kptr and re-enable unit_size checking for global per-CPU allocator, from Hou Tao. 5) Fix libbpf which eagerly assumed that SHT_GNU_verdef ELF section was going to be present whenever a binary has SHT_GNU_versym section, from Andrii Nakryiko. 6) Fix BPF ringbuf correctness to fold smp_mb__before_atomic() into atomic_set_release(), from Paul E. McKenney. 7) Add a warning if NAPI callback missed xdp_do_flush() under CONFIG_DEBUG_NET which helps checking if drivers were missing the former, from Sebastian Andrzej Siewior. 8) Fix missed RCU read-lock in bpf_task_under_cgroup() which was throwing a warning under sleepable programs, from Yafang Shao. 9) Avoid unnecessary -EBUSY from htab_lock_bucket by disabling IRQ before checking map_locked, from Song Liu. 10) Make BPF CI linked_list failure test more robust, from Kumar Kartikeya Dwivedi. 11) Enable samples/bpf to be built as PIE in Fedora, from Viktor Malik. 12) Fix xsk starving when multiple xsk sockets were associated with a single xsk_buff_pool, from Albert Huang. 13) Clarify the signed modulo implementation for the BPF ISA standardization document that it uses truncated division, from Dave Thaler. 14) Improve BPF verifier's JEQ/JNE branch taken logic to also consider signed bounds knowledge, from Andrii Nakryiko. 15) Add an option to XDP selftests to use multi-buffer AF_XDP xdp_hw_metadata and mark used XDP programs as capable to use frags, from Larysa Zaremba. 16) Fix bpftool's BTF dumper wrt printing a pointer value and another one to fix struct_ops dump in an array, from Manu Bretelle. * tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (51 commits) netkit: Remove explicit active/peer ptr initialization selftests/bpf: Fix selftests broken by mitigations=off samples/bpf: Allow building with custom bpftool samples/bpf: Fix passing LDFLAGS to libbpf samples/bpf: Allow building with custom CFLAGS/LDFLAGS bpf: Add more WARN_ON_ONCE checks for mismatched alloc and free selftests/bpf: Add selftests for netkit selftests/bpf: Add netlink helper library bpftool: Extend net dump with netkit progs bpftool: Implement link show support for netkit libbpf: Add link-based API for netkit tools: Sync if_link uapi header netkit, bpf: Add bpf programmable net device bpf: Improve JEQ/JNE branch taken logic bpf: Fold smp_mb__before_atomic() into atomic_set_release() bpf: Fix unnecessary -EBUSY from htab_lock_bucket xsk: Avoid starving the xsk further down the list bpf: print full verifier states on infinite loop detection selftests/bpf: test if state loops are detected in a tricky case bpf: correct loop detection for iterators convergence ... ==================== Link: https://lore.kernel.org/r/20231026150509.2824-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27MAINTAINERS: Maintainer change for ptp_vmw driverAlexey Makhalov1-1/+1
Deep has decided to transfer the maintainership of the VMware virtual PTP clock driver (ptp_vmw) to Jeff. Update the MAINTAINERS file to reflect this change. Signed-off-by: Alexey Makhalov <amakhalov@vmware.com> Acked-by: Deep Shah <sdeep@vmware.com> Acked-by: Jeff Sipek <jsipek@vmware.com> Link: https://lore.kernel.org/r/20231025231931.76842-1-amakhalov@vmware.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27bnxt_en: Fix 2 stray ethtool -S countersMichael Chan1-6/+22
The recent firmware interface change has added 2 counters in struct rx_port_stats_ext. This caused 2 stray ethtool counters to be displayed. Since new counters are added from time to time, fix it so that the ethtool logic will only display up to the maximum known counters. These 2 counters are not used by production firmware yet. Fixes: 754fbf604ff6 ("bnxt_en: Update firmware interface to 1.10.2.171") Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20231026013231.53271-1-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27tools: ynl-gen: respect attr-cnt-name at the attr set levelJakub Kicinski1-3/+4
Davide reports that we look for the attr-cnt-name in the wrong object. We try to read it from the family, but the schema only allows for it to exist at attr-set level. Reported-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/all/CAKa-r6vCj+gPEUKpv7AsXqM77N6pB0evuh7myHq=585RA3oD5g@mail.gmail.com/ Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231025182739.184706-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27netlink: specs: support conditional operationsJakub Kicinski4-0/+37
Page pool code is compiled conditionally, but the operations are part of the shared netlink family. We can handle this by reporting empty list of pools or -EOPNOTSUPP / -ENOSYS but the cleanest way seems to be removing the ops completely at compilation time. That way user can see that the page pool ops are not present using genetlink introspection. Same way they'd check if the kernel is "new enough" to support the ops. Extend the specs with the ability to specify the config condition under which op (and its policies, etc.) should be hidden. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231025162253.133159-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27netlink: make range pointers in policies constJakub Kicinski10-11/+11
struct nla_policy is usually constant itself, but unless we make the ranges inside constant we won't be able to make range structs const. The ranges are not modified by the core. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231025162204.132528-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27net/mlx5: fix uninit value usePrzemek Kitszel2-3/+11
Avoid use of uninitialized state variable. In case of mlx5e_tx_reporter_build_diagnose_output_sq_common() it's better to still collect other data than bail out entirely. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/netdev/8bd30131-c9f2-4075-a575-7fa2793a1760@moroto.mountain Fixes: d17f98bf7cc9 ("net/mlx5: devlink health: use retained error fmsg API") Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://lore.kernel.org/r/20231025145050.36114-1-przemyslaw.kitszel@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski205-760/+1934
Cross-merge networking fixes after downstream PR. Conflicts: net/mac80211/rx.c 91535613b609 ("wifi: mac80211: don't drop all unprotected public action frames") 6c02fab72429 ("wifi: mac80211: split ieee80211_drop_unencrypted_mgmt() return value") Adjacent changes: drivers/net/ethernet/apm/xgene/xgene_enet_main.c 61471264c018 ("net: ethernet: apm: Convert to platform remove callback returning void") d2ca43f30611 ("net: xgene: Fix unused xgene_enet_of_match warning for !CONFIG_OF") net/vmw_vsock/virtio_transport.c 64c99d2d6ada ("vsock/virtio: support to send non-linear skb") 53b08c498515 ("vsock/virtio: initialize the_virtio_vsock before using VQs") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26Merge tag 'net-6.6-rc8' of ↵Linus Torvalds34-214/+458
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from WiFi and netfilter. Most regressions addressed here come from quite old versions, with the exceptions of the iavf one and the WiFi fixes. No known outstanding reports or investigation. Fixes to fixes: - eth: iavf: in iavf_down, disable queues when removing the driver Previous releases - regressions: - sched: act_ct: additional checks for outdated flows - tcp: do not leave an empty skb in write queue - tcp: fix wrong RTO timeout when received SACK reneging - wifi: cfg80211: pass correct pointer to rdev_inform_bss() - eth: i40e: sync next_to_clean and next_to_process for programming status desc - eth: iavf: initialize waitqueues before starting watchdog_task Previous releases - always broken: - eth: r8169: fix data-races - eth: igb: fix potential memory leak in igb_add_ethtool_nfc_entry - eth: r8152: avoid writing garbage to the adapter's registers - eth: gtp: fix fragmentation needed check with gso" * tag 'net-6.6-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (43 commits) iavf: in iavf_down, disable queues when removing the driver vsock/virtio: initialize the_virtio_vsock before using VQs net: ipv6: fix typo in comments net: ipv4: fix typo in comments net/sched: act_ct: additional checks for outdated flows netfilter: flowtable: GC pushes back packets to classic path i40e: Fix wrong check for I40E_TXR_FLAGS_WB_ON_ITR gtp: fix fragmentation needed check with gso gtp: uapi: fix GTPA_MAX Fix NULL pointer dereference in cn_filter() sfc: cleanup and reduce netlink error messages net/handshake: fix file ref count in handshake_nl_accept_doit() wifi: mac80211: don't drop all unprotected public action frames wifi: cfg80211: fix assoc response warning on failed links wifi: cfg80211: pass correct pointer to rdev_inform_bss() isdn: mISDN: hfcsusb: Spelling fix in comment tcp: fix wrong RTO timeout when received SACK reneging r8152: Block future register access if register access fails r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE r8152: Check for unplug in r8153b_ups_en() / r8153c_ups_en() ...
2023-10-26netkit: Remove explicit active/peer ptr initializationNikolay Aleksandrov1-4/+0
Remove the explicit NULLing of active/peer pointers and rely on the implicit one done at net device allocation. Suggested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231026094106.1505892-2-razor@blackwall.org
2023-10-26selftests/bpf: Fix selftests broken by mitigations=offYafang Shao1-1/+32
When we configure the kernel command line with 'mitigations=off' and set the sysctl knob 'kernel.unprivileged_bpf_disabled' to 0, the commit bc5bc309db45 ("bpf: Inherit system settings for CPU security mitigations") causes issues in the execution of `test_progs -t verifier`. This is because 'mitigations=off' bypasses Spectre v1 and Spectre v4 protections. Currently, when a program requests to run in unprivileged mode (kernel.unprivileged_bpf_disabled = 0), the BPF verifier may prevent it from running due to the following conditions not being enabled: - bypass_spec_v1 - bypass_spec_v4 - allow_ptr_leaks - allow_uninit_stack While 'mitigations=off' enables the first two conditions, it does not enable the latter two. As a result, some test cases in 'test_progs -t verifier' that were expected to fail to run may run successfully, while others still fail but with different error messages. This makes it challenging to address them comprehensively. Moreover, in the future, we may introduce more fine-grained control over CPU mitigations, such as enabling only bypass_spec_v1 or bypass_spec_v4. Given the complexity of the situation, rather than fixing each broken test case individually, it's preferable to skip them when 'mitigations=off' is in effect and introduce specific test cases for the new 'mitigations=off' scenario. For instance, we can introduce new BTF declaration tags like '__failure__nospec', '__failure_nospecv1' and '__failure_nospecv4'. In this patch, the approach is to simply skip the broken test cases when 'mitigations=off' is enabled. The result of `test_progs -t verifier` as follows after this commit, Before this commit ================== - without 'mitigations=off' - kernel.unprivileged_bpf_disabled = 2 Summary: 74/948 PASSED, 388 SKIPPED, 0 FAILED - kernel.unprivileged_bpf_disabled = 0 Summary: 74/1336 PASSED, 0 SKIPPED, 0 FAILED <<<< - with 'mitigations=off' - kernel.unprivileged_bpf_disabled = 2 Summary: 74/948 PASSED, 388 SKIPPED, 0 FAILED - kernel.unprivileged_bpf_disabled = 0 Summary: 63/1276 PASSED, 0 SKIPPED, 11 FAILED <<<< 11 FAILED After this commit ================= - without 'mitigations=off' - kernel.unprivileged_bpf_disabled = 2 Summary: 74/948 PASSED, 388 SKIPPED, 0 FAILED - kernel.unprivileged_bpf_disabled = 0 Summary: 74/1336 PASSED, 0 SKIPPED, 0 FAILED <<<< - with this patch, with 'mitigations=off' - kernel.unprivileged_bpf_disabled = 2 Summary: 74/948 PASSED, 388 SKIPPED, 0 FAILED - kernel.unprivileged_bpf_disabled = 0 Summary: 74/948 PASSED, 388 SKIPPED, 0 FAILED <<<< SKIPPED Fixes: bc5bc309db45 ("bpf: Inherit system settings for CPU security mitigations") Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Closes: https://lore.kernel.org/bpf/CAADnVQKUBJqg+hHtbLeeC2jhoJAWqnmRAzXW3hmUCNSV9kx4sQ@mail.gmail.com Link: https://lore.kernel.org/bpf/20231025031144.5508-1-laoar.shao@gmail.com
2023-10-26samples/bpf: Allow building with custom bpftoolViktor Malik1-2/+3
samples/bpf build its own bpftool boostrap to generate vmlinux.h as well as some BPF objects. This is a redundant step if bpftool has been already built, so update samples/bpf/Makefile such that it accepts a path to bpftool passed via the BPFTOOL variable. The approach is practically the same as tools/testing/selftests/bpf/Makefile uses. Signed-off-by: Viktor Malik <vmalik@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/bd746954ac271b02468d8d951ff9f11e655d485b.1698213811.git.vmalik@redhat.com
2023-10-26samples/bpf: Fix passing LDFLAGS to libbpfViktor Malik1-1/+1
samples/bpf/Makefile passes LDFLAGS=$(TPROGS_LDFLAGS) to libbpf build without surrounding quotes, which may cause compilation errors when passing custom TPROGS_USER_LDFLAGS. For example: $ make -C samples/bpf/ TPROGS_USER_LDFLAGS="-Wl,--as-needed -specs=/usr/lib/gcc/x86_64-redhat-linux/13/libsanitizer.spec" make: Entering directory './samples/bpf' make -C ../../ M=./samples/bpf BPF_SAMPLES_PATH=./samples/bpf make[1]: Entering directory '.' make -C ./samples/bpf/../../tools/lib/bpf RM='rm -rf' EXTRA_CFLAGS="-Wall -O2 -Wmissing-prototypes -Wstrict-prototypes -I./usr/include -I./tools/testing/selftests/bpf/ -I./samples/bpf/libbpf/include -I./tools/include -I./tools/perf -I./tools/lib -DHAVE_ATTR_TEST=0" \ LDFLAGS=-Wl,--as-needed -specs=/usr/lib/gcc/x86_64-redhat-linux/13/libsanitizer.spec srctree=./samples/bpf/../../ \ O= OUTPUT=./samples/bpf/libbpf/ DESTDIR=./samples/bpf/libbpf prefix= \ ./samples/bpf/libbpf/libbpf.a install_headers make: invalid option -- 'c' make: invalid option -- '=' make: invalid option -- '/' make: invalid option -- 'u' make: invalid option -- '/' [...] Fix the error by properly quoting $(TPROGS_LDFLAGS). Suggested-by: Donald Zickus <dzickus@redhat.com> Signed-off-by: Viktor Malik <vmalik@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/c690de6671cc6c983d32a566d33fd7eabd18b526.1698213811.git.vmalik@redhat.com
2023-10-26samples/bpf: Allow building with custom CFLAGS/LDFLAGSViktor Malik1-1/+4
Currently, it is not possible to specify custom flags when building samples/bpf. The flags are defined in TPROGS_CFLAGS/TPROGS_LDFLAGS variables, however, when trying to override those from the make command, compilation fails. For example, when trying to build with PIE: $ make -C samples/bpf TPROGS_CFLAGS="-fpie" TPROGS_LDFLAGS="-pie" This is because samples/bpf/Makefile updates these variables, especially appends include paths to TPROGS_CFLAGS and these updates are overridden by setting the variables from the make command. This patch introduces variables TPROGS_USER_CFLAGS/TPROGS_USER_LDFLAGS for this purpose, which can be set from the make command and their values are propagated to TPROGS_CFLAGS/TPROGS_LDFLAGS. Signed-off-by: Viktor Malik <vmalik@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/2d81100b830a71f0e72329cc7781edaefab75f62.1698213811.git.vmalik@redhat.com
2023-10-26bareudp: use ports to lookup routeBeniamino Galvani1-13/+16
The source and destination ports should be taken into account when determining the route destination; they can affect the result, for example in case there are routing rules defined. Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231025094441.417464-1-b.galvani@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-26bpf: Add more WARN_ON_ONCE checks for mismatched alloc and freeHou Tao1-0/+4
There are two possible mismatched alloc and free cases in BPF memory allocator: 1) allocate from cache X but free by cache Y with a different unit_size 2) allocate from per-cpu cache but free by kmalloc cache or vice versa So add more WARN_ON_ONCE checks in free_bulk() and __free_by_rcu() to spot these mismatched alloc and free early. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231021014959.3563841-1-houtao@huaweicloud.com
2023-10-26Merge tag 'nf-next-23-10-25' of ↵Paolo Abeni12-502/+558
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next. Mostly nf_tables updates with two patches for connlabel and br_netfilter. 1) Rename function name to perform on-demand GC for rbtree elements, and replace async GC in rbtree by sync GC. Patches from Florian Westphal. 2) Use commit_mutex for NFT_MSG_GETRULE_RESET to ensure that two concurrent threads invoking this command do not underrun stateful objects. Patches from Phil Sutter. 3) Use single hook to deal with IP and ARP packets in br_netfilter. Patch from Florian Westphal. 4) Use atomic_t in netns->connlabel use counter instead of using a spinlock, also patch from Florian. 5) Cleanups for stateful objects infrastructure in nf_tables. Patches from Phil Sutter. 6) Flush path uses opaque set element offered by the iterator, instead of calling pipapo_deactivate() which looks up for it again. 7) Set backend .flush interface always succeeds, make it return void instead. 8) Add struct nft_elem_priv placeholder structure and use it by replacing void * to pass opaque set element representation from backend to frontend which defeats compiler type checks. 9) Shrink memory consumption of set element transactions, by reducing struct nft_trans_elem object size and reducing stack memory usage. 10) Use struct nft_elem_priv also for set backend .insert operation too. 11) Carry reset flag in nft_set_dump_ctx structure, instead of passing it as a function argument, from Phil Sutter. netfilter pull request 23-10-25 * tag 'nf-next-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: Carry reset boolean in nft_set_dump_ctx netfilter: nf_tables: set->ops->insert returns opaque set element in case of EEXIST netfilter: nf_tables: shrink memory consumption of set elements netfilter: nf_tables: expose opaque set element as struct nft_elem_priv netfilter: nf_tables: set backend .flush always succeeds netfilter: nft_set_pipapo: no need to call pipapo_deactivate() from flush netfilter: nf_tables: Carry reset boolean in nft_obj_dump_ctx netfilter: nf_tables: nft_obj_filter fits into cb->ctx netfilter: nf_tables: Carry s_idx in nft_obj_dump_ctx netfilter: nf_tables: A better name for nft_obj_filter netfilter: nf_tables: Unconditionally allocate nft_obj_filter netfilter: nf_tables: Drop pointless memset in nf_tables_dump_obj netfilter: conntrack: switch connlabels to atomic_t br_netfilter: use single forward hook for ip and arp netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests netfilter: nf_tables: Introduce nf_tables_getrule_single() netfilter: nf_tables: Open-code audit log call in nf_tables_getrule() netfilter: nft_set_rbtree: prefer sync gc to async worker netfilter: nft_set_rbtree: rename gc deactivate+erase function ==================== Link: https://lore.kernel.org/r/20231025212555.132775-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-26Merge branch ↵Jakub Kicinski2-7/+22
'net-ipv6-addrconf-ensure-that-temporary-addresses-preferred-lifetimes-are-in-the-valid-range' Alex Henrie says: ==================== net: ipv6/addrconf: ensure that temporary addresses' preferred lifetimes are in the valid range No changes from v2, but there are only four patches now because the first patch has already been applied. https://lore.kernel.org/all/20230829054623.104293-1-alexhenrie24@gmail.com/ ==================== Link: https://lore.kernel.org/r/20231024212312.299370-1-alexhenrie24@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26Documentation: networking: explain what happens if temp_prefered_lft is too ↵Alex Henrie1-1/+5
small or too large Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231024212312.299370-5-alexhenrie24@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26Documentation: networking: explain what happens if temp_valid_lft is too smallAlex Henrie1-1/+3
Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231024212312.299370-4-alexhenrie24@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26net: ipv6/addrconf: clamp preferred_lft to the minimum requiredAlex Henrie1-5/+13
If the preferred lifetime was less than the minimum required lifetime, ipv6_create_tempaddr would error out without creating any new address. On my machine and network, this error happened immediately with the preferred lifetime set to 1 second, after a few minutes with the preferred lifetime set to 4 seconds, and not at all with the preferred lifetime set to 5 seconds. During my investigation, I found a Stack Exchange post from another person who seems to have had the same problem: They stopped getting new addresses if they lowered the preferred lifetime below 3 seconds, and they didn't really know why. The preferred lifetime is a preference, not a hard requirement. The kernel does not strictly forbid new connections on a deprecated address, nor does it guarantee that the address will be disposed of the instant its total valid lifetime expires. So rather than disable IPv6 privacy extensions altogether if the minimum required lifetime swells above the preferred lifetime, it is more in keeping with the user's intent to increase the temporary address's lifetime to the minimum necessary for the current network conditions. With these fixes, setting the preferred lifetime to 3 or 4 seconds "just works" because the extra fraction of a second is practically unnoticeable. It's even possible to reduce the time before deprecation to 1 or 2 seconds by also disabling duplicate address detection (setting /proc/sys/net/ipv6/conf/*/dad_transmits to 0). I realize that that is a pretty niche use case, but I know at least one person who would gladly sacrifice performance and convenience to be sure that they are getting the maximum possible level of privacy. Link: https://serverfault.com/a/1031168/310447 Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231024212312.299370-3-alexhenrie24@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26net: ipv6/addrconf: clamp preferred_lft to the maximum allowedAlex Henrie1-0/+1
Without this patch, there is nothing to stop the preferred lifetime of a temporary address from being greater than its valid lifetime. If that was the case, the valid lifetime was effectively ignored. Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231024212312.299370-2-alexhenrie24@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26Merge branch 'ipv6-avoid-atomic-fragment-on-gso-output'Jakub Kicinski9-53/+27
Yan Zhai says: ==================== ipv6: avoid atomic fragment on GSO output When the ipv6 stack output a GSO packet, if its gso_size is larger than dst MTU, then all segments would be fragmented. However, it is possible for a GSO packet to have a trailing segment with smaller actual size than both gso_size as well as the MTU, which leads to an "atomic fragment". Atomic fragments are considered harmful in RFC-8021. An Existing report from APNIC also shows that atomic fragments are more likely to be dropped even it is equivalent to a no-op [1]. The series contains following changes: * drop feature RTAX_FEATURE_ALLFRAG, which has been broken. This helps simplifying other changes in this set. * refactor __ip6_finish_output code to separate GSO and non-GSO packet processing, mirroring IPv4 side logic. * avoid generating atomic fragment on GSO packets. Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1] V4: https://lore.kernel.org/netdev/cover.1698114636.git.yan@cloudflare.com/ V3: https://lore.kernel.org/netdev/cover.1697779681.git.yan@cloudflare.com/ V2: https://lore.kernel.org/netdev/ZS1%2Fqtr0dZJ35VII@debian.debian/ ==================== Link: https://lore.kernel.org/r/cover.1698156966.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26ipv6: avoid atomic fragment on GSO packetsYan Zhai1-1/+7
When the ipv6 stack output a GSO packet, if its gso_size is larger than dst MTU, then all segments would be fragmented. However, it is possible for a GSO packet to have a trailing segment with smaller actual size than both gso_size as well as the MTU, which leads to an "atomic fragment". Atomic fragments are considered harmful in RFC-8021. An Existing report from APNIC also shows that atomic fragments are more likely to be dropped even it is equivalent to a no-op [1]. Add an extra check in the GSO slow output path. For each segment from the original over-sized packet, if it fits with the path MTU, then avoid generating an atomic fragment. Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1] Fixes: b210de4f8c97 ("net: ipv6: Validate GSO SKB before finish IPv6 processing") Reported-by: David Wragg <dwragg@cloudflare.com> Signed-off-by: Yan Zhai <yan@cloudflare.com> Link: https://lore.kernel.org/r/90912e3503a242dca0bc36958b11ed03a2696e5e.1698156966.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26ipv6: refactor ip6_finish_output for GSO handlingYan Zhai1-7/+15
Separate GSO and non-GSO packets handling to make the logic cleaner. For GSO packets, frag_max_size check can be omitted because it is only useful for packets defragmented by netfilter hooks. Both local output and GRO logic won't produce GSO packets when defragment is needed. This also mirrors what IPv4 side code is doing. Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Yan Zhai <yan@cloudflare.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/0e1d4599f858e2becff5c4fe0b5f843236bc3fe8.1698156966.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26ipv6: drop feature RTAX_FEATURE_ALLFRAGYan Zhai9-45/+5
RTAX_FEATURE_ALLFRAG was added before the first git commit: https://www.mail-archive.com/bk-commits-head@vger.kernel.org/msg03399.html The feature would send packets to the fragmentation path if a box receives a PMTU value with less than 1280 byte. However, since commit 9d289715eb5c ("ipv6: stop sending PTB packets for MTU < 1280"), such message would be simply discarded. The feature flag is neither supported in iproute2 utility. In theory one can still manipulate it with direct netlink message, but it is not ideal because it was based on obsoleted guidance of RFC-2460 (replaced by RFC-8200). The feature would always test false at the moment, so remove related code or mark them as unused. Signed-off-by: Yan Zhai <yan@cloudflare.com> Reviewed-by: Florian Westphal <fw@strlen.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/d78e44dcd9968a252143ffe78460446476a472a1.1698156966.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26iavf: in iavf_down, disable queues when removing the driverMichal Schmidt1-1/+1
In iavf_down, we're skipping the scheduling of certain operations if the driver is being removed. However, the IAVF_FLAG_AQ_DISABLE_QUEUES request must not be skipped in this case, because iavf_close waits for the transition to the __IAVF_DOWN state, which happens in iavf_virtchnl_completion after the queues are released. Without this fix, "rmmod iavf" takes half a second per interface that's up and prints the "Device resources not yet released" warning. Fixes: c8de44b577eb ("iavf: do not process adminq tasks when __IAVF_IN_REMOVE_TASK is set") Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Tested-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231025183213.874283-1-jacob.e.keller@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26Merge tag 'nf-23-10-25' of ↵Jakub Kicinski3-7/+17
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net This patch contains two late Netfilter's flowtable fixes for net: 1) Flowtable GC pushes back packets to classic path in every GC run, ie. every second. This is because NF_FLOW_HW_ESTABLISHED is only used by sched/act_ct (never set) and IPS_SEEN_REPLY might be unset by the time the flow is offloaded (this status bit is only reliable in the sched/act_ct datapath). 2) sched/act_ct logic to push back packets to classic path to reevaluate if UDP flow is unidirectional only applies if IPS_HW_OFFLOAD_BIT is set on and no hardware offload request is pending to be handled. From Vlad Buslov. These two patches fixes two problems that were introduced in the previous 6.5 development cycle. * tag 'nf-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: net/sched: act_ct: additional checks for outdated flows netfilter: flowtable: GC pushes back packets to classic path ==================== Link: https://lore.kernel.org/r/20231025100819.2664-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26vsock/virtio: initialize the_virtio_vsock before using VQsAlexandru Matei1-1/+17
Once VQs are filled with empty buffers and we kick the host, it can send connection requests. If the_virtio_vsock is not initialized before, replies are silently dropped and do not reach the host. virtio_transport_send_pkt() can queue packets once the_virtio_vsock is set, but they won't be processed until vsock->tx_run is set to true. We queue vsock->send_pkt_work when initialization finishes to send those packets queued earlier. Fixes: 0deab087b16a ("vsock/virtio: use RCU to avoid use-after-free on the_virtio_vsock") Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20231024191742.14259-1-alexandru.matei@uipath.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25Merge branch 'mptcp-features-and-fixes-for-v6-7'Jakub Kicinski8-77/+224
Mat Martineau says: ==================== mptcp: Features and fixes for v6.7 Patch 1 adds a configurable timeout for the MPTCP connection when all subflows are closed, to support break-before-make use cases. Patch 2 is a fix for a 1-byte error in rx data counters with MPTCP fastopen connections. Patch 3 is a minor code cleanup. Patches 4 & 5 add handling of rcvlowat for MPTCP sockets, with a prerequisite patch to use a common scaling ratio between TCP and MPTCP. Patch 6 improves efficiency of memory copying in MPTCP transmit code. Patch 7 refactors syncing of socket options from the MPTCP socket to its subflows. Patches 8 & 9 help the MPTCP packet scheduler perform well by changing the handling of notsent_lowat in subflows and how available buffer space is calculated for MPTCP-level sends. ==================== Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-0-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: refactor sndbuf auto-tuningPaolo Abeni4-10/+70
The MPTCP protocol account for the data enqueued on all the subflows to the main socket send buffer, while the send buffer auto-tuning algorithm set the main socket send buffer size as the max size among the subflows. That causes bad performances when at least one subflow is sndbuf limited, e.g. due to very high latency, as the MPTCP scheduler can't even fill such buffer. Change the send-buffer auto-tuning algorithm to compute the main socket send buffer size as the sum of all the subflows buffer size. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-9-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: ignore notsent_lowat setting at the subflow levelPaolo Abeni1-0/+6
Any latency related tuning taking action at the subflow level does not really affect the user-space, as only the main MPTCP socket is relevant. Anyway any limiting setting may foul the MPTCP scheduler, not being able to fully use the subflow-level cwin, leading to very poor b/w usage. Enforce notsent_lowat to be a no-op on every subflow. Note that TCP_NOTSENT_LOWAT is currently not supported, and properly dealing with that will require more invasive changes. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-8-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: consolidate sockopt synchronizationPaolo Abeni3-33/+9
Move the socket option synchronization for active subflows at subflow creation time. This allows removing the now unused unlocked variant of such helper. While at that, clean-up a bit the mptcp_subflow_create_socket() errors path. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-7-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: use copy_from_iter helpers on transmitPaolo Abeni1-4/+15
The perf traces show an high cost for the MPTCP transmit path memcpy. It turn out that the helper currently in use carries quite a bit of unneeded overhead, e.g. to map/unmap the memory pages. Moving to the 'copy_from_iter' variant removes such overhead and additionally gains the no-cache support. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-6-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: give rcvlowat some lovePaolo Abeni4-15/+83
The MPTCP protocol allow setting sk_rcvlowat, but the value there is currently ignored. Additionally, the default subflows sk_rcvlowat basically disables per subflow delayed ack: the MPTCP protocol move the incoming data from the subflows into the msk socket as soon as the TCP stacks invokes the subflow data_ready callback. Later, when __tcp_ack_snd_check() takes action, the subflow-level copied_seq matches rcv_nxt, and that mandate for an immediate ack. Let the mptcp receive path be aware of such threshold, explicitly tracking the amount of data available to be ready and checking vs sk_rcvlowat in mptcp_poll() and before waking-up readers. Additionally implement the set_rcvlowat() callback, to properly handle the rcvbuf auto-tuning on sk_rcvlowat changes. Finally to properly handle delayed ack, force the subflow level threshold to 0 and instead explicitly ask for an immediate ack when the msk level th is not reached. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-5-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25tcp: define initial scaling factor value as a macroPaolo Abeni1-5/+7
So that other users could access it. Notably MPTCP will use it in the next patch. No functional change intended. Acked-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-4-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: use plain bool instead of custom binary enumPaolo Abeni2-12/+7
The 'data_avail' subflow field is already used as plain boolean, drop the custom binary enum type and switch to bool. No functional changed intended. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-3-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: properly account fastopen dataPaolo Abeni1-0/+1
Currently the socket level counter aggregating the received data does not take in account the data received via fastopen. Address the issue updating the counter as required. Fixes: 38967f424b5b ("mptcp: track some aggregate data counters") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-2-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>