summaryrefslogtreecommitdiff
path: root/fs/dlm
AgeCommit message (Collapse)AuthorFilesLines
2023-06-29Merge tag 'dlm-6.5' of ↵Linus Torvalds13-222/+203
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "The dlm posix lock handling (for gfs2) has three notable changes: - Local pids returned from GETLK are no longer negated. A previous patch negating remote pids mistakenly changed local pids also. - SETLKW operations can now be interrupted only when the process is killed, and not from other signals. General interruption was resulting in previously acquired locks being cleared, not just the in-progress lock. Handling this correctly will require extending a cancel capability to user space (a future feature.) - If multiple threads are requesting posix locks (with SETLKW), fix incorrect matching of results to the requests. The dlm networking has several minor cleanups, and one notable change: - Avoid delaying ack messages for too long (used for message reliability), resulting in a backlog of un-acked messages. These could previously be delayed as a result of either too many or too few other messages being sent. Now an upper and lower threshold is used to determine when an ack should be sent" * tag 'dlm-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: fs: dlm: remove filter local comms on close fs: dlm: add send ack threshold and append acks to msgs fs: dlm: handle sequence numbers as atomic fs: dlm: handle lkb wait count as atomic_t fs: dlm: filter ourself midcomms calls fs: dlm: warn about messages from left nodes fs: dlm: move dlm_purge_lkb_callbacks to user module fs: dlm: cleanup STOP_IO bitflag set when stop io fs: dlm: don't check othercon twice fs: dlm: unregister memory at the very last fs: dlm: fix missing pending to false fs: dlm: clear pending bit when queue was empty fs: dlm: revert check required context while close fs: dlm: fix mismatch of plock results from userspace fs: dlm: make F_SETLK use unkillable wait_event fs: dlm: interrupt posix locks only when process is killed fs: dlm: fix cleanup pending ops when interrupted fs: dlm: return positive pid value for F_GETLK dlm: Replace all non-returning strlcpy with strscpy
2023-06-29Merge tag 'net-next-6.5' of ↵Linus Torvalds1-3/+7
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking changes from Jakub Kicinski: "WiFi 7 and sendpage changes are the biggest pieces of work for this release. The latter will definitely require fixes but I think that we got it to a reasonable point. Core: - Rework the sendpage & splice implementations Instead of feeding data into sockets page by page extend sendmsg handlers to support taking a reference on the data, controlled by a new flag called MSG_SPLICE_PAGES Rework the handling of unexpected-end-of-file to invoke an additional callback instead of trying to predict what the right combination of MORE/NOTLAST flags is Remove the MSG_SENDPAGE_NOTLAST flag completely - Implement SCM_PIDFD, a new type of CMSG type analogous to SCM_CREDENTIALS, but it contains pidfd instead of plain pid - Enable socket busy polling with CONFIG_RT - Improve reliability and efficiency of reporting for ref_tracker - Auto-generate a user space C library for various Netlink families Protocols: - Allow TCP to shrink the advertised window when necessary, prevent sk_rcvbuf auto-tuning from growing the window all the way up to tcp_rmem[2] - Use per-VMA locking for "page-flipping" TCP receive zerocopy - Prepare TCP for device-to-device data transfers, by making sure that payloads are always attached to skbs as page frags - Make the backoff time for the first N TCP SYN retransmissions linear. Exponential backoff is unnecessarily conservative - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO) - Avoid waking up applications using TLS sockets until we have a full record - Allow using kernel memory for protocol ioctl callbacks, paving the way to issuing ioctls over io_uring - Add nolocalbypass option to VxLAN, forcing packets to be fully encapsulated even if they are destined for a local IP address - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure in-kernel ECMP implementation (e.g. Open vSwitch) select the same link for all packets. Support L4 symmetric hashing in Open vSwitch - PPPoE: make number of hash bits configurable - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client (ipconfig) - Add layer 2 miss indication and filtering, allowing higher layers (e.g. ACL filters) to make forwarding decisions based on whether packet matched forwarding state in lower devices (bridge) - Support matching on Connectivity Fault Management (CFM) packets - Hide the "link becomes ready" IPv6 messages by demoting their printk level to debug - HSR: don't enable promiscuous mode if device offloads the proto - Support active scanning in IEEE 802.15.4 - Continue work on Multi-Link Operation for WiFi 7 BPF: - Add precision propagation for subprogs and callbacks. This allows maintaining verification efficiency when subprograms are used, or in fact passing the verifier at all for complex programs, especially those using open-coded iterators - Improve BPF's {g,s}setsockopt() length handling. Previously BPF assumed the length is always equal to the amount of written data. But some protos allow passing a NULL buffer to discover what the output buffer *should* be, without writing anything - Accept dynptr memory as memory arguments passed to helpers - Add routing table ID to bpf_fib_lookup BPF helper - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark maps as read-only) - Show target_{obj,btf}_id in tracing link fdinfo - Addition of several new kfuncs (most of the names are self-explanatory): - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(), bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size() and bpf_dynptr_clone(). - bpf_task_under_cgroup() - bpf_sock_destroy() - force closing sockets - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs Netfilter: - Relax set/map validation checks in nf_tables. Allow checking presence of an entry in a map without using the value - Increase ip_vs_conn_tab_bits range for 64BIT builds - Allow updating size of a set - Improve NAT tuple selection when connection is closing Driver API: - Integrate netdev with LED subsystem, to allow configuring HW "offloaded" blinking of LEDs based on link state and activity (i.e. packets coming in and out) - Support configuring rate selection pins of SFP modules - Factor Clause 73 auto-negotiation code out of the drivers, provide common helper routines - Add more fool-proof helpers for managing lifetime of MDIO devices associated with the PCS layer - Allow drivers to report advanced statistics related to Time Aware scheduler offload (taprio) - Allow opting out of VF statistics in link dump, to allow more VFs to fit into the message - Split devlink instance and devlink port operations New hardware / drivers: - Ethernet: - Synopsys EMAC4 IP support (stmmac) - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches - Marvell 88E6250 7 port switches - Microchip LAN8650/1 Rev.B0 PHYs - MediaTek MT7981/MT7988 built-in 1GE PHY driver - WiFi: - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps - Realtek RTL8723DS (SDIO variant) - Realtek RTL8851BE - CAN: - Fintek F81604 Drivers: - Ethernet NICs: - Intel (100G, ice): - support dynamic interrupt allocation - use meta data match instead of VF MAC addr on slow-path - nVidia/Mellanox: - extend link aggregation to handle 4, rather than just 2 ports - spawn sub-functions without any features by default - OcteonTX2: - support HTB (Tx scheduling/QoS) offload - make RSS hash generation configurable - support selecting Rx queue using TC filters - Wangxun (ngbe/txgbe): - add basic Tx/Rx packet offloads - add phylink support (SFP/PCS control) - Freescale/NXP (enetc): - report TAPRIO packet statistics - Solarflare/AMD: - support matching on IP ToS and UDP source port of outer header - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6 - add devlink dev info support for EF10 - Virtual NICs: - Microsoft vNIC: - size the Rx indirection table based on requested configuration - support VLAN tagging - Amazon vNIC: - try to reuse Rx buffers if not fully consumed, useful for ARM servers running with 16kB pages - Google vNIC: - support TCP segmentation of >64kB frames - Ethernet embedded switches: - Marvell (mv88e6xxx): - enable USXGMII (88E6191X) - Microchip: - lan966x: add support for Egress Stage 0 ACL engine - lan966x: support mapping packet priority to internal switch priority (based on PCP or DSCP) - Ethernet PHYs: - Broadcom PHYs: - support for Wake-on-LAN for BCM54210E/B50212E - report LPI counter - Microsemi PHYs: support RGMII delay configuration (VSC85xx) - Micrel PHYs: receive timestamp in the frame (LAN8841) - Realtek PHYs: support optional external PHY clock - Altera TSE PCS: merge the driver into Lynx PCS which it is a variant of - CAN: Kvaser PCIEcan: - support packet timestamping - WiFi: - Intel (iwlwifi): - major update for new firmware and Multi-Link Operation (MLO) - configuration rework to drop test devices and split the different families - support for segmented PNVM images and power tables - new vendor entries for PPAG (platform antenna gain) feature - Qualcomm 802.11ax (ath11k): - Multiple Basic Service Set Identifier (MBSSID) and Enhanced MBSSID Advertisement (EMA) support in AP mode - support factory test mode - RealTek (rtw89): - add RSSI based antenna diversity - support U-NII-4 channels on 5 GHz band - RealTek (rtl8xxxu): - AP mode support for 8188f - support USB RX aggregation for the newer chips" * tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits) net: scm: introduce and use scm_recv_unix helper af_unix: Skip SCM_PIDFD if scm->pid is NULL. net: lan743x: Simplify comparison netlink: Add __sock_i_ino() for __netlink_diag_dump(). net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses Revert "af_unix: Call scm_recv() only after scm_set_cred()." phylink: ReST-ify the phylink_pcs_neg_mode() kdoc libceph: Partially revert changes to support MSG_SPLICE_PAGES net: phy: mscc: fix packet loss due to RGMII delays net: mana: use vmalloc_array and vcalloc net: enetc: use vmalloc_array and vcalloc ionic: use vmalloc_array and vcalloc pds_core: use vmalloc_array and vcalloc gve: use vmalloc_array and vcalloc octeon_ep: use vmalloc_array and vcalloc net: usb: qmi_wwan: add u-blox 0x1312 composition perf trace: fix MSG_SPLICE_PAGES build error ipvlan: Fix return value of ipvlan_queue_xmit() netfilter: nf_tables: fix underflow in chain reference counter netfilter: nf_tables: unbind non-anonymous set if rule construction fails ...
2023-06-25dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpageDavid Howells1-3/+7
When transmitting data, call down a layer using a single sendmsg with MSG_SPLICE_PAGES to indicate that content should be spliced rather using sendpage. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells <dhowells@redhat.com> cc: Christine Caulfield <ccaulfie@redhat.com> cc: David Teigland <teigland@redhat.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: cluster-devel@redhat.com Link: https://lore.kernel.org/r/20230623225513.2732256-7-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20fs: dlm: remove filter local comms on closeAlexander Aring1-2/+1
The current way how lowcomms is configured is due configfs entries. Each comms configfs entry will create a lowcomms connection. Even the local connection itself will be stored as a lowcomms connection, although most functionality for a local lowcomms connection struct is not necessary. Now in some scenarios we will see that dlm_controld reports a -EEXIST when configure a node via configfs: ... /sys/kernel/config/dlm/cluster/comms/1/addr: write failed: 17 -1 Doing a: cat /sys/kernel/config/dlm/cluster/comms/1/addr_list reported nothing. This was being seen on cluster with nodeid 1 and it's local configuration. To be sure the configfs entries are in sync with lowcomms connection structures we always call dlm_midcomms_close() to be sure the lowcomms connection gets removed when the configfs entry gets dropped. Before commit 07ee38674a0b ("fs: dlm: filter ourself midcomms calls") it was just doing this by accident and the filter by doing: if (nodeid == dlm_our_nodeid()) return 0; inside dlm_midcomms_close() was never been hit because drop_comm() sets local_comm to NULL and cause that dlm_our_nodeid() returns always the invalid nodeid 0. Fixes: 07ee38674a0b ("fs: dlm: filter ourself midcomms calls") Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: add send ack threshold and append acks to msgsAlexander Aring2-75/+31
This patch changes the time when we sending an ack back to tell the other side it can free some message because it is arrived on the receiver node, due random reconnects e.g. TCP resets this is handled as well on application layer to not let DLM run into a deadlock state. The current handling has the following problems: 1. We end in situations that we only send an ack back message of 16 bytes out and no other messages. Whereas DLM has logic to combine so much messages as it can in one send() socket call. This behaviour can be discovered by "trace-cmd start -e dlm_recv" and observing the ret field being 16 bytes. 2. When processing of DLM messages will never end because we receive a lot of messages, we will not send an ack back as it happens when the processing loop ends. This patch introduces a likely and unlikely threshold case. The likely case will send an ack back on a transmit path if the threshold is triggered of amount of processed upper layer protocol. This will solve issue 1 because it will be send when another normal DLM message will be sent. It solves issue 2 because it is not part of the processing loop. There is however a unlikely case, the unlikely case has a bigger threshold and will be triggered when we only receive messages and do not sent any message back. This case avoids that the sending node will keep a lot of message for a long time as we send sometimes ack backs to tell the sender to finally release messages. The atomic cmpxchg() is there to provide a atomically ack send with reset of the upper layer protocol delivery counter. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: handle sequence numbers as atomicAlexander Aring1-15/+25
Currently seq_next is only be read on the receive side which processed in an ordered way. The seq_send is being protected by locks. To being able to read the seq_next value on send side as well we convert it to an atomic_t value. The atomic_cmpxchg() is probably not necessary, however the atomic_inc() depends on a if coniditional and this should be handled in an atomic context. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: handle lkb wait count as atomic_tAlexander Aring2-19/+15
Currently the lkb_wait_count is locked by the rsb lock and it should be fine to handle lkb_wait_count as non atomic_t value. However for the overall process of reducing locking this patch converts it to an atomic_t value. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: filter ourself midcomms callsAlexander Aring4-20/+32
It makes no sense to call midcomms/lowcomms functionality for the local node as socket functionality is only required for remote nodes. This patch filters those calls in the upper layer of lockspace membership handling instead of doing it in midcomms/lowcomms layer as they should never be aware of local nodeid. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: warn about messages from left nodesAlexander Aring1-2/+2
This patch warns about messages which are received from nodes who already left the lockspace resource signaled by the cluster manager. Before commit 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") there was a synchronization issue with the socket lifetime and the cluster event of leaving a lockspace and other nodes did not stop of sending messages because the cluster manager has a pending message to leave the lockspace. The reliable session layer for dlm use sequence numbers to ensure dlm message were never being dropped. If this is not corrected synchronized we have a problem, this patch will use the filter case and turn it into a WARN_ON_ONCE() so we seeing such issue on the kernel log because it should never happen now. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: move dlm_purge_lkb_callbacks to user moduleAlexander Aring4-18/+19
This patch moves the dlm_purge_lkb_callbacks() function from ast to user dlm module as it is only a function being used by dlm user implementation. I got be hinted to hold specific locks regarding the callback handling for dlm_purge_lkb_callbacks() but it was false positive. It is confusing because ast dlm implementation uses a different locking behaviour as user locks uses as DLM handles kernel and user dlm locks differently. To avoid the confusing we move this function to dlm user implementation. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: cleanup STOP_IO bitflag set when stop ioAlexander Aring1-8/+4
There should no difference between setting the CF_IO_STOP flag before restore_callbacks() to do it before or afterwards. The restore_callbacks() will be sure that no callback is executed anymore when the bit wasn't set. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: don't check othercon twiceAlexander Aring1-2/+1
This patch removes an another check if con->othercon set inside the branch which already does that. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: unregister memory at the very lastAlexander Aring1-1/+1
The dlm modules midcomms, debugfs, lockspace, uses kmem caches. We ensure that the kmem caches getting deallocated after those modules exited. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: fix missing pending to falseAlexander Aring1-0/+1
This patch sets the process_dlm_messages_pending boolean to false when there was no message to process. It is a case which should not happen but if we are prepared to recover from this situation by setting pending boolean to false. Cc: stable@vger.kernel.org Fixes: dbb751ffab0b ("fs: dlm: parallelize lowcomms socket handling") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: clear pending bit when queue was emptyAlexander Aring1-3/+5
This patch clears the DLM_IFL_CB_PENDING_BIT flag which will be set when there is callback work queued when there was no callback to dequeue. It is a buggy case and should never happen, that's why there is a WARN_ON(). However if the case happens we are prepared to somehow recover from it. Cc: stable@vger.kernel.org Fixes: 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: revert check required context while closeAlexander Aring3-16/+0
This patch reverts commit 2c3fa6ae4d52 ("dlm: check required context while close"). The function dlm_midcomms_close(), which will call later dlm_lowcomms_close(), is called when the cluster manager tells the node got fenced which means on midcomms/lowcomms layer to disconnect the node from the cluster communication. The node can rejoin the cluster later. This patch was ensuring no new message were able to be triggered when we are in the close() function context. This was done by checking if the lockspace has been stopped. However there is a missing check that we only need to check specific lockspaces where the fenced node is member of. This is currently complicated because there is no way to easily check if a node is part of a specific lockspace without stopping the recovery. For now we just revert this commit as it is just a check to finding possible leaks of stopping lockspaces before close() is called. Cc: stable@vger.kernel.org Fixes: 2c3fa6ae4d52 ("dlm: check required context while close") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-24fs: dlm: fix mismatch of plock results from userspaceAlexander Aring1-13/+45
When a waiting plock request (F_SETLKW) is sent to userspace for processing (dlm_controld), the result is returned at a later time. That result could be incorrectly matched to a different waiting request in cases where the owner field is the same (e.g. different threads in a process.) This is fixed by comparing all the properties in the request and reply. The results for non-waiting plock requests are now matched based on list order because the results are returned in the same order they were sent. Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-23fs: dlm: make F_SETLK use unkillable wait_eventAlexander Aring1-17/+21
While a non-waiting posix lock request (F_SETLK) is waiting for user space processing (in dlm_controld), wait for that processing to complete with an unkillable wait_event(). This makes F_SETLK behave the same way for F_RDLCK, F_WRLCK and F_UNLCK. F_SETLKW continues to use wait_event_killable(). Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22fs: dlm: interrupt posix locks only when process is killedAlexander Aring1-1/+1
If a posix lock request is waiting for a result from user space (dlm_controld), do not let it be interrupted unless the process is killed. This reverts commit a6b1533e9a57 ("dlm: make posix locks interruptible"). The problem with the interruptible change is that all locks were cleared on any signal interrupt. If a signal was received that did not terminate the process, the process could continue running after all its dlm posix locks had been cleared. A future patch will add cancelation to allow proper interruption. Cc: stable@vger.kernel.org Fixes: a6b1533e9a57 ("dlm: make posix locks interruptible") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22dlm: Replace all non-returning strlcpy with strscpyAzeem Shaikh1-2/+2
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230510221237.3509484-1-azeemshaikh38@gmail.com
2023-05-22fs: dlm: fix cleanup pending ops when interruptedAlexander Aring1-19/+6
Immediately clean up a posix lock request if it is interrupted while waiting for a result from user space (dlm_controld.) This largely reverts the recent commit b92a4e3f86b1 ("fs: dlm: change posix lock sigint handling"). That previous commit attempted to defer lock cleanup to the point in time when a result from user space arrived. The deferred approach was not reliable because some dlm plock ops may not receive replies. Cc: stable@vger.kernel.org Fixes: b92a4e3f86b1 ("fs: dlm: change posix lock sigint handling") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22fs: dlm: return positive pid value for F_GETLKAlexander Aring1-1/+3
The GETLK pid values have all been negated since commit 9d5b86ac13c5 ("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks"). Revert this for local pids, and leave in place negative pids for remote owners. Cc: stable@vger.kernel.org Fixes: 9d5b86ac13c5 ("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22dlm: Replace all non-returning strlcpy with strscpyAzeem Shaikh1-2/+2
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-04-27Merge tag 'net-next-6.4' of ↵Linus Torvalds1-3/+4
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the default value allows for better BIG TCP performances - Reduce compound page head access for zero-copy data transfers - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible - Threaded NAPI improvements, adding defer skb free support and unneeded softirq avoidance - Address dst_entry reference count scalability issues, via false sharing avoidance and optimize refcount tracking - Add lockless accesses annotation to sk_err[_soft] - Optimize again the skb struct layout - Extends the skb drop reasons to make it usable by multiple subsystems - Better const qualifier awareness for socket casts BPF: - Add skb and XDP typed dynptrs which allow BPF programs for more ergonomic and less brittle iteration through data and variable-sized accesses - Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward - Add more precise memory usage reporting for all BPF map types - Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params - Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton - Bigger batch of BPF verifier improvements to prepare for upcoming BPF open-coded iterators allowing for less restrictive looping capabilities - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF programs to NULL-check before passing such pointers into kfunc - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in local storage maps - Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps - Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree - Add BPF verifier support for ST instructions in convert_ctx_access() which will help new -mcpu=v4 clang flag to start emitting them - Add ARM32 USDT support to libbpf - Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations Protocols: - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value indicates the provenance of the IP address - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition - Add the handshake upcall mechanism, allowing the user-space to implement generic TLS handshake on kernel's behalf - Bridge: support per-{Port, VLAN} neighbor suppression, increasing resilience to nodes failures - SCTP: add support for Fair Capacity and Weighted Fair Queueing schedulers - MPTCP: delay first subflow allocation up to its first usage. This will allow for later better LSM interaction - xfrm: Remove inner/outer modes from input/output path. These are not needed anymore - WiFi: - reduced neighbor report (RNR) handling for AP mode - HW timestamping support - support for randomized auth/deauth TA for PASN privacy - per-link debugfs for multi-link - TC offload support for mac80211 drivers - mac80211 mesh fast-xmit and fast-rx support - enable Wi-Fi 7 (EHT) mesh support Netfilter: - Add nf_tables 'brouting' support, to force a packet to be routed instead of being bridged - Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header. This is needed for BIT TCP support - The iptables 32bit compat interface isn't compiled in by default anymore - Move ip(6)tables builtin icmp matches to the udptcp one. This has the advantage that icmp/icmpv6 match doesn't load the iptables/ip6tables modules anymore when iptables-nft is used - Extended netlink error report for netdevice in flowtables and netdev/chains. Allow for incrementally add/delete devices to netdev basechain. Allow to create netdev chain without device Driver API: - Remove redundant Device Control Error Reporting Enable, as PCI core has already error reporting enabled at enumeration time - Move Multicast DB netlink handlers to core, allowing devices other then bridge to use them - Allow the page_pool to directly recycle the pages from safely localized NAPI - Implement lockless TX queue stop/wake combo macros, allowing for further code de-duplication and sanitization - Add YNL support for user headers and struct attrs - Add partial YNL specification for devlink - Add partial YNL specification for ethtool - Add tc-mqprio and tc-taprio support for preemptible traffic classes - Add tx push buf len param to ethtool, specifies the maximum number of bytes of a transmitted packet a driver can push directly to the underlying device - Add basic LED support for switch/phy - Add NAPI documentation, stop relaying on external links - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory work to make the hardware timestamping layer selectable by user space - Add transceiver support and improve the error messages for CAN-FD controllers New hardware / drivers: - Ethernet: - AMD/Pensando core device support - MediaTek MT7981 SoC - MediaTek MT7988 SoC - Broadcom BCM53134 embedded switch - Texas Instruments CPSW9G ethernet switch - Qualcomm EMAC3 DWMAC ethernet - StarFive JH7110 SoC - NXP CBTX ethernet PHY - WiFi: - Apple M1 Pro/Max devices - RealTek rtl8710bu/rtl8188gu - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset - Bluetooth: - Realtek RTL8821CS, RTL8851B, RTL8852BS - Mediatek MT7663, MT7922 - NXP w8997 - Actions Semi ATS2851 - QTI WCN6855 - Marvell 88W8997 - Can: - STMicroelectronics bxcan stm32f429 Drivers: - Ethernet NICs: - Intel (1G, icg): - add tracking and reporting of QBV config errors - add support for configuring max SDU for each Tx queue - Intel (100G, ice): - refactor mailbox overflow detection to support Scalable IOV - GNSS interface optimization - Intel (i40e): - support XDP multi-buffer - nVidia/Mellanox: - add the support for linux bridge multicast offload - enable TC offload for egress and engress MACVLAN over bond - add support for VxLAN GBP encap/decap flows offload - extend packet offload to fully support libreswan - support tunnel mode in mlx5 IPsec packet offload - extend XDP multi-buffer support - support MACsec VLAN offload - add support for dynamic msix vectors allocation - drop RX page_cache and fully use page_pool - implement thermal zone to report NIC temperature - Netronome/Corigine: - add support for multi-zone conntrack offload - Solarflare/Xilinx: - support offloading TC VLAN push/pop actions to the MAE - support TC decap rules - support unicast PTP - Other NICs: - Broadcom (bnxt): enforce software based freq adjustments only on shared PHC NIC - RealTek (r8169): refactor to addess ASPM issues during NAPI poll - Micrel (lan8841): add support for PTP_PF_PEROUT - Cadence (macb): enable PTP unicast - Engleder (tsnep): add XDP socket zero-copy support - virtio-net: implement exact header length guest feature - veth: add page_pool support for page recycling - vxlan: add MDB data path support - gve: add XDP support for GQI-QPL format - geneve: accept every ethertype - macvlan: allow some packets to bypass broadcast queue - mana: add support for jumbo frame - Ethernet high-speed switches: - Microchip (sparx5): Add support for TC flower templates - Ethernet embedded switches: - Broadcom (b54): - configure 6318 and 63268 RGMII ports - Marvell (mv88e6xxx): - faster C45 bus scan - Microchip: - lan966x: - add support for IS1 VCAP - better TX/RX from/to CPU performances - ksz9477: add ETS Qdisc support - ksz8: enhance static MAC table operations and error handling - sama7g5: add PTP capability - NXP (ocelot): - add support for external ports - add support for preemptible traffic classes - Texas Instruments: - add CPSWxG SGMII support for J7200 and J721E - Intel WiFi (iwlwifi): - preparation for Wi-Fi 7 EHT and multi-link support - EHT (Wi-Fi 7) sniffer support - hardware timestamping support for some devices/firwmares - TX beacon protection on newer hardware - Qualcomm 802.11ax WiFi (ath11k): - MU-MIMO parameters support - ack signal support for management packets - RealTek WiFi (rtw88): - SDIO bus support - better support for some SDIO devices (e.g. MAC address from efuse) - RealTek WiFi (rtw89): - HW scan support for 8852b - better support for 6 GHz scanning - support for various newer firmware APIs - framework firmware backwards compatibility - MediaTek WiFi (mt76): - P2P support - mesh A-MSDU support - EHT (Wi-Fi 7) support - coredump support" * tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits) net: phy: hide the PHYLIB_LEDS knob net: phy: marvell-88x2222: remove unnecessary (void*) conversions tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. net: amd: Fix link leak when verifying config failed net: phy: marvell: Fix inconsistent indenting in led_blink_set lan966x: Don't use xdp_frame when action is XDP_TX tsnep: Add XDP socket zero-copy TX support tsnep: Add XDP socket zero-copy RX support tsnep: Move skb receive action to separate function tsnep: Add functions for queue enable/disable tsnep: Rework TX/RX queue initialization tsnep: Replace modulo operation with mask net: phy: dp83867: Add led_brightness_set support net: phy: Fix reading LED reg property drivers: nfc: nfcsim: remove return value check of `dev_dir` net: phy: dp83867: Remove unnecessary (void*) conversions net: ethtool: coalesce: try to make user settings stick twice net: mana: Check if netdev/napi_alloc_frag returns single page net: mana: Rename mana_refill_rxoob and remove some empty lines net: veth: add page_pool stats ...
2023-04-21fs: dlm: stop unnecessarily filling zero ms_extra bytesAlexander Aring1-1/+1
Commit 7175e131ebba ("fs: dlm: fix invalid derefence of sb_lvbptr") fixes an issue when the lkb->lkb_lvbptr set to an dangled pointer and an followed memcpy() would fail. It was fixed by an additional check of DLM_LKF_VALBLK flag. The mentioned commit forgot to add an additional check if DLM_LKF_VALBLK is set for the additional amount of LVB data allocated in a dlm message. This patch is changing the message allocation to check additionally if DLM_LKF_VALBLK is set otherwise a dangled lkb->lkb_lvbptr pointer would allocated zero LVB message data which not gets filled with actual data. This patch is however only a cleanup to reduce the amount of zero bytes transmitted over network as receive_lvb() will only evaluates message LVB data if DLM_LKF_VALBLK is set. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-17net: annotate lockless accesses to sk->sk_err_softEric Dumazet1-3/+4
This field can be read/written without lock synchronization. tcp and dccp have been handled in different patches. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-07fs: dlm: switch lkb_sbflags to atomic opsAlexander Aring2-12/+38
This patch moves lkb_sbflags handling to atomic bits ops. This should prepare for a possible manipulating of lkb_sbflags flags at the same time by concurrent execution. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-07fs: dlm: rsb hash table flag value to atomic opsAlexander Aring2-6/+6
This patch moves the rsb hash table handling to atomic flag operations. The flag operations for DLM_RTF_SHRINK are protected by ls->ls_rsbtbl[b].lock. However we switch to atomic ops if new possible flags will be used in a different way and don't assume such lock dependencies. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-07fs: dlm: move internal flags to atomic opsAlexander Aring6-83/+90
This patch will move the lkb_flags value to the recently introduced lkb_iflags value. For lkb_iflags we use atomic bit operations because some flags like DLM_IFL_CB_PENDING are used while non rsb lock is held to avoid issues with other flag manipulations which might run at the same time we switch to atomic bit operations. Snapshot the bit values to an uint32_t value is only used for debugging/logging use cases and don't need to be 100% correct. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-07fs: dlm: change dflags to use atomic bitsAlexander Aring7-23/+63
Currently manipulating lkb_dflags assumes to held the rsb lock assigned to the lkb. This is held by dlm message processing after certain time to lookup the right rsb from the received lkb message id. For user space locks flags, which is currently the only use case for lkb_dflags, flags are also being set during dlm character device handling without holding the rsb lock. To minimize the risk that bit operations are getting corrupted we switch to atomic bit operations. This patch will also introduce helpers to snapshot atomic bit values in an non atomic way. There might be still issues with the flag handling e.g. running in case of manipulating bit ops and snapshot them at the same time, but this patch minimize them and will start to use atomic bit operations. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-07fs: dlm: store lkb distributed flags into own valueAlexander Aring7-32/+27
This patch stores lkb distributed flags value in an separate value instead of sharing internal and distributed flags in lkb->lkb_flags value. This has the advantage to not mask/write back flag values in receive_flags() functionality. The dlm debug_fs does not provide the distributed flags anymore, those can be added in future. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-07fs: dlm: remove DLM_IFL_LOCAL_MS flagAlexander Aring2-33/+33
The DLM_IFL_LOCAL_MS flag is an internal non shared flag but used in m_flags of dlm messages. It is not shared because it is only used for local messaging. Instead using DLM_IFL_LOCAL_MS in dlm messages we pass a parameter around to signal local messaging or not. This patch is adding the local parameter to signal local messaging. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-07fs: dlm: rename stub to local message flagAlexander Aring3-59/+59
This patch renames DLM_IFL_STUB_MS to DLM_IFL_LOCAL_MS flag. The DLM_IFL_STUB_MS flag is somewhat misnamed, it means the dlm message is used for local message transfer only. It is used by recovery to resolve lock states if a node got fenced. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-07fs: dlm: remove deprecated code partsAlexander Aring12-469/+1
This patch removes code parts which was declared deprecated by commit 6b0afc0cc3e9 ("fs: dlm: don't use deprecated timeout features by default"). This contains the following dlm functionality: - start a cancel of a dlm request did not complete after certain timeout: The current way how dlm cancellation works and interfering with other dlm requests triggered by the user can end in an overlapping and returning in -EBUSY. The most user don't handle this case and are unaware that DLM can return such errno in such situation. Due the timeout the user are mostly unaware when this happens. - start a netlink warning messages for user space if dlm requests did not complete after certain timeout: This feature was never being built in the only known dlm user space side. As we are to remove the timeout cancellation feature we can directly remove this feature as well. There might be the possibility to bring the timeout cancellation feature back. However the current way of handling the -EBUSY case which is only a software limitation and not a hardware limitation should be changed. We minimize the current code base in DLM cancellation feature to not have to deal with those existing features while solving the DLM cancellation feature in general. UAPI define DLM_LSFL_TIMEWARN is commented as deprecated and reserved value. We should avoid at first to give it a new meaning but let possible users still compile by keeping this define. In far future we can give this flag a new meaning. The same for the DLM_LKF_TIMEOUT lock request flag. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-07DLM: increase socket backlog to avoid hangs with 16 nodesEdwin Török1-1/+1
On a 16 node virtual cluster with e1000 NICs joining the 12th node prints SYN flood warnings for the DLM port: Dec 21 01:46:41 localhost kernel: [ 2146.516664] TCP: request_sock_TCP: Possible SYN flooding on port 21064. Sending cookies. Check SNMP counters. And then joining a DLM lockspace hangs: ``` Dec 21 01:49:00 localhost kernel: [ 2285.780913] INFO: task xapi-clusterd:17638 blocked for more than 120 seconds. Dec 21 01:49:00 localhost kernel: [ 2285.786476] Not tainted 4.4.0+10 #1 Dec 21 01:49:00 localhost kernel: [ 2285.789043] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 21 01:49:00 localhost kernel: [ 2285.794611] xapi-clusterd D ffff88001930bc58 0 17638 1 0x00000000 Dec 21 01:49:00 localhost kernel: [ 2285.794615] ffff88001930bc58 ffff880025593800 ffff880022433800 ffff88001930c000 Dec 21 01:49:00 localhost kernel: [ 2285.794617] ffff88000ef4a660 ffff88000ef4a658 ffff880022433800 ffff88000ef4a000 Dec 21 01:49:00 localhost kernel: [ 2285.794619] ffff88001930bc70 ffffffff8159f6b4 7fffffffffffffff ffff88001930bd10 Dec 21 01:49:00 localhost kernel: [ 2285.794644] [<ffffffff811570fe>] ? printk+0x4d/0x4f Dec 21 01:49:00 localhost kernel: [ 2285.794647] [<ffffffff810b1741>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 Dec 21 01:49:00 localhost kernel: [ 2285.794649] [<ffffffff815a085d>] wait_for_completion+0x9d/0x110 Dec 21 01:49:00 localhost kernel: [ 2285.794653] [<ffffffff810979e0>] ? wake_up_q+0x80/0x80 Dec 21 01:49:00 localhost kernel: [ 2285.794661] [<ffffffffa03fa4b8>] dlm_new_lockspace+0x908/0xac0 [dlm] Dec 21 01:49:00 localhost kernel: [ 2285.794665] [<ffffffff810aaa60>] ? prepare_to_wait_event+0x100/0x100 Dec 21 01:49:00 localhost kernel: [ 2285.794670] [<ffffffffa0402e37>] device_write+0x497/0x6b0 [dlm] Dec 21 01:49:00 localhost kernel: [ 2285.794673] [<ffffffff811834f0>] ? handle_mm_fault+0x7f0/0x13b0 Dec 21 01:49:00 localhost kernel: [ 2285.794677] [<ffffffff811b4438>] __vfs_write+0x28/0xd0 Dec 21 01:49:00 localhost kernel: [ 2285.794679] [<ffffffff811b4b7f>] ? rw_verify_area+0x6f/0xd0 Dec 21 01:49:00 localhost kernel: [ 2285.794681] [<ffffffff811b4dc1>] vfs_write+0xb1/0x190 Dec 21 01:49:00 localhost kernel: [ 2285.794686] [<ffffffff8105ffc2>] ? __do_page_fault+0x302/0x420 Dec 21 01:49:00 localhost kernel: [ 2285.794688] [<ffffffff811b5986>] SyS_write+0x46/0xa0 Dec 21 01:49:00 localhost kernel: [ 2285.794690] [<ffffffff815a31ae>] entry_SYSCALL_64_fastpath+0x12/0x71 ``` The previous limit of 5 seems like an arbitrary number, that doesn't match any known DLM cluster size upper bound limit. Signed-off-by: Edwin Török <edvin.torok@citrix.com> Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: David Teigland <teigland@redhat.com> Cc: cluster-devel@redhat.com Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-07fs: dlm: add unbound flag to dlm_io workqueueAlexander Aring1-2/+2
This patch will add the WQ_UNBOUND flag to the lowcomms dlm_io workqueue which handles socket io handling to send and receive dlm messages. The amount of sockets will be 2 for a 3 node cluster. Each socket has two different workers for doing send and receive work by calling socket API functionality. Each worker will do their task in order to send dlm messages in a ordered stream based socket communication. On receive side the receive buffer will be queued up for an ordered dlm_process workqueue to parse received dlm messages. The parsing need to be done currently in an ordered synchronized way because the dlm message processing is not being made to parse parallel. After explaining all those workqueue behaviours in lowcomms, the dlm_io workqueue is only being used for socket handling. Each socket handling has 2 workers (send and receive). In a 3 cluster node we will end up with 4 workers. Without the WQ_UNBOUND flag the workers are tight to a CPU and can never switch, this could be an advantage because local CPU execution. However with dlm_locktorture testcase I expierenced not all workers are always in use and my assumption is that some workers are bound to the same CPU. We should always send or receive when we are ready to do so, one reason why we disable nigel algorithm on sockets. We should be safe to do the socket io handling on any CPU which can be switched during runtime. There is no assumption that the worker stays on the same CPU. There is no need to respect any workqueue concurrency model that each worker can only run on one CPU. Lowcomms queue_work() mechanism has an higher level flag to be sure that it can't schedule work if the previous worker did not signal it to keep ordered socket handling. Therefore this patch sets the WQ_UNBOUND flag to allow workers being executed by any available CPU. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-07fs: dlm: fix DLM_IFL_CB_PENDING gets overwrittenAlexander Aring3-7/+9
This patch introduce a new internal flag per lkb value to handle internal flags which are handled not on wire. The current lkb internal flags stored as lkb->lkb_flags are split in upper and lower bits, the lower bits are used to share internal flags over wire for other cluster wide lkb copies on other nodes. In commit 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks") we introduced a new internal flag for pending callbacks for the dlm callback queue. This flag is protected by the lkb->lkb_cb_lock lock. This patch overlooked that on dlm receive path and the mentioned upper and lower bits, that dlm will read the flags, mask it and write it back. As example receive_flags() in fs/dlm/lock.c. This flag manipulation is not done atomically and is not protected by lkb->lkb_cb_lock. This has unknown side effects of the current callback handling. In future we should move to set/clear/test bit functionality and avoid read, mask and writing back flag values. In later patches we will move the upper parts to the new introduced internal lkb flags which are not shared between other cluster nodes to the new non shared internal flag field to avoid similar issues. Cc: stable@vger.kernel.org Fixes: 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks") Reported-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-02-24Merge tag 'driver-core-6.3-rc1' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the large set of driver core changes for 6.3-rc1. There's a lot of changes this development cycle, most of the work falls into two different categories: - fw_devlink fixes and updates. This has gone through numerous review cycles and lots of review and testing by lots of different devices. Hopefully all should be good now, and Saravana will be keeping a watch for any potential regression on odd embedded systems. - driver core changes to work to make struct bus_type able to be moved into read-only memory (i.e. const) The recent work with Rust has pointed out a number of areas in the driver core where we are passing around and working with structures that really do not have to be dynamic at all, and they should be able to be read-only making things safer overall. This is the contuation of that work (started last release with kobject changes) in moving struct bus_type to be constant. We didn't quite make it for this release, but the remaining patches will be finished up for the release after this one, but the groundwork has been laid for this effort. Other than that we have in here: - debugfs memory leak fixes in some subsystems - error path cleanups and fixes for some never-able-to-be-hit codepaths. - cacheinfo rework and fixes - Other tiny fixes, full details are in the shortlog All of these have been in linux-next for a while with no reported problems" [ Geert Uytterhoeven points out that that last sentence isn't true, and that there's a pending report that has a fix that is queued up - Linus ] * tag 'driver-core-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (124 commits) debugfs: drop inline constant formatting for ERR_PTR(-ERROR) OPP: fix error checking in opp_migrate_dentry() debugfs: update comment of debugfs_rename() i3c: fix device.h kernel-doc warnings dma-mapping: no need to pass a bus_type into get_arch_dma_ops() driver core: class: move EXPORT_SYMBOL_GPL() lines to the correct place Revert "driver core: add error handling for devtmpfs_create_node()" Revert "devtmpfs: add debug info to handle()" Revert "devtmpfs: remove return value of devtmpfs_delete_node()" driver core: cpu: don't hand-override the uevent bus_type callback. devtmpfs: remove return value of devtmpfs_delete_node() devtmpfs: add debug info to handle() driver core: add error handling for devtmpfs_create_node() driver core: bus: update my copyright notice driver core: bus: add bus_get_dev_root() function driver core: bus: constify bus_unregister() driver core: bus: constify some internal functions driver core: bus: constify bus_get_kset() driver core: bus: constify bus_register/unregister_notifier() driver core: remove private pointer from struct bus_type ...
2023-02-22Merge tag 'net-next-6.3' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
2023-02-21Merge tag 'dlm-6.3' of ↵Linus Torvalds6-97/+136
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This fixes some races in the lowcomms startup and shutdown code that were found by targeted stress testing that quickly and repeatedly joins and leaves lockspaces" * tag 'dlm-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: fs: dlm: remove unnecessary waker_up() calls fs: dlm: move state change into else branch fs: dlm: remove newline in log_print fs: dlm: reduce the shutdown timeout to 5 secs fs: dlm: make dlm sequence id more robust fs: dlm: wait until all midcomms nodes detect version fs: dlm: ignore unexpected non dlm opts msgs fs: dlm: bring back previous shutdown handling fs: dlm: send FIN ack back in right cases fs: dlm: move sending fin message into state change handling fs: dlm: don't set stop rx flag after node reset fs: dlm: fix race setting stop tx flag fs: dlm: be sure to call dlm_send_queue_flush() fs: dlm: fix use after free in midcomms commit fs: dlm: start midcomms before scand fs/dlm: Remove "select SRCU" fs: dlm: fix return value check in dlm_memory_init()
2023-01-27kobject: kset_uevent_ops: make uevent() callback take a const *Greg Kroah-Hartman1-2/+2
The uevent() callback in struct kset_uevent_ops does not modify the kobject passed into it, so make the pointer const to enforce this restriction. When doing so, fix up all existing uevent() callbacks to have the correct signature to preserve the build. Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: David Teigland <teigland@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230111113018.459199-17-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-24fs: dlm: remove unnecessary waker_up() callsAlexander Aring1-2/+0
The wake_up() is already handled inside of midcomms_node_reset() when switching the state to CLOSED state. So there is not need to call it after midcomms_node_reset() again. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-24fs: dlm: move state change into else branchAlexander Aring1-3/+4
Currently we can switch at first into DLM_CLOSE_WAIT state and then do another state change if a condition is true. Instead of doing two state changes we handle the other state change inside an else branch of this condition. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-24fs: dlm: remove newline in log_printAlexander Aring1-4/+4
There is an API difference between log_print() and other printk()s to put a newline or not. This one was introduced by mistake because log_print() adds a newline. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-24fs: dlm: reduce the shutdown timeout to 5 secsAlexander Aring1-2/+2
When a shutdown is stuck, time out after 5 seconds instead of 3 minutes. After this timeout we try a forced shutdown. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-24fs: dlm: make dlm sequence id more robustAlexander Aring1-1/+1
When joining a new lockspace, use a random number to initialize a sequence number used in messages. This makes it easier to detect sequence number mismatches in message replies during tests that repeatedly join and leave a lockspace. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: wait until all midcomms nodes detect versionAlexander Aring3-0/+27
The current dlm version detection is very complex due to backwards compatablilty with earlier dlm protocol versions. It takes some time to detect if a peer node has a specific DLM version. If it's not detected, we just cut the socket connection. There could be cases where the local node has not detected the version yet, but the peer node has. In these cases, we are trying to shutdown the dlm connection with a FIN/ACK message exchange to be sure the other peer is ready to shutdown the connection on dlm application level. However this mechanism is only available on DLM protocol version 3.2 and we need to be sure the DLM version is detected before. To make it more robust we introduce a a "best effort" wait to wait for the version detection before shutdown the dlm connection. This need to be done before the kthread recoverd for recovery handling is stopped, because recovery handling will trigger enough messages to have a version detection going on. It is a corner case which was detected by modprobe dlm_locktroture module and rmmod dlm_locktorture module directly afterwards (in a looping behaviour). In practice probably nobody would leave a lockspace immediately after joining it. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: ignore unexpected non dlm opts msgsAlexander Aring1-10/+2
This patch ignores unexpected RCOM_NAMES/RCOM_STATUS messages. To be backwards compatible, those messages are not part of the new reliable DLM OPTS encapsulation header, and have their own retransmit handling using sequence number matching When we get unexpected non dlm opts messages, we should allow them and let RCOM message handling filter them out using sequence numbers. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: bring back previous shutdown handlingAlexander Aring2-34/+63
This patch mostly reverts commit 4f567acb0b86 ("fs: dlm: remove socket shutdown handling"). There can be situations where the dlm midcomms nodes hash and lowcomms connection hash are not equal, but we need to guarantee that the lowcomms are all closed on a last release of a dlm lockspace, when a shutdown is invoked. This patch guarantees that we always close all sockets managed by the lowcomms connection hash, and calls shutdown for the last message sent. This ensures we don't cut the socket, which could cause the peer to get a connection reset. In future we should try to merge the midcomms/lowcomms hashes into one hash and not handle both in separate hashes. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: send FIN ack back in right casesAlexander Aring1-4/+5
This patch moves to send a ack back for receiving a FIN message only when we are in valid states. In other cases and there might be a sender waiting for a ack we just let it timeout at the senders time and hopefully all other cleanups will remove the FIN message on their sending queue. As an example we should never send out an ACK being in LAST_ACK state or we cannot assume a working socket communication when we are in CLOSED state. Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>