From af0335d2926e1c597c247956cd608b6e8c9d6463 Mon Sep 17 00:00:00 2001 From: Joe Stringer Date: Sat, 22 Apr 2023 10:20:53 -0700 Subject: docs/bpf: Add table to describe LRU properties Depending on the map type and flags for LRU, different properties are global or percpu. Add a table to describe these. Signed-off-by: Joe Stringer Signed-off-by: Daniel Borkmann Acked-by: John Fastabend Link: https://lore.kernel.org/bpf/20230422172054.3355436-1-joe@isovalent.com --- Documentation/bpf/map_hash.rst | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst index 8669426264c6..1314dfc5e7e1 100644 --- a/Documentation/bpf/map_hash.rst +++ b/Documentation/bpf/map_hash.rst @@ -29,7 +29,16 @@ will automatically evict the least recently used entries when the hash table reaches capacity. An LRU hash maintains an internal LRU list that is used to select elements for eviction. This internal LRU list is shared across CPUs but it is possible to request a per CPU LRU list with -the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``. +the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``. The +following table outlines the properties of LRU maps depending on the +map type and the flags used to create the map. 
+ +======================== ========================= ================================ +Flag ``BPF_MAP_TYPE_LRU_HASH`` ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` +======================== ========================= ================================ +**BPF_F_NO_COMMON_LRU** Per-CPU LRU, global map Per-CPU LRU, per-cpu map +**!BPF_F_NO_COMMON_LRU** Global LRU, global map Global LRU, per-cpu map +======================== ========================= ================================ Usage ===== -- cgit v1.2.3 From 1a986518b8a517637f70cd6d7d494bd0cbbf6145 Mon Sep 17 00:00:00 2001 From: Joe Stringer Date: Sat, 22 Apr 2023 10:20:54 -0700 Subject: docs/bpf: Add LRU internals description and graph Extend the bpf hashmap docs to include a brief description of the internals of the LRU map type (setting appropriate API expectations), including the original commit message from Martin and a variant on the graph that I had presented during my Linux Plumbers Conference 2022 talk on "Pressure feedback for LRU map types"[0]. The node names in the dot file correspond roughly to the functions where the logic for those decisions or steps is defined, to help curious developers to cross-reference and update this logic if the details of the LRU implementation ever differ from this description. [0] https://lpc.events/event/16/contributions/1368/ Signed-off-by: Joe Stringer Signed-off-by: Daniel Borkmann Reviewed-by: Bagas Sanjaya Acked-by: John Fastabend Link: https://lore.kernel.org/bpf/20230422172054.3355436-2-joe@isovalent.com --- Documentation/bpf/map_hash.rst | 42 ++++++++ Documentation/bpf/map_lru_hash_update.dot | 172 ++++++++++++++++++++++++++++++ 2 files changed, 214 insertions(+) create mode 100644 Documentation/bpf/map_lru_hash_update.dot (limited to 'Documentation') diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst index 1314dfc5e7e1..d2343952f2cb 100644 --- a/Documentation/bpf/map_hash.rst +++ b/Documentation/bpf/map_hash.rst @@ -1,5 +1,6 @@ .. 
SPDX-License-Identifier: GPL-2.0-only .. Copyright (C) 2022 Red Hat, Inc. +.. Copyright (C) 2022-2023 Isovalent, Inc. =============================================== BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants @@ -215,3 +216,44 @@ Userspace walking the map elements from the map declared above: cur_key = &next_key; } } + +Internals +========= + +This section of the document is targeted at Linux developers and describes +aspects of the map implementations that are not considered stable ABI. The +following details are subject to change in future versions of the kernel. + +``BPF_MAP_TYPE_LRU_HASH`` and variants +-------------------------------------- + +Updating elements in LRU maps may trigger eviction behaviour when the capacity +of the map is reached. There are various steps that the update algorithm +attempts in order to enforce the LRU property which have increasing impacts on +other CPUs involved in the following operation attempts: + +- Attempt to use CPU-local state to batch operations +- Attempt to fetch free nodes from global lists +- Attempt to pull any node from a global list and remove it from the hashmap +- Attempt to pull any node from any CPU's list and remove it from the hashmap + +This algorithm is described visually in the following diagram. See the +description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of +the corresponding operations: + +.. kernel-figure:: map_lru_hash_update.dot + :alt: Diagram outlining the LRU eviction steps taken during map update. + + LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and + variants. See the dot file source for kernel function name code references. + +Map updates start from the oval in the top right "begin ``bpf_map_update()``" +and progress through the graph towards the bottom where the result may be +either a successful update or a failure with various error codes. The key in +the top right provides indicators for which locks may be involved in specific +operations. 
This is intended as a visual hint for reasoning about how map +contention may impact update operations, though the map type and flags may +impact the actual contention on those locks, based on the logic described in +the table above. For instance, if the map is created with type +``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and flags ``BPF_F_NO_COMMON_LRU`` then all map +properties would be per-cpu. diff --git a/Documentation/bpf/map_lru_hash_update.dot b/Documentation/bpf/map_lru_hash_update.dot new file mode 100644 index 000000000000..a0fee349d29c --- /dev/null +++ b/Documentation/bpf/map_lru_hash_update.dot @@ -0,0 +1,172 @@ +// SPDX-License-Identifier: GPL-2.0-only +// Copyright (C) 2022-2023 Isovalent, Inc. +digraph { + node [colorscheme=accent4,style=filled] # Apply colorscheme to all nodes + graph [splines=ortho, nodesep=1] + + subgraph cluster_key { + label = "Key\n(locks held during operation)"; + rankdir = TB; + + remote_lock [shape=rectangle,fillcolor=4,label="remote CPU LRU lock"] + hash_lock [shape=rectangle,fillcolor=3,label="hashtab lock"] + lru_lock [shape=rectangle,fillcolor=2,label="LRU lock"] + local_lock [shape=rectangle,fillcolor=1,label="local CPU LRU lock"] + no_lock [shape=rectangle,label="no locks held"] + } + + begin [shape=oval,label="begin\nbpf_map_update()"] + + // Nodes below with an 'fn_' prefix are roughly labeled by the C function + // names that initiate the corresponding logic in kernel/bpf/bpf_lru_list.c. + // Number suffixes and errno suffixes handle subsections of the corresponding + // logic in the function as of the writing of this dot. + + // cf. __local_list_pop_free() / bpf_percpu_lru_pop_free() + local_freelist_check [shape=diamond,fillcolor=1, + label="Local freelist\nnode available?"]; + use_local_node [shape=rectangle, + label="Use node owned\nby this CPU"] + + // cf. 
bpf_lru_pop_free() + common_lru_check [shape=diamond, + label="Map created with\ncommon LRU?\n(!BPF_F_NO_COMMON_LRU)"]; + + fn_bpf_lru_list_pop_free_to_local [shape=rectangle,fillcolor=2, + label="Flush local pending, + Rotate Global list, move + LOCAL_FREE_TARGET + from global -> local"] + // Also corresponds to: + // fn__local_list_flush() + // fn_bpf_lru_list_rotate() + fn___bpf_lru_node_move_to_free[shape=diamond,fillcolor=2, + label="Able to free\nLOCAL_FREE_TARGET\nnodes?"] + + fn___bpf_lru_list_shrink_inactive [shape=rectangle,fillcolor=3, + label="Shrink inactive list + up to remaining + LOCAL_FREE_TARGET + (global LRU -> local)"] + fn___bpf_lru_list_shrink [shape=diamond,fillcolor=2, + label="> 0 entries in\nlocal free list?"] + fn___bpf_lru_list_shrink2 [shape=rectangle,fillcolor=2, + label="Steal one node from + inactive, or if empty, + from active global list"] + fn___bpf_lru_list_shrink3 [shape=rectangle,fillcolor=3, + label="Try to remove\nnode from hashtab"] + + local_freelist_check2 [shape=diamond,label="Htab removal\nsuccessful?"] + common_lru_check2 [shape=diamond, + label="Map created with\ncommon LRU?\n(!BPF_F_NO_COMMON_LRU)"]; + + subgraph cluster_remote_lock { + label = "Iterate through CPUs\n(start from current)"; + style = dashed; + rankdir=LR; + + local_freelist_check5 [shape=diamond,fillcolor=4, + label="Steal a node from\nper-cpu freelist?"] + local_freelist_check6 [shape=rectangle,fillcolor=4, + label="Steal a node from + (1) Unreferenced pending, or + (2) Any pending node"] + local_freelist_check7 [shape=rectangle,fillcolor=3, + label="Try to remove\nnode from hashtab"] + fn_htab_lru_map_update_elem [shape=diamond, + label="Stole node\nfrom remote\nCPU?"] + fn_htab_lru_map_update_elem2 [shape=diamond,label="Iterated\nall CPUs?"] + // Also corresponds to: + // use_local_node() + // fn__local_list_pop_pending() + } + + fn_bpf_lru_list_pop_free_to_local2 [shape=rectangle, + label="Use node that was\nnot recently referenced"] + 
local_freelist_check4 [shape=rectangle, + label="Use node that was\nactively referenced\nin global list"] + fn_htab_lru_map_update_elem_ENOMEM [shape=oval,label="return -ENOMEM"] + fn_htab_lru_map_update_elem3 [shape=rectangle, + label="Use node that was\nactively referenced\nin (another?) CPU's cache"] + fn_htab_lru_map_update_elem4 [shape=rectangle,fillcolor=3, + label="Update hashmap\nwith new element"] + fn_htab_lru_map_update_elem5 [shape=oval,label="return 0"] + fn_htab_lru_map_update_elem_EBUSY [shape=oval,label="return -EBUSY"] + fn_htab_lru_map_update_elem_EEXIST [shape=oval,label="return -EEXIST"] + fn_htab_lru_map_update_elem_ENOENT [shape=oval,label="return -ENOENT"] + + begin -> local_freelist_check + local_freelist_check -> use_local_node [xlabel="Y"] + local_freelist_check -> common_lru_check [xlabel="N"] + common_lru_check -> fn_bpf_lru_list_pop_free_to_local [xlabel="Y"] + common_lru_check -> fn___bpf_lru_list_shrink_inactive [xlabel="N"] + fn_bpf_lru_list_pop_free_to_local -> fn___bpf_lru_node_move_to_free + fn___bpf_lru_node_move_to_free -> + fn_bpf_lru_list_pop_free_to_local2 [xlabel="Y"] + fn___bpf_lru_node_move_to_free -> + fn___bpf_lru_list_shrink_inactive [xlabel="N"] + fn___bpf_lru_list_shrink_inactive -> fn___bpf_lru_list_shrink + fn___bpf_lru_list_shrink -> fn_bpf_lru_list_pop_free_to_local2 [xlabel = "Y"] + fn___bpf_lru_list_shrink -> fn___bpf_lru_list_shrink2 [xlabel="N"] + fn___bpf_lru_list_shrink2 -> fn___bpf_lru_list_shrink3 + fn___bpf_lru_list_shrink3 -> local_freelist_check2 + local_freelist_check2 -> local_freelist_check4 [xlabel = "Y"] + local_freelist_check2 -> common_lru_check2 [xlabel = "N"] + common_lru_check2 -> local_freelist_check5 [xlabel = "Y"] + common_lru_check2 -> fn_htab_lru_map_update_elem_ENOMEM [xlabel = "N"] + local_freelist_check5 -> fn_htab_lru_map_update_elem [xlabel = "Y"] + local_freelist_check5 -> local_freelist_check6 [xlabel = "N"] + local_freelist_check6 -> local_freelist_check7 + local_freelist_check7 
-> fn_htab_lru_map_update_elem + + fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem3 [xlabel = "Y"] + fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem2 [xlabel = "N"] + fn_htab_lru_map_update_elem2 -> + fn_htab_lru_map_update_elem_ENOMEM [xlabel = "Y"] + fn_htab_lru_map_update_elem2 -> local_freelist_check5 [xlabel = "N"] + fn_htab_lru_map_update_elem3 -> fn_htab_lru_map_update_elem4 + + use_local_node -> fn_htab_lru_map_update_elem4 + fn_bpf_lru_list_pop_free_to_local2 -> fn_htab_lru_map_update_elem4 + local_freelist_check4 -> fn_htab_lru_map_update_elem4 + + fn_htab_lru_map_update_elem4 -> fn_htab_lru_map_update_elem5 [headlabel="Success"] + fn_htab_lru_map_update_elem4 -> + fn_htab_lru_map_update_elem_EBUSY [xlabel="Hashtab lock failed"] + fn_htab_lru_map_update_elem4 -> + fn_htab_lru_map_update_elem_EEXIST [xlabel="BPF_EXIST set and\nkey already exists"] + fn_htab_lru_map_update_elem4 -> + fn_htab_lru_map_update_elem_ENOENT [headlabel="BPF_NOEXIST set\nand no such entry"] + + // Create invisible pad nodes to line up various nodes + pad0 [style=invis] + pad1 [style=invis] + pad2 [style=invis] + pad3 [style=invis] + pad4 [style=invis] + + // Line up the key with the top of the graph + no_lock -> local_lock [style=invis] + local_lock -> lru_lock [style=invis] + lru_lock -> hash_lock [style=invis] + hash_lock -> remote_lock [style=invis] + remote_lock -> local_freelist_check5 [style=invis] + remote_lock -> fn___bpf_lru_list_shrink [style=invis] + + // Line up return code nodes at the bottom of the graph + fn_htab_lru_map_update_elem -> pad0 [style=invis] + pad0 -> pad1 [style=invis] + pad1 -> pad2 [style=invis] + //pad2-> fn_htab_lru_map_update_elem_ENOMEM [style=invis] + fn_htab_lru_map_update_elem4 -> pad3 [style=invis] + pad3 -> fn_htab_lru_map_update_elem5 [style=invis] + pad3 -> fn_htab_lru_map_update_elem_EBUSY [style=invis] + pad3 -> fn_htab_lru_map_update_elem_EEXIST [style=invis] + pad3 -> fn_htab_lru_map_update_elem_ENOENT 
[style=invis] + + // Reduce diagram width by forcing some nodes to appear above others + local_freelist_check4 -> fn_htab_lru_map_update_elem3 [style=invis] + common_lru_check2 -> pad4 [style=invis] + pad4 -> local_freelist_check5 [style=invis] +} -- cgit v1.2.3 From 69535186297b37e6e0a16290766666f4e8a55793 Mon Sep 17 00:00:00 2001 From: Will Hawkins Date: Thu, 27 Apr 2023 22:30:15 -0400 Subject: bpf, docs: Update llvm_relocs.rst with typo fixes Correct a few typographical errors and fix some mistakes in examples. Signed-off-by: Will Hawkins Acked-by: Yonghong Song Link: https://lore.kernel.org/r/20230428023015.1698072-2-hawkinsw@obs.cr Signed-off-by: Alexei Starovoitov --- Documentation/bpf/llvm_reloc.rst | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/bpf/llvm_reloc.rst b/Documentation/bpf/llvm_reloc.rst index ca8957d5b671..e4a777a6a3a2 100644 --- a/Documentation/bpf/llvm_reloc.rst +++ b/Documentation/bpf/llvm_reloc.rst @@ -48,7 +48,7 @@ the code with ``llvm-objdump -dr test.o``:: 14: 0f 10 00 00 00 00 00 00 r0 += r1 15: 95 00 00 00 00 00 00 00 exit -There are four relations in the above for four ``LD_imm64`` instructions. +There are four relocations in the above for four ``LD_imm64`` instructions. The following ``llvm-readelf -r test.o`` shows the binary values of the four relocations:: @@ -79,14 +79,16 @@ The following is the symbol table with ``llvm-readelf -s test.o``:: The 6th entry is global variable ``g1`` with value 0. Similarly, the second relocation is at ``.text`` offset ``0x18``, instruction 3, -for global variable ``g2`` which has a symbol value 4, the offset -from the start of ``.data`` section. - -The third and fourth relocations refers to static variables ``l1`` -and ``l2``. From ``.rel.text`` section above, it is not clear -which symbols they really refers to as they both refers to +has a type of ``R_BPF_64_64`` and refers to entry 7 in the symbol table. 
+The second relocation resolves to global variable ``g2`` which has a symbol +value 4. The symbol value represents the offset from the start of ``.data`` +section where the initial value of the global variable ``g2`` is stored. + +The third and fourth relocations refer to static variables ``l1`` +and ``l2``. From the ``.rel.text`` section above, it is not clear +to which symbols they really refer as they both refer to symbol table entry 4, symbol ``sec``, which has ``STT_SECTION`` type -and represents a section. So for static variable or function, +and represents a section. So for a static variable or function, the section offset is written to the original insn buffer, which is called ``A`` (addend). Looking at above insn ``7`` and ``11``, they have section offset ``8`` and ``12``. -- cgit v1.2.3 From 3bda08b63670c39be390fcb00e7718775508e673 Mon Sep 17 00:00:00 2001 From: Daniel Rosenberg Date: Fri, 5 May 2023 18:31:30 -0700 Subject: bpf: Allow NULL buffers in bpf_dynptr_slice(_rw) bpf_dynptr_slice(_rw) uses a user provided buffer if it can not provide a pointer to a block of contiguous memory. This buffer is unused in the case of local dynptrs, and may be unused in other cases as well. There is no need to require the buffer, as the kfunc can just return NULL if it was needed and not provided. This adds another kfunc annotation, __opt, which combines with __sz and __szk to allow the buffer associated with the size to be NULL. If the buffer is NULL, the verifier does not check that the buffer is of sufficient size. 
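As a rough illustration of the rule described above, the decision of whether the memory/size pair still needs checking can be modelled in plain userspace C. This is only a sketch: param_has_suffix() and mem_size_check_needed() are hypothetical names chosen for this example and are not the verifier's functions; they merely mirror the shape of the condition added to check_kfunc_args() in this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the parameter name ends with the suffix. */
static bool param_has_suffix(const char *name, const char *suffix)
{
	size_t nlen = strlen(name);
	size_t slen = strlen(suffix);

	return nlen >= slen && strcmp(name + nlen - slen, suffix) == 0;
}

/*
 * Hypothetical model of the verifier decision: the size check is skipped
 * only when the buffer is NULL *and* its parameter carries the __opt
 * suffix. In every other case the (buffer, size) pair is checked as before.
 */
static bool mem_size_check_needed(const char *buf_param_name, const void *buf)
{
	return buf != NULL || !param_has_suffix(buf_param_name, "__opt");
}
```

Under this model, a NULL buffer passed for a parameter named buffer__opt skips the size check, while a NULL buffer for a plain buffer parameter is still checked (and would be rejected with a nonzero size).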
Signed-off-by: Daniel Rosenberg Link: https://lore.kernel.org/r/20230506013134.2492210-2-drosen@google.com Signed-off-by: Alexei Starovoitov --- Documentation/bpf/kfuncs.rst | 23 ++++++++++++++++++++++- include/linux/skbuff.h | 2 +- kernel/bpf/helpers.c | 30 ++++++++++++++++++------------ kernel/bpf/verifier.c | 17 +++++++++++++---- 4 files changed, 54 insertions(+), 18 deletions(-) (limited to 'Documentation') diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst index ea2516374d92..7a3d9de5f315 100644 --- a/Documentation/bpf/kfuncs.rst +++ b/Documentation/bpf/kfuncs.rst @@ -100,7 +100,7 @@ Hence, whenever a constant scalar argument is accepted by a kfunc which is not a size parameter, and the value of the constant matters for program safety, __k suffix should be used. -2.2.2 __uninit Annotation +2.2.3 __uninit Annotation ------------------------- This annotation is used to indicate that the argument will be treated as @@ -117,6 +117,27 @@ Here, the dynptr will be treated as an uninitialized dynptr. Without this annotation, the verifier will reject the program if the dynptr passed in is not initialized. +2.2.4 __opt Annotation +------------------------- + +This annotation is used to indicate that the buffer associated with an __sz or __szk +argument may be null. If the function is passed a NULL pointer in place of the buffer, +the verifier will not check that the length is appropriate for the buffer. The kfunc is +responsible for checking if this buffer is null before using it. + +An example is given below:: + + __bpf_kfunc void *bpf_dynptr_slice(..., void *buffer__opt, u32 buffer__szk) + { + ... + } + +Here, the buffer may be null. If buffer is not null, it is at least of size buffer__szk. +Either way, the returned buffer is either NULL, or of size buffer__szk. Without this +annotation, the verifier will reject the program if a null pointer is passed in with +a nonzero size. + + .. 
_BPF_kfunc_nodef: 2.3 Using an existing kernel function diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 738776ab8838..8ddb4af1a501 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -4033,7 +4033,7 @@ __skb_header_pointer(const struct sk_buff *skb, int offset, int len, if (likely(hlen - offset >= len)) return (void *)data + offset; - if (!skb || unlikely(skb_copy_bits(skb, offset, buffer, len) < 0)) + if (!skb || !buffer || unlikely(skb_copy_bits(skb, offset, buffer, len) < 0)) return NULL; return buffer; diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index a128fe0ab2d0..4ef4c4f8a355 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2190,13 +2190,15 @@ __bpf_kfunc struct task_struct *bpf_task_from_pid(s32 pid) * bpf_dynptr_slice() - Obtain a read-only pointer to the dynptr data. * @ptr: The dynptr whose data slice to retrieve * @offset: Offset into the dynptr - * @buffer: User-provided buffer to copy contents into - * @buffer__szk: Size (in bytes) of the buffer. This is the length of the - * requested slice. This must be a constant. + * @buffer__opt: User-provided buffer to copy contents into. May be NULL + * @buffer__szk: Size (in bytes) of the buffer if present. This is the + * length of the requested slice. This must be a constant. * * For non-skb and non-xdp type dynptrs, there is no difference between * bpf_dynptr_slice and bpf_dynptr_data. * + * If buffer__opt is NULL, the call will fail if buffer_opt was needed. + * * If the intention is to write to the data slice, please use * bpf_dynptr_slice_rdwr. 
* @@ -2213,7 +2215,7 @@ __bpf_kfunc struct task_struct *bpf_task_from_pid(s32 pid) * direct pointer) */ __bpf_kfunc void *bpf_dynptr_slice(const struct bpf_dynptr_kern *ptr, u32 offset, - void *buffer, u32 buffer__szk) + void *buffer__opt, u32 buffer__szk) { enum bpf_dynptr_type type; u32 len = buffer__szk; @@ -2233,15 +2235,17 @@ __bpf_kfunc void *bpf_dynptr_slice(const struct bpf_dynptr_kern *ptr, u32 offset case BPF_DYNPTR_TYPE_RINGBUF: return ptr->data + ptr->offset + offset; case BPF_DYNPTR_TYPE_SKB: - return skb_header_pointer(ptr->data, ptr->offset + offset, len, buffer); + return skb_header_pointer(ptr->data, ptr->offset + offset, len, buffer__opt); case BPF_DYNPTR_TYPE_XDP: { void *xdp_ptr = bpf_xdp_pointer(ptr->data, ptr->offset + offset, len); if (xdp_ptr) return xdp_ptr; - bpf_xdp_copy_buf(ptr->data, ptr->offset + offset, buffer, len, false); - return buffer; + if (!buffer__opt) + return NULL; + bpf_xdp_copy_buf(ptr->data, ptr->offset + offset, buffer__opt, len, false); + return buffer__opt; } default: WARN_ONCE(true, "unknown dynptr type %d\n", type); @@ -2253,13 +2257,15 @@ __bpf_kfunc void *bpf_dynptr_slice(const struct bpf_dynptr_kern *ptr, u32 offset * bpf_dynptr_slice_rdwr() - Obtain a writable pointer to the dynptr data. * @ptr: The dynptr whose data slice to retrieve * @offset: Offset into the dynptr - * @buffer: User-provided buffer to copy contents into - * @buffer__szk: Size (in bytes) of the buffer. This is the length of the - * requested slice. This must be a constant. + * @buffer__opt: User-provided buffer to copy contents into. May be NULL + * @buffer__szk: Size (in bytes) of the buffer if present. This is the + * length of the requested slice. This must be a constant. * * For non-skb and non-xdp type dynptrs, there is no difference between * bpf_dynptr_slice and bpf_dynptr_data. * + * If buffer__opt is NULL, the call will fail if buffer_opt was needed. 
+ * The returned pointer is writable and may point to either directly the dynptr * data at the requested offset or to the buffer if unable to obtain a direct * data pointer to (example: the requested slice is to the paged area of an skb @@ -2290,7 +2296,7 @@ __bpf_kfunc void *bpf_dynptr_slice(const struct bpf_dynptr_kern *ptr, u32 offset * direct pointer) */ __bpf_kfunc void *bpf_dynptr_slice_rdwr(const struct bpf_dynptr_kern *ptr, u32 offset, - void *buffer, u32 buffer__szk) + void *buffer__opt, u32 buffer__szk) { if (!ptr->data || __bpf_dynptr_is_rdonly(ptr)) return NULL; @@ -2317,7 +2323,7 @@ __bpf_kfunc void *bpf_dynptr_slice_rdwr(const struct bpf_dynptr_kern *ptr, u32 o * will be copied out into the buffer and the user will need to call * bpf_dynptr_write() to commit changes. */ - return bpf_dynptr_slice(ptr, offset, buffer, buffer__szk); + return bpf_dynptr_slice(ptr, offset, buffer__opt, buffer__szk); } __bpf_kfunc int bpf_dynptr_adjust(struct bpf_dynptr_kern *ptr, u32 start, u32 end) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 0fa96581eb77..7e6bbae9db81 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -9743,6 +9743,11 @@ static bool is_kfunc_arg_const_mem_size(const struct btf *btf, return __kfunc_param_match_suffix(btf, arg, "__szk"); } +static bool is_kfunc_arg_optional(const struct btf *btf, const struct btf_param *arg) +{ + return __kfunc_param_match_suffix(btf, arg, "__opt"); +} + static bool is_kfunc_arg_constant(const struct btf *btf, const struct btf_param *arg) { return __kfunc_param_match_suffix(btf, arg, "__k"); @@ -10830,13 +10835,17 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_ break; case KF_ARG_PTR_TO_MEM_SIZE: { + struct bpf_reg_state *buff_reg = &regs[regno]; + const struct btf_param *buff_arg = &args[i]; struct bpf_reg_state *size_reg = &regs[regno + 1]; const struct btf_param *size_arg = &args[i + 1]; - ret = check_kfunc_mem_size_reg(env, size_reg, regno + 1); - if 
(ret < 0) { - verbose(env, "arg#%d arg#%d memory, len pair leads to invalid memory access\n", i, i + 1); - return ret; + if (!register_is_null(buff_reg) || !is_kfunc_arg_optional(meta->btf, buff_arg)) { + ret = check_kfunc_mem_size_reg(env, size_reg, regno + 1); + if (ret < 0) { + verbose(env, "arg#%d arg#%d memory, len pair leads to invalid memory access\n", i, i + 1); + return ret; + } } if (is_kfunc_arg_const_mem_size(meta->btf, size_arg, size_reg)) { -- cgit v1.2.3 From ccce324dabfe2143519daf50ed8b1ef1d0c542f7 Mon Sep 17 00:00:00 2001 From: David Morley Date: Tue, 9 May 2023 18:05:58 +0000 Subject: tcp: make the first N SYN RTO backoffs linear Currently the SYN RTO schedule follows an exponential backoff scheme, which can be unnecessarily conservative in cases where there are link failures. In such cases, it's better to aggressively try to retransmit packets, so it takes routers less time to find a repath with a working link. We chose a default value for this sysctl of 4, to follow the macOS and iOS backoff scheme of 1,1,1,1,1,2,4,8, ... macOS and iOS have used this backoff schedule for over a decade, since before this 2009 IETF presentation discussed the behavior: https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf This commit makes the SYN RTO schedule start with a number of linear backoffs given by the following sysctl: * tcp_syn_linear_timeouts This changes the SYN RTO scheme to be: init_rto_val for tcp_syn_linear_timeouts, exp backoff starting at init_rto_val For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ... 
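The schedule above can be reproduced with a short userspace sketch. This is an illustrative model only, not the code in net/ipv4/tcp_timer.c: syn_rto_at() and syn_elapsed() are hypothetical names, times are in whole seconds, and the clamp is passed in as a parameter (rto_max) rather than taken from the kernel's TCP_RTO_MAX.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the SYN RTO schedule from the commit message.
 *
 * n        - zero-based index of the timeout (0 is the initial SYN)
 * init_rto - initial RTO in seconds (must be > 0)
 * linear   - value of tcp_syn_linear_timeouts
 * rto_max  - upper clamp for the RTO in seconds
 *
 * Timeouts 0..linear all use init_rto; each later timeout doubles the
 * previous one until rto_max is reached.
 */
static uint64_t syn_rto_at(unsigned int n, uint64_t init_rto,
			   unsigned int linear, uint64_t rto_max)
{
	uint64_t rto = init_rto;
	unsigned int i;

	assert(init_rto > 0);

	if (rto > rto_max)
		return rto_max;

	/* One doubling per timeout beyond the linear phase. */
	for (i = linear; i < n; i++) {
		if (rto > rto_max / 2)
			return rto_max;	/* doubling would reach the clamp */
		rto <<= 1;
	}
	return rto;
}

/* Sum of the first 'count' timeouts, i.e. seconds elapsed when they expire. */
static uint64_t syn_elapsed(unsigned int count, uint64_t init_rto,
			    unsigned int linear, uint64_t rto_max)
{
	uint64_t total = 0;
	unsigned int n;

	for (n = 0; n < count; n++)
		total += syn_rto_at(n, init_rto, linear, rto_max);
	return total;
}
```

With init_rto = 1 and linear = 2 the model yields 1, 1, 1, 2, 4, 8, 16, matching the example above. With linear = 4 and ten retransmissions (tcp_syn_retries = 6 plus 4 linear timeouts), syn_elapsed(10, 1, 4, 120) gives 67 and syn_elapsed(11, 1, 4, 120) gives 131, which agrees with the figures in the ip-sysctl.rst hunk of this patch.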
Signed-off-by: David Morley Signed-off-by: Yuchung Cheng Signed-off-by: Neal Cardwell Tested-by: David Morley Reviewed-by: Eric Dumazet Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com Signed-off-by: Paolo Abeni --- Documentation/networking/ip-sysctl.rst | 17 ++++++++++++++--- include/net/netns/ipv4.h | 1 + net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++ net/ipv4/tcp_ipv4.c | 1 + net/ipv4/tcp_timer.c | 17 +++++++++++++---- 5 files changed, 39 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 6ec06a33688a..3f6d3d5f5626 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -881,9 +881,10 @@ tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs tcp_syn_retries - INTEGER Number of times initial SYNs for an active TCP connection attempt will be retransmitted. Should not be higher than 127. Default value - is 6, which corresponds to 63seconds till the last retransmission - with the current initial RTO of 1second. With this the final timeout - for an active TCP connection attempt will happen after 127seconds. + is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4) + till the last retransmission with the current initial RTO of 1second. + With this the final timeout for an active TCP connection attempt + will happen after 131seconds. tcp_timestamps - INTEGER Enable timestamps as defined in RFC1323. @@ -946,6 +947,16 @@ tcp_pacing_ca_ratio - INTEGER Default: 120 +tcp_syn_linear_timeouts - INTEGER + The number of times for an active TCP connection to retransmit SYNs with + a linear backoff timeout before defaulting to an exponential backoff + timeout. This has no effect on SYNACK at the passive TCP side. + + With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would + expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... 
(4 linear timeouts, + and the first exponential backoff using 2^0 * initial_RTO). + Default: 4 + tcp_tso_win_divisor - INTEGER This allows control over what percentage of the congestion window can be consumed by a single TSO frame. diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index db762e35aca9..a4efb7a2796c 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -194,6 +194,7 @@ struct netns_ipv4 { int sysctl_udp_rmem_min; u8 sysctl_fib_notify_on_flag_change; + u8 sysctl_tcp_syn_linear_timeouts; #ifdef CONFIG_NET_L3_MASTER_DEV u8 sysctl_udp_l3mdev_accept; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 40fe70fc2015..6ae3345a3bdf 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -34,6 +34,7 @@ static int ip_ttl_min = 1; static int ip_ttl_max = 255; static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; +static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; static u32 u32_max_div_HZ = UINT_MAX / HZ; @@ -1470,6 +1471,15 @@ static struct ctl_table ipv4_net_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = &tcp_plb_max_cong_thresh, }, + { + .procname = "tcp_syn_linear_timeouts", + .data = &init_net.ipv4.sysctl_tcp_syn_linear_timeouts, + .maxlen = sizeof(u8), + .mode = 0644, + .proc_handler = proc_dou8vec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = &tcp_syn_linear_timeouts_max, + }, { } }; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 39bda2b1066e..db24ed8f8509 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -3275,6 +3275,7 @@ static int __net_init tcp_sk_init(struct net *net) else net->ipv4.tcp_congestion_control = &tcp_reno; + net->ipv4.sysctl_tcp_syn_linear_timeouts = 4; return 0; } diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index b839c2f91292..0d93a2573807 100644 --- a/net/ipv4/tcp_timer.c +++ 
b/net/ipv4/tcp_timer.c @@ -234,14 +234,19 @@ static int tcp_write_timeout(struct sock *sk) struct tcp_sock *tp = tcp_sk(sk); struct net *net = sock_net(sk); bool expired = false, do_reset; - int retry_until; + int retry_until, max_retransmits; if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) { if (icsk->icsk_retransmits) __dst_negative_advice(sk); retry_until = icsk->icsk_syn_retries ? : READ_ONCE(net->ipv4.sysctl_tcp_syn_retries); - expired = icsk->icsk_retransmits >= retry_until; + + max_retransmits = retry_until; + if (sk->sk_state == TCP_SYN_SENT) + max_retransmits += READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts); + + expired = icsk->icsk_retransmits >= max_retransmits; } else { if (retransmits_timed_out(sk, READ_ONCE(net->ipv4.sysctl_tcp_retries1), 0)) { /* Black hole detection */ @@ -577,8 +582,12 @@ out_reset_timer: icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) { icsk->icsk_backoff = 0; icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX); - } else { - /* Use normal (exponential) backoff */ + } else if (sk->sk_state != TCP_SYN_SENT || + icsk->icsk_backoff > + READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts)) { + /* Use normal (exponential) backoff unless linear timeouts are + * activated. + */ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); } inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, -- cgit v1.2.3 From eefca7ec514262aef08d0ef261552f2f604bd851 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Thu, 11 May 2023 11:49:50 -0400 Subject: net/handshake: Enable the SNI extension to work properly Enable the upper layer protocol to specify the SNI peername. This avoids the need for tlshd to use a DNS lookup, which can return a hostname that doesn't match the incoming certificate's SubjectName. Fixes: 2fd5532044a8 ("net/handshake: Add a kernel API for requesting a TLSv1.3 handshake") Reviewed-by: Simon Horman Signed-off-by: Chuck Lever Signed-off-by: David S. 
Miller --- Documentation/netlink/specs/handshake.yaml | 4 ++++ Documentation/networking/tls-handshake.rst | 5 +++++ include/net/handshake.h | 1 + include/uapi/linux/handshake.h | 1 + net/handshake/tlshd.c | 8 ++++++++ 5 files changed, 19 insertions(+) (limited to 'Documentation') diff --git a/Documentation/netlink/specs/handshake.yaml b/Documentation/netlink/specs/handshake.yaml index 614f1a585511..6d89e30f5fd5 100644 --- a/Documentation/netlink/specs/handshake.yaml +++ b/Documentation/netlink/specs/handshake.yaml @@ -68,6 +68,9 @@ attribute-sets: type: nest nested-attributes: x509 multi-attr: true + - + name: peername + type: string - name: done attributes: @@ -105,6 +108,7 @@ operations: - auth-mode - peer-identity - certificate + - peername - name: done doc: Handler reports handshake completion diff --git a/Documentation/networking/tls-handshake.rst b/Documentation/networking/tls-handshake.rst index a2817a88e905..6f5ea1646a47 100644 --- a/Documentation/networking/tls-handshake.rst +++ b/Documentation/networking/tls-handshake.rst @@ -53,6 +53,7 @@ fills in a structure that contains the parameters of the request: struct socket *ta_sock; tls_done_func_t ta_done; void *ta_data; + const char *ta_peername; unsigned int ta_timeout_ms; key_serial_t ta_keyring; key_serial_t ta_my_cert; @@ -71,6 +72,10 @@ instantiated a struct file in sock->file. has completed. Further explanation of this function is in the "Handshake Completion" section below. +The consumer can provide a NUL-terminated hostname in the @ta_peername +field that is sent as part of ClientHello. If no peername is provided, +the DNS hostname associated with the server's IP address is used instead. + The consumer can fill in the @ta_timeout_ms field to force the servicing handshake agent to exit after a number of milliseconds. 
This enables the socket to be fully closed once both the kernel and the handshake agent diff --git a/include/net/handshake.h b/include/net/handshake.h index 3352b1ab43b3..2e26e436e85f 100644 --- a/include/net/handshake.h +++ b/include/net/handshake.h @@ -24,6 +24,7 @@ struct tls_handshake_args { struct socket *ta_sock; tls_done_func_t ta_done; void *ta_data; + const char *ta_peername; unsigned int ta_timeout_ms; key_serial_t ta_keyring; key_serial_t ta_my_cert; diff --git a/include/uapi/linux/handshake.h b/include/uapi/linux/handshake.h index 1de4d0b95325..3d7ea58778c9 100644 --- a/include/uapi/linux/handshake.h +++ b/include/uapi/linux/handshake.h @@ -44,6 +44,7 @@ enum { HANDSHAKE_A_ACCEPT_AUTH_MODE, HANDSHAKE_A_ACCEPT_PEER_IDENTITY, HANDSHAKE_A_ACCEPT_CERTIFICATE, + HANDSHAKE_A_ACCEPT_PEERNAME, __HANDSHAKE_A_ACCEPT_MAX, HANDSHAKE_A_ACCEPT_MAX = (__HANDSHAKE_A_ACCEPT_MAX - 1) diff --git a/net/handshake/tlshd.c b/net/handshake/tlshd.c index fcbeb63b4eb1..b735f5cced2f 100644 --- a/net/handshake/tlshd.c +++ b/net/handshake/tlshd.c @@ -31,6 +31,7 @@ struct tls_handshake_req { int th_type; unsigned int th_timeout_ms; int th_auth_mode; + const char *th_peername; key_serial_t th_keyring; key_serial_t th_certificate; key_serial_t th_privkey; @@ -48,6 +49,7 @@ tls_handshake_req_init(struct handshake_req *req, treq->th_timeout_ms = args->ta_timeout_ms; treq->th_consumer_done = args->ta_done; treq->th_consumer_data = args->ta_data; + treq->th_peername = args->ta_peername; treq->th_keyring = args->ta_keyring; treq->th_num_peerids = 0; treq->th_certificate = TLS_NO_CERT; @@ -214,6 +216,12 @@ static int tls_handshake_accept(struct handshake_req *req, ret = nla_put_u32(msg, HANDSHAKE_A_ACCEPT_MESSAGE_TYPE, treq->th_type); if (ret < 0) goto out_cancel; + if (treq->th_peername) { + ret = nla_put_string(msg, HANDSHAKE_A_ACCEPT_PEERNAME, + treq->th_peername); + if (ret < 0) + goto out_cancel; + } if (treq->th_timeout_ms) { ret = nla_put_u32(msg, HANDSHAKE_A_ACCEPT_TIMEOUT, 
treq->th_timeout_ms); if (ret < 0) -- cgit v1.2.3 From 6b6a23d5d8e857e0dda1bbe5043cf4d5e9c711d3 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 11 May 2023 10:04:56 -0700 Subject: bpf: Document EFAULT changes for sockopt And add examples for how to correctly handle large optlens. This is less relevant now when we don't EFAULT anymore, but that's still the correct thing to do. Signed-off-by: Stanislav Fomichev Link: https://lore.kernel.org/r/20230511170456.1759459-5-sdf@google.com Signed-off-by: Martin KaFai Lau --- Documentation/bpf/prog_cgroup_sockopt.rst | 57 ++++++++++++++++++++++++++++++- 1 file changed, 56 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/bpf/prog_cgroup_sockopt.rst b/Documentation/bpf/prog_cgroup_sockopt.rst index 172f957204bf..1226a94af07a 100644 --- a/Documentation/bpf/prog_cgroup_sockopt.rst +++ b/Documentation/bpf/prog_cgroup_sockopt.rst @@ -98,10 +98,65 @@ can access only the first ``PAGE_SIZE`` of that data. So it has to options: indicates that the kernel should use BPF's trimmed ``optval``. When the BPF program returns with the ``optlen`` greater than -``PAGE_SIZE``, the userspace will receive ``EFAULT`` errno. +``PAGE_SIZE``, the userspace will receive original kernel +buffers without any modifications that the BPF program might have +applied. Example ======= +Recommended way to handle BPF programs is as follows: + +.. code-block:: c + + SEC("cgroup/getsockopt") + int getsockopt(struct bpf_sockopt *ctx) + { + /* Custom socket option. */ + if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) { + ctx->retval = 0; + optval[0] = ...; + ctx->optlen = 1; + return 1; + } + + /* Modify kernel's socket option. */ + if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) { + ctx->retval = 0; + optval[0] = ...; + ctx->optlen = 1; + return 1; + } + + /* optval larger than PAGE_SIZE use kernel's buffer. 
*/ + if (ctx->optlen > PAGE_SIZE) + ctx->optlen = 0; + + return 1; + } + + SEC("cgroup/setsockopt") + int setsockopt(struct bpf_sockopt *ctx) + { + /* Custom socket option. */ + if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) { + /* do something */ + ctx->optlen = -1; + return 1; + } + + /* Modify kernel's socket option. */ + if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) { + optval[0] = ...; + return 1; + } + + /* optval larger than PAGE_SIZE use kernel's buffer. */ + if (ctx->optlen > PAGE_SIZE) + ctx->optlen = 0; + + return 1; + } + See ``tools/testing/selftests/bpf/progs/sockopt_sk.c`` for an example of BPF program that handles socket options. -- cgit v1.2.3 From b2cbac9b9b28730e9e53be20b6cdf979d3b9f27e Mon Sep 17 00:00:00 2001 From: Angus Chen Date: Fri, 12 May 2023 09:01:52 +0800 Subject: net: Remove low_thresh in ip defrag As low_thresh has no work in fragment reassembles,del it. And Mark it deprecated in sysctl Document. Signed-off-by: Angus Chen Signed-off-by: David S. Miller --- Documentation/networking/nf_conntrack-sysctl.rst | 1 + include/net/inet_frag.h | 1 - net/ieee802154/6lowpan/reassembly.c | 9 ++++----- net/ipv4/ip_fragment.c | 13 +++++-------- net/ipv6/netfilter/nf_conntrack_reasm.c | 9 ++++----- net/ipv6/reassembly.c | 9 ++++----- 6 files changed, 18 insertions(+), 24 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst index 8b1045c3b59e..9ca356bc7217 100644 --- a/Documentation/networking/nf_conntrack-sysctl.rst +++ b/Documentation/networking/nf_conntrack-sysctl.rst @@ -55,6 +55,7 @@ nf_conntrack_frag6_high_thresh - INTEGER nf_conntrack_frag6_low_thresh is reached. 
nf_conntrack_frag6_low_thresh - INTEGER + (Obsolete since linux-4.17) default 196608 See nf_conntrack_frag6_low_thresh diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h index 325ad893f624..8543e740891a 100644 --- a/include/net/inet_frag.h +++ b/include/net/inet_frag.h @@ -13,7 +13,6 @@ struct fqdir { /* sysctls */ long high_thresh; - long low_thresh; int timeout; int max_dist; struct inet_frags *f; diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c index a91283d1e5bf..3ba4c0f27af9 100644 --- a/net/ieee802154/6lowpan/reassembly.c +++ b/net/ieee802154/6lowpan/reassembly.c @@ -318,7 +318,7 @@ err: } #ifdef CONFIG_SYSCTL - +static unsigned long lowpanfrag_low_thresh_unuesd = IPV6_FRAG_LOW_THRESH; static struct ctl_table lowpan_frags_ns_ctl_table[] = { { .procname = "6lowpanfrag_high_thresh", @@ -374,9 +374,9 @@ static int __net_init lowpan_frags_ns_sysctl_register(struct net *net) } table[0].data = &ieee802154_lowpan->fqdir->high_thresh; - table[0].extra1 = &ieee802154_lowpan->fqdir->low_thresh; - table[1].data = &ieee802154_lowpan->fqdir->low_thresh; - table[1].extra2 = &ieee802154_lowpan->fqdir->high_thresh; + table[0].extra1 = &lowpanfrag_low_thresh_unuesd; + table[1].data = &lowpanfrag_low_thresh_unuesd; + table[1].extra2 = &ieee802154_lowpan->fqdir->high_thresh; table[2].data = &ieee802154_lowpan->fqdir->timeout; hdr = register_net_sysctl(net, "net/ieee802154/6lowpan", table); @@ -451,7 +451,6 @@ static int __net_init lowpan_frags_init_net(struct net *net) return res; ieee802154_lowpan->fqdir->high_thresh = IPV6_FRAG_HIGH_THRESH; - ieee802154_lowpan->fqdir->low_thresh = IPV6_FRAG_LOW_THRESH; ieee802154_lowpan->fqdir->timeout = IPV6_FRAG_TIMEOUT; res = lowpan_frags_ns_sysctl_register(net); diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c index 69c00ffdcf3e..0db5eb3dec83 100644 --- a/net/ipv4/ip_fragment.c +++ b/net/ipv4/ip_fragment.c @@ -553,7 +553,7 @@ EXPORT_SYMBOL(ip_check_defrag); #ifdef CONFIG_SYSCTL 
static int dist_min; - +static unsigned long ipfrag_low_thresh_unused; static struct ctl_table ip4_frags_ns_ctl_table[] = { { .procname = "ipfrag_high_thresh", @@ -609,9 +609,9 @@ static int __net_init ip4_frags_ns_ctl_register(struct net *net) } table[0].data = &net->ipv4.fqdir->high_thresh; - table[0].extra1 = &net->ipv4.fqdir->low_thresh; - table[1].data = &net->ipv4.fqdir->low_thresh; - table[1].extra2 = &net->ipv4.fqdir->high_thresh; + table[0].extra1 = &ipfrag_low_thresh_unused; + table[1].data = &ipfrag_low_thresh_unused; + table[1].extra2 = &net->ipv4.fqdir->high_thresh; table[2].data = &net->ipv4.fqdir->timeout; table[3].data = &net->ipv4.fqdir->max_dist; @@ -674,12 +674,9 @@ static int __net_init ipv4_frags_init_net(struct net *net) * A 64K fragment consumes 129736 bytes (44*2944)+200 * (1500 truesize == 2944, sizeof(struct ipq) == 200) * - * We will commit 4MB at one time. Should we cross that limit - * we will prune down to 3MB, making room for approx 8 big 64K - * fragments 8x128k. + * We will commit 4MB at one time. Should we cross that limit. */ net->ipv4.fqdir->high_thresh = 4 * 1024 * 1024; - net->ipv4.fqdir->low_thresh = 3 * 1024 * 1024; /* * Important NOTE! Fragment queue must be destroyed before MSL expires. 
* RFC791 is wrong proposing to prolongate timer each fragment arrival diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c index d13240f13607..dc8a2854e7f3 100644 --- a/net/ipv6/netfilter/nf_conntrack_reasm.c +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c @@ -42,7 +42,7 @@ static struct nft_ct_frag6_pernet *nf_frag_pernet(struct net *net) } #ifdef CONFIG_SYSCTL - +static unsigned long nf_conntrack_frag6_low_thresh_unused = IPV6_FRAG_LOW_THRESH; static struct ctl_table nf_ct_frag6_sysctl_table[] = { { .procname = "nf_conntrack_frag6_timeout", @@ -82,10 +82,10 @@ static int nf_ct_frag6_sysctl_register(struct net *net) nf_frag = nf_frag_pernet(net); table[0].data = &nf_frag->fqdir->timeout; - table[1].data = &nf_frag->fqdir->low_thresh; - table[1].extra2 = &nf_frag->fqdir->high_thresh; + table[1].data = &nf_conntrack_frag6_low_thresh_unused; + table[1].extra2 = &nf_frag->fqdir->high_thresh; table[2].data = &nf_frag->fqdir->high_thresh; - table[2].extra1 = &nf_frag->fqdir->low_thresh; + table[2].extra1 = &nf_conntrack_frag6_low_thresh_unused; hdr = register_net_sysctl(net, "net/netfilter", table); if (hdr == NULL) @@ -500,7 +500,6 @@ static int nf_ct_net_init(struct net *net) return res; nf_frag->fqdir->high_thresh = IPV6_FRAG_HIGH_THRESH; - nf_frag->fqdir->low_thresh = IPV6_FRAG_LOW_THRESH; nf_frag->fqdir->timeout = IPV6_FRAG_TIMEOUT; res = nf_ct_frag6_sysctl_register(net); diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c index 5bc8a28e67f9..eb8373c25675 100644 --- a/net/ipv6/reassembly.c +++ b/net/ipv6/reassembly.c @@ -416,7 +416,7 @@ static const struct inet6_protocol frag_protocol = { }; #ifdef CONFIG_SYSCTL - +static unsigned long ip6_frags_low_thresh_unused = IPV6_FRAG_LOW_THRESH; static struct ctl_table ip6_frags_ns_ctl_table[] = { { .procname = "ip6frag_high_thresh", @@ -465,9 +465,9 @@ static int __net_init ip6_frags_ns_sysctl_register(struct net *net) } table[0].data = &net->ipv6.fqdir->high_thresh; - 
table[0].extra1 = &net->ipv6.fqdir->low_thresh; - table[1].data = &net->ipv6.fqdir->low_thresh; - table[1].extra2 = &net->ipv6.fqdir->high_thresh; + table[0].extra1 = &ip6_frags_low_thresh_unused; + table[1].data = &ip6_frags_low_thresh_unused; + table[1].extra2 = &net->ipv6.fqdir->high_thresh; table[2].data = &net->ipv6.fqdir->timeout; hdr = register_net_sysctl(net, "net/ipv6", table); @@ -536,7 +536,6 @@ static int __net_init ipv6_frags_init_net(struct net *net) return res; net->ipv6.fqdir->high_thresh = IPV6_FRAG_HIGH_THRESH; - net->ipv6.fqdir->low_thresh = IPV6_FRAG_LOW_THRESH; net->ipv6.fqdir->timeout = IPV6_FRAG_TIMEOUT; res = ip6_frags_ns_sysctl_register(net); -- cgit v1.2.3 From efe103065ccb4984c094d1547d71d498129cdd89 Mon Sep 17 00:00:00 2001 From: Hariprasad Kelam Date: Sat, 13 May 2023 14:21:43 +0530 Subject: docs: octeontx2: Add Documentation for QOS Add QOS example configuration along with tc-htb commands Signed-off-by: Hariprasad Kelam Reviewed-by: Simon Horman Signed-off-by: David S. Miller --- .../device_drivers/ethernet/marvell/octeontx2.rst | 45 ++++++++++++++++++++++ 1 file changed, 45 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst index 5ba9015336e2..bfd233cfac35 100644 --- a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst +++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst @@ -13,6 +13,7 @@ Contents - `Drivers`_ - `Basic packet flow`_ - `Devlink health reporters`_ +- `Quality of service`_ Overview ======== @@ -287,3 +288,47 @@ For example:: NIX_AF_ERR: NIX Error Interrupt Reg : 64 Rx on unmapped PF_FUNC + + +Quality of service +================== + + +Hardware algorithms used in scheduling +-------------------------------------- + +octeontx2 silicon and CN10K transmit interface consists of five transmit levels +starting from SMQ/MDQ, TL4 to TL1. 
Each packet will traverse MDQ, TL4 to TL1 +levels. Each level contains an array of queues to support scheduling and shaping. +The hardware uses the below algorithms depending on the priority of scheduler queues. +Once the user creates tc classes with different priorities, the driver configures +schedulers allocated to the class with specified priority along with rate-limiting +configuration. + +1. Strict Priority + + - Once packets are submitted to MDQ, hardware picks all active MDQs having different priority + using strict priority. + +2. Round Robin + + - Active MDQs having the same priority level are chosen using round robin. + + +Setup HTB offload +----------------- + +1. Enable HW TC offload on the interface:: + + # ethtool -K <interface> hw-tc-offload on + +2. Create htb root:: + + # tc qdisc add dev <interface> clsact + # tc qdisc replace dev <interface> root handle 1: htb offload + +3. Create tc classes with different priorities:: + + # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1 + + # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7 -- cgit v1.2.3 From e7480a44d7c4ce4691fa6bcdb0318f0d81fe4b12 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 16 May 2023 20:41:12 -0700 Subject: Revert "net: Remove low_thresh in ip defrag" This reverts commit b2cbac9b9b28730e9e53be20b6cdf979d3b9f27e. We have multiple reports of obvious breakage from this patch.
Reported-by: Ido Schimmel Link: https://lore.kernel.org/all/ZGIRWjNcfqI8yY8W@shredder/ Link: https://lore.kernel.org/all/CADJHv_sDK=0RrMA2FTZQV5fw7UQ+qY=HG21Wu5qb0V9vvx5w6A@mail.gmail.com/ Reported-by: syzbot+a5e719ac7c268e414c95@syzkaller.appspotmail.com Reported-by: syzbot+a03fd670838d927d9cd8@syzkaller.appspotmail.com Fixes: b2cbac9b9b28 ("net: Remove low_thresh in ip defrag") Link: https://lore.kernel.org/r/20230517034112.1261835-1-kuba@kernel.org Signed-off-by: Jakub Kicinski --- Documentation/networking/nf_conntrack-sysctl.rst | 1 - include/net/inet_frag.h | 1 + net/ieee802154/6lowpan/reassembly.c | 9 +++++---- net/ipv4/ip_fragment.c | 13 ++++++++----- net/ipv6/netfilter/nf_conntrack_reasm.c | 9 +++++---- net/ipv6/reassembly.c | 9 +++++---- 6 files changed, 24 insertions(+), 18 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst index 9ca356bc7217..8b1045c3b59e 100644 --- a/Documentation/networking/nf_conntrack-sysctl.rst +++ b/Documentation/networking/nf_conntrack-sysctl.rst @@ -55,7 +55,6 @@ nf_conntrack_frag6_high_thresh - INTEGER nf_conntrack_frag6_low_thresh is reached. 
nf_conntrack_frag6_low_thresh - INTEGER - (Obsolete since linux-4.17) default 196608 See nf_conntrack_frag6_low_thresh diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h index 8543e740891a..325ad893f624 100644 --- a/include/net/inet_frag.h +++ b/include/net/inet_frag.h @@ -13,6 +13,7 @@ struct fqdir { /* sysctls */ long high_thresh; + long low_thresh; int timeout; int max_dist; struct inet_frags *f; diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c index 3ba4c0f27af9..a91283d1e5bf 100644 --- a/net/ieee802154/6lowpan/reassembly.c +++ b/net/ieee802154/6lowpan/reassembly.c @@ -318,7 +318,7 @@ err: } #ifdef CONFIG_SYSCTL -static unsigned long lowpanfrag_low_thresh_unuesd = IPV6_FRAG_LOW_THRESH; + static struct ctl_table lowpan_frags_ns_ctl_table[] = { { .procname = "6lowpanfrag_high_thresh", @@ -374,9 +374,9 @@ static int __net_init lowpan_frags_ns_sysctl_register(struct net *net) } table[0].data = &ieee802154_lowpan->fqdir->high_thresh; - table[0].extra1 = &lowpanfrag_low_thresh_unuesd; - table[1].data = &lowpanfrag_low_thresh_unuesd; - table[1].extra2 = &ieee802154_lowpan->fqdir->high_thresh; + table[0].extra1 = &ieee802154_lowpan->fqdir->low_thresh; + table[1].data = &ieee802154_lowpan->fqdir->low_thresh; + table[1].extra2 = &ieee802154_lowpan->fqdir->high_thresh; table[2].data = &ieee802154_lowpan->fqdir->timeout; hdr = register_net_sysctl(net, "net/ieee802154/6lowpan", table); @@ -451,6 +451,7 @@ static int __net_init lowpan_frags_init_net(struct net *net) return res; ieee802154_lowpan->fqdir->high_thresh = IPV6_FRAG_HIGH_THRESH; + ieee802154_lowpan->fqdir->low_thresh = IPV6_FRAG_LOW_THRESH; ieee802154_lowpan->fqdir->timeout = IPV6_FRAG_TIMEOUT; res = lowpan_frags_ns_sysctl_register(net); diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c index 0db5eb3dec83..69c00ffdcf3e 100644 --- a/net/ipv4/ip_fragment.c +++ b/net/ipv4/ip_fragment.c @@ -553,7 +553,7 @@ EXPORT_SYMBOL(ip_check_defrag); #ifdef CONFIG_SYSCTL 
static int dist_min; -static unsigned long ipfrag_low_thresh_unused; + static struct ctl_table ip4_frags_ns_ctl_table[] = { { .procname = "ipfrag_high_thresh", @@ -609,9 +609,9 @@ static int __net_init ip4_frags_ns_ctl_register(struct net *net) } table[0].data = &net->ipv4.fqdir->high_thresh; - table[0].extra1 = &ipfrag_low_thresh_unused; - table[1].data = &ipfrag_low_thresh_unused; - table[1].extra2 = &net->ipv4.fqdir->high_thresh; + table[0].extra1 = &net->ipv4.fqdir->low_thresh; + table[1].data = &net->ipv4.fqdir->low_thresh; + table[1].extra2 = &net->ipv4.fqdir->high_thresh; table[2].data = &net->ipv4.fqdir->timeout; table[3].data = &net->ipv4.fqdir->max_dist; @@ -674,9 +674,12 @@ static int __net_init ipv4_frags_init_net(struct net *net) * A 64K fragment consumes 129736 bytes (44*2944)+200 * (1500 truesize == 2944, sizeof(struct ipq) == 200) * - * We will commit 4MB at one time. Should we cross that limit. + * We will commit 4MB at one time. Should we cross that limit + * we will prune down to 3MB, making room for approx 8 big 64K + * fragments 8x128k. */ net->ipv4.fqdir->high_thresh = 4 * 1024 * 1024; + net->ipv4.fqdir->low_thresh = 3 * 1024 * 1024; /* * Important NOTE! Fragment queue must be destroyed before MSL expires. 
* RFC791 is wrong proposing to prolongate timer each fragment arrival diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c index dc8a2854e7f3..d13240f13607 100644 --- a/net/ipv6/netfilter/nf_conntrack_reasm.c +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c @@ -42,7 +42,7 @@ static struct nft_ct_frag6_pernet *nf_frag_pernet(struct net *net) } #ifdef CONFIG_SYSCTL -static unsigned long nf_conntrack_frag6_low_thresh_unused = IPV6_FRAG_LOW_THRESH; + static struct ctl_table nf_ct_frag6_sysctl_table[] = { { .procname = "nf_conntrack_frag6_timeout", @@ -82,10 +82,10 @@ static int nf_ct_frag6_sysctl_register(struct net *net) nf_frag = nf_frag_pernet(net); table[0].data = &nf_frag->fqdir->timeout; - table[1].data = &nf_conntrack_frag6_low_thresh_unused; - table[1].extra2 = &nf_frag->fqdir->high_thresh; + table[1].data = &nf_frag->fqdir->low_thresh; + table[1].extra2 = &nf_frag->fqdir->high_thresh; table[2].data = &nf_frag->fqdir->high_thresh; - table[2].extra1 = &nf_conntrack_frag6_low_thresh_unused; + table[2].extra1 = &nf_frag->fqdir->low_thresh; hdr = register_net_sysctl(net, "net/netfilter", table); if (hdr == NULL) @@ -500,6 +500,7 @@ static int nf_ct_net_init(struct net *net) return res; nf_frag->fqdir->high_thresh = IPV6_FRAG_HIGH_THRESH; + nf_frag->fqdir->low_thresh = IPV6_FRAG_LOW_THRESH; nf_frag->fqdir->timeout = IPV6_FRAG_TIMEOUT; res = nf_ct_frag6_sysctl_register(net); diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c index eb8373c25675..5bc8a28e67f9 100644 --- a/net/ipv6/reassembly.c +++ b/net/ipv6/reassembly.c @@ -416,7 +416,7 @@ static const struct inet6_protocol frag_protocol = { }; #ifdef CONFIG_SYSCTL -static unsigned long ip6_frags_low_thresh_unused = IPV6_FRAG_LOW_THRESH; + static struct ctl_table ip6_frags_ns_ctl_table[] = { { .procname = "ip6frag_high_thresh", @@ -465,9 +465,9 @@ static int __net_init ip6_frags_ns_sysctl_register(struct net *net) } table[0].data = &net->ipv6.fqdir->high_thresh; - 
table[0].extra1 = &ip6_frags_low_thresh_unused; - table[1].data = &ip6_frags_low_thresh_unused; - table[1].extra2 = &net->ipv6.fqdir->high_thresh; + table[0].extra1 = &net->ipv6.fqdir->low_thresh; + table[1].data = &net->ipv6.fqdir->low_thresh; + table[1].extra2 = &net->ipv6.fqdir->high_thresh; table[2].data = &net->ipv6.fqdir->timeout; hdr = register_net_sysctl(net, "net/ipv6", table); @@ -536,6 +536,7 @@ static int __net_init ipv6_frags_init_net(struct net *net) return res; net->ipv6.fqdir->high_thresh = IPV6_FRAG_HIGH_THRESH; + net->ipv6.fqdir->low_thresh = IPV6_FRAG_LOW_THRESH; net->ipv6.fqdir->timeout = IPV6_FRAG_TIMEOUT; res = ip6_frags_ns_sysctl_register(net); -- cgit v1.2.3 From af2eab1a824349cfb0f6a720ad06eea48e9e6b74 Mon Sep 17 00:00:00 2001 From: Krzysztof Kozlowski Date: Wed, 17 May 2023 10:26:02 +0200 Subject: dt-bindings: net: nxp,sja1105: document spi-cpol/cpha Some boards use SJA1105 Ethernet Switch with SPI CPHA, while ones with SJA1110 use SPI CPOL, so document this to fix dtbs_check warnings: arch/arm64/boot/dts/freescale/fsl-lx2160a-bluebox3.dtb: ethernet-switch@0: Unevaluated properties are not allowed ('spi-cpol' was unexpected) Reviewed-by: Conor Dooley Reviewed-by: Vladimir Oltean Signed-off-by: Krzysztof Kozlowski Signed-off-by: David S. Miller --- .../devicetree/bindings/net/dsa/nxp,sja1105.yaml | 32 +++++++++++++++++++--- 1 file changed, 28 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml b/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml index 9a64ed658745..4d5f5cc6d031 100644 --- a/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml +++ b/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml @@ -12,10 +12,6 @@ description: cs_sck_delay of 500ns. Ensuring that this SPI timing requirement is observed depends on the SPI bus master driver. 
-allOf: - - $ref: dsa.yaml#/$defs/ethernet-ports - - $ref: /schemas/spi/spi-peripheral-props.yaml# - maintainers: - Vladimir Oltean @@ -36,6 +32,9 @@ properties: reg: maxItems: 1 + spi-cpha: true + spi-cpol: true + # Optional container node for the 2 internal MDIO buses of the SJA1110 # (one for the internal 100base-T1 PHYs and the other for the single # 100base-TX PHY). The "reg" property does not have physical significance. @@ -109,6 +108,30 @@ $defs: 1860, 1880, 1900, 1920, 1940, 1960, 1980, 2000, 2020, 2040, 2060, 2080, 2100, 2120, 2140, 2160, 2180, 2200, 2220, 2240, 2260] +allOf: + - $ref: dsa.yaml#/$defs/ethernet-ports + - $ref: /schemas/spi/spi-peripheral-props.yaml# + - if: + properties: + compatible: + enum: + - nxp,sja1105e + - nxp,sja1105p + - nxp,sja1105q + - nxp,sja1105r + - nxp,sja1105s + - nxp,sja1105t + then: + properties: + spi-cpol: false + required: + - spi-cpha + else: + properties: + spi-cpha: false + required: + - spi-cpol + unevaluatedProperties: false examples: @@ -120,6 +143,7 @@ examples: ethernet-switch@1 { reg = <0x1>; compatible = "nxp,sja1105t"; + spi-cpha; ethernet-ports { #address-cells = <1>; -- cgit v1.2.3 From 8819495a754e71d3c3fde991c26ad832af995136 Mon Sep 17 00:00:00 2001 From: Dave Thaler Date: Tue, 9 May 2023 18:08:45 +0000 Subject: bpf, docs: Shift operations are defined to use a mask Update the documentation regarding shift operations to explain the use of a mask, since otherwise shifting by a value out of range (like negative) is undefined. 
Signed-off-by: Dave Thaler Signed-off-by: Daniel Borkmann Acked-by: Yonghong Song Link: https://lore.kernel.org/bpf/20230509180845.1236-1-dthaler1968@googlemail.com --- Documentation/bpf/instruction-set.rst | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst index 492980ece1ab..6644842cd3ea 100644 --- a/Documentation/bpf/instruction-set.rst +++ b/Documentation/bpf/instruction-set.rst @@ -163,13 +163,13 @@ BPF_MUL 0x20 dst \*= src BPF_DIV 0x30 dst = (src != 0) ? (dst / src) : 0 BPF_OR 0x40 dst \|= src BPF_AND 0x50 dst &= src -BPF_LSH 0x60 dst <<= src -BPF_RSH 0x70 dst >>= src +BPF_LSH 0x60 dst <<= (src & mask) +BPF_RSH 0x70 dst >>= (src & mask) BPF_NEG 0x80 dst = ~src BPF_MOD 0x90 dst = (src != 0) ? (dst % src) : dst BPF_XOR 0xa0 dst ^= src BPF_MOV 0xb0 dst = src -BPF_ARSH 0xc0 sign extending shift right +BPF_ARSH 0xc0 sign extending dst >>= (src & mask) BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below) ======== ===== ========================================================== @@ -204,6 +204,9 @@ for ``BPF_ALU64``, 'imm' is first sign extended to 64 bits and the result interpreted as an unsigned 64-bit value. There are no instructions for signed division or modulo. +Shift operations use a mask of 0x3F (63) for 64-bit operations and 0x1F (31) +for 32-bit operations. + Byte swap instructions ~~~~~~~~~~~~~~~~~~~~~~ -- cgit v1.2.3 From 1c769b1a303f7a3b447fc7244340b77823bdbfdc Mon Sep 17 00:00:00 2001 From: Dave Ertman Date: Tue, 16 May 2023 13:30:55 +0200 Subject: ice: Remove LAG+SRIOV mutual exclusion There was a change previously to stop SR-IOV and LAG from existing on the same interface. This was to prevent the violation of LACP (Link Aggregation Control Protocol). 
The method to achieve this was to add a no-op Rx handler onto the netdev when SR-IOV VFs were present, thus blocking bonding, bridging, etc from claiming the interface by adding its own Rx handler. Also, when an interface was added into a aggregate, then the SR-IOV capability was set to false. There are some users that have in house solutions using both SR-IOV and bridging/bonding that this method interferes with (e.g. creating duplicate VFs on the bonded interfaces and failing between them when the interface fails over). It makes more sense to provide the most functionality possible, the restriction on co-existence of these features will be removed. No additional functionality is currently being provided beyond what existed before the co-existence restriction was put into place. It is up to the end user to not implement a solution that would interfere with existing network protocols. Reviewed-by: Michal Swiatkowski Signed-off-by: Dave Ertman Signed-off-by: Wojciech Drewek Tested-by: Pucha Himasekhar Reddy (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- .../device_drivers/ethernet/intel/ice.rst | 18 -------- drivers/net/ethernet/intel/ice/ice.h | 19 -------- drivers/net/ethernet/intel/ice/ice_lag.c | 12 ----- drivers/net/ethernet/intel/ice/ice_lag.h | 54 ---------------------- drivers/net/ethernet/intel/ice/ice_lib.c | 2 - drivers/net/ethernet/intel/ice/ice_sriov.c | 4 -- 6 files changed, 109 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst index 69695e5511f4..e4d065c55ea8 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst @@ -84,24 +84,6 @@ Once the VM shuts down, or otherwise releases the VF, the command will complete. 
-Important notes for SR-IOV and Link Aggregation ------------------------------------------------ -Link Aggregation is mutually exclusive with SR-IOV. - -- If Link Aggregation is active, SR-IOV VFs cannot be created on the PF. -- If SR-IOV is active, you cannot set up Link Aggregation on the interface. - -Bridging and MACVLAN are also affected by this. If you wish to use bridging or -MACVLAN with SR-IOV, you must set up bridging or MACVLAN before enabling -SR-IOV. If you are using bridging or MACVLAN in conjunction with SR-IOV, and -you want to remove the interface from the bridge or MACVLAN, you must follow -these steps: - -1. Destroy SR-IOV VFs if they exist -2. Remove the interface from the bridge or MACVLAN -3. Recreate SRIOV VFs as needed - - Additional Features and Configurations ====================================== diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h index 8b016511561f..b4bca1d964a9 100644 --- a/drivers/net/ethernet/intel/ice/ice.h +++ b/drivers/net/ethernet/intel/ice/ice.h @@ -814,25 +814,6 @@ static inline bool ice_is_switchdev_running(struct ice_pf *pf) return pf->switchdev.is_running; } -/** - * ice_set_sriov_cap - enable SRIOV in PF flags - * @pf: PF struct - */ -static inline void ice_set_sriov_cap(struct ice_pf *pf) -{ - if (pf->hw.func_caps.common_cap.sr_iov_1_1) - set_bit(ICE_FLAG_SRIOV_CAPABLE, pf->flags); -} - -/** - * ice_clear_sriov_cap - disable SRIOV in PF flags - * @pf: PF struct - */ -static inline void ice_clear_sriov_cap(struct ice_pf *pf) -{ - clear_bit(ICE_FLAG_SRIOV_CAPABLE, pf->flags); -} - #define ICE_FD_STAT_CTR_BLOCK_COUNT 256 #define ICE_FD_STAT_PF_IDX(base_idx) \ ((base_idx) * ICE_FD_STAT_CTR_BLOCK_COUNT) diff --git a/drivers/net/ethernet/intel/ice/ice_lag.c b/drivers/net/ethernet/intel/ice/ice_lag.c index ee5b36941ba3..5a7753bda324 100644 --- a/drivers/net/ethernet/intel/ice/ice_lag.c +++ b/drivers/net/ethernet/intel/ice/ice_lag.c @@ -6,15 +6,6 @@ #include "ice.h" #include 
"ice_lag.h" -/** - * ice_lag_nop_handler - no-op Rx handler to disable LAG - * @pskb: pointer to skb pointer - */ -rx_handler_result_t ice_lag_nop_handler(struct sk_buff __always_unused **pskb) -{ - return RX_HANDLER_PASS; -} - /** * ice_lag_set_primary - set PF LAG state as Primary * @lag: LAG info struct @@ -158,7 +149,6 @@ ice_lag_link(struct ice_lag *lag, struct netdev_notifier_changeupper_info *info) lag->upper_netdev = upper; } - ice_clear_sriov_cap(pf); ice_clear_rdma_cap(pf); lag->bonded = true; @@ -205,7 +195,6 @@ ice_lag_unlink(struct ice_lag *lag, } lag->peer_netdev = NULL; - ice_set_sriov_cap(pf); ice_set_rdma_cap(pf); lag->bonded = false; lag->role = ICE_LAG_NONE; @@ -229,7 +218,6 @@ static void ice_lag_unregister(struct ice_lag *lag, struct net_device *netdev) if (lag->upper_netdev) { dev_put(lag->upper_netdev); lag->upper_netdev = NULL; - ice_set_sriov_cap(pf); ice_set_rdma_cap(pf); } /* perform some cleanup in case we come back */ diff --git a/drivers/net/ethernet/intel/ice/ice_lag.h b/drivers/net/ethernet/intel/ice/ice_lag.h index 51b5cf467ce2..2c373676c42f 100644 --- a/drivers/net/ethernet/intel/ice/ice_lag.h +++ b/drivers/net/ethernet/intel/ice/ice_lag.h @@ -25,63 +25,9 @@ struct ice_lag { struct notifier_block notif_block; u8 bonded:1; /* currently bonded */ u8 primary:1; /* this is primary */ - u8 handler:1; /* did we register a rx_netdev_handler */ - /* each thing blocking bonding will increment this value by one. - * If this value is zero, then bonding is allowed. 
- */ - u16 dis_lag; u8 role; }; int ice_init_lag(struct ice_pf *pf); void ice_deinit_lag(struct ice_pf *pf); -rx_handler_result_t ice_lag_nop_handler(struct sk_buff **pskb); - -/** - * ice_disable_lag - increment LAG disable count - * @lag: LAG struct - */ -static inline void ice_disable_lag(struct ice_lag *lag) -{ - /* If LAG this PF is not already disabled, disable it */ - rtnl_lock(); - if (!netdev_is_rx_handler_busy(lag->netdev)) { - if (!netdev_rx_handler_register(lag->netdev, - ice_lag_nop_handler, - NULL)) - lag->handler = true; - } - rtnl_unlock(); - lag->dis_lag++; -} - -/** - * ice_enable_lag - decrement disable count for a PF - * @lag: LAG struct - * - * Decrement the disable counter for a port, and if that count reaches - * zero, then remove the no-op Rx handler from that netdev - */ -static inline void ice_enable_lag(struct ice_lag *lag) -{ - if (lag->dis_lag) - lag->dis_lag--; - if (!lag->dis_lag && lag->handler) { - rtnl_lock(); - netdev_rx_handler_unregister(lag->netdev); - rtnl_unlock(); - lag->handler = false; - } -} - -/** - * ice_is_lag_dis - is LAG disabled - * @lag: LAG struct - * - * Return true if bonding is disabled - */ -static inline bool ice_is_lag_dis(struct ice_lag *lag) -{ - return !!(lag->dis_lag); -} #endif /* _ICE_LAG_H_ */ diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c index 387bb9cbafbe..3de9556b89ac 100644 --- a/drivers/net/ethernet/intel/ice/ice_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_lib.c @@ -2707,8 +2707,6 @@ ice_vsi_setup(struct ice_pf *pf, struct ice_vsi_cfg_params *params) return vsi; err_vsi_cfg: - if (params->type == ICE_VSI_VF) - ice_enable_lag(pf->lag); ice_vsi_free(vsi); return NULL; diff --git a/drivers/net/ethernet/intel/ice/ice_sriov.c b/drivers/net/ethernet/intel/ice/ice_sriov.c index 80c643fb9f2f..a7e7debb1428 100644 --- a/drivers/net/ethernet/intel/ice/ice_sriov.c +++ b/drivers/net/ethernet/intel/ice/ice_sriov.c @@ -979,8 +979,6 @@ int 
ice_sriov_configure(struct pci_dev *pdev, int num_vfs) if (!num_vfs) { if (!pci_vfs_assigned(pdev)) { ice_free_vfs(pf); - if (pf->lag) - ice_enable_lag(pf->lag); return 0; } @@ -992,8 +990,6 @@ int ice_sriov_configure(struct pci_dev *pdev, int num_vfs) if (err) return err; - if (pf->lag) - ice_disable_lag(pf->lag); return num_vfs; } -- cgit v1.2.3 From bddd2e561b0ad5ca42e16fb26a20fc806d521912 Mon Sep 17 00:00:00 2001 From: Donald Hunter Date: Tue, 23 May 2023 10:37:48 +0100 Subject: tools: ynl: Handle byte-order in struct members Add support for byte-order in struct members in the genetlink-legacy spec. Signed-off-by: Donald Hunter Acked-by: Jakub Kicinski Signed-off-by: David S. Miller --- Documentation/netlink/genetlink-legacy.yaml | 2 ++ tools/net/ynl/lib/nlspec.py | 4 +++- tools/net/ynl/lib/ynl.py | 6 +++--- 3 files changed, 8 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/netlink/genetlink-legacy.yaml b/Documentation/netlink/genetlink-legacy.yaml index b33541a51d6b..b5319cde9e17 100644 --- a/Documentation/netlink/genetlink-legacy.yaml +++ b/Documentation/netlink/genetlink-legacy.yaml @@ -122,6 +122,8 @@ properties: enum: [ u8, u16, u32, u64, s8, s16, s32, s64, string ] len: $ref: '#/$defs/len-or-define' + byte-order: + enum: [ little-endian, big-endian ] # End genetlink-legacy attribute-sets: diff --git a/tools/net/ynl/lib/nlspec.py b/tools/net/ynl/lib/nlspec.py index a0241add3839..c624cdfde223 100644 --- a/tools/net/ynl/lib/nlspec.py +++ b/tools/net/ynl/lib/nlspec.py @@ -226,11 +226,13 @@ class SpecStructMember(SpecElement): Represents a single struct member attribute. 
Attributes: - type string, type of the member attribute + type string, type of the member attribute + byte_order string or None for native byte order """ def __init__(self, family, yaml): super().__init__(family, yaml) self.type = yaml['type'] + self.byte_order = yaml.get('byte-order') class SpecStruct(SpecElement): diff --git a/tools/net/ynl/lib/ynl.py b/tools/net/ynl/lib/ynl.py index 6185ba27f2e7..39a2296c0003 100644 --- a/tools/net/ynl/lib/ynl.py +++ b/tools/net/ynl/lib/ynl.py @@ -124,7 +124,7 @@ class NlAttr: offset = 0 for m in members: # TODO: handle non-scalar members - format = self.get_format(m.type) + format = self.get_format(m.type, m.byte_order) decoded = format.unpack_from(self.raw, offset) offset += format.size value[m.name] = decoded[0] @@ -305,7 +305,7 @@ class GenlMsg: self.fixed_header_attrs = dict() for m in fixed_header_members: - format = NlAttr.get_format(m.type) + format = NlAttr.get_format(m.type, m.byte_order) decoded = format.unpack_from(nl_msg.raw, offset) offset += format.size self.fixed_header_attrs[m.name] = decoded[0] @@ -542,7 +542,7 @@ class YnlFamily(SpecFamily): fixed_header_members = self.consts[op.fixed_header].members for m in fixed_header_members: value = vals.pop(m.name) - format = NlAttr.get_format(m.type) + format = NlAttr.get_format(m.type, m.byte_order) msg += format.pack(value) for name, value in vals.items(): msg += self._add_attr(op.attr_set.name, name, value) -- cgit v1.2.3 From 7016eb738651ed1dfeef2bbf266bc7dac734067d Mon Sep 17 00:00:00 2001 From: Antoine Tenart Date: Tue, 23 May 2023 18:14:53 +0200 Subject: Documentation: net: net.core.txrehash is not specific to listening sockets The net.core.txrehash documentation mentions this knob is for listening sockets only, while sk_rethink_txhash can be called on SYN and RTO retransmits on all TCP sockets. Remove the listening socket part. 
Signed-off-by: Antoine Tenart Reviewed-by: Eric Dumazet Signed-off-by: Paolo Abeni --- Documentation/admin-guide/sysctl/net.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 466c560b0c30..4877563241f3 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -386,8 +386,8 @@ Default : 0 (for compatibility reasons) txrehash -------- -Controls default hash rethink behaviour on listening socket when SO_TXREHASH -option is set to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt). +Controls default hash rethink behaviour on socket when SO_TXREHASH option is set +to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt). If set to 1 (default), hash rethink is performed on listening socket. If set to 0, hash rethink is not performed. -- cgit v1.2.3 From 6d6bae63053d547e96fe4d2c9c8b4fc595bfc5ac Mon Sep 17 00:00:00 2001 From: Donald Hunter Date: Sat, 27 May 2023 14:31:04 +0100 Subject: doc: ynl: Add doc attr to struct members in genetlink-legacy spec Make it possible to document the meaning of struct member attributes in genetlink-legacy specs. Signed-off-by: Donald Hunter Signed-off-by: Jakub Kicinski --- Documentation/netlink/genetlink-legacy.yaml | 3 +++ 1 file changed, 3 insertions(+) (limited to 'Documentation') diff --git a/Documentation/netlink/genetlink-legacy.yaml b/Documentation/netlink/genetlink-legacy.yaml index b5319cde9e17..d8f132114308 100644 --- a/Documentation/netlink/genetlink-legacy.yaml +++ b/Documentation/netlink/genetlink-legacy.yaml @@ -124,6 +124,9 @@ properties: $ref: '#/$defs/len-or-define' byte-order: enum: [ little-endian, big-endian ] + doc: + description: Documentation for the struct member attribute. 
+ type: string # End genetlink-legacy attribute-sets: -- cgit v1.2.3 From 313a7a808ca8ca0fe08e2175eb145479bd86937e Mon Sep 17 00:00:00 2001 From: Donald Hunter Date: Sat, 27 May 2023 14:31:06 +0100 Subject: tools: ynl: Support enums in struct members in genetlink-legacy Support decoding scalars as enums in struct members for genetlink-legacy specs. Signed-off-by: Donald Hunter Signed-off-by: Jakub Kicinski --- Documentation/netlink/genetlink-legacy.yaml | 3 +++ tools/net/ynl/lib/nlspec.py | 2 ++ tools/net/ynl/lib/ynl.py | 6 +++++- 3 files changed, 10 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/netlink/genetlink-legacy.yaml b/Documentation/netlink/genetlink-legacy.yaml index d8f132114308..ac4350498f5e 100644 --- a/Documentation/netlink/genetlink-legacy.yaml +++ b/Documentation/netlink/genetlink-legacy.yaml @@ -127,6 +127,9 @@ properties: doc: description: Documentation for the struct member attribute. type: string + enum: + description: Name of the enum type used for the attribute. 
+ type: string # End genetlink-legacy attribute-sets: diff --git a/tools/net/ynl/lib/nlspec.py b/tools/net/ynl/lib/nlspec.py index c624cdfde223..ada22b073aa2 100644 --- a/tools/net/ynl/lib/nlspec.py +++ b/tools/net/ynl/lib/nlspec.py @@ -228,11 +228,13 @@ class SpecStructMember(SpecElement): Attributes: type string, type of the member attribute byte_order string or None for native byte order + enum string, name of the enum definition """ def __init__(self, family, yaml): super().__init__(family, yaml) self.type = yaml['type'] self.byte_order = yaml.get('byte-order') + self.enum = yaml.get('enum') class SpecStruct(SpecElement): diff --git a/tools/net/ynl/lib/ynl.py b/tools/net/ynl/lib/ynl.py index 85ee6a4bee72..0692293447ad 100644 --- a/tools/net/ynl/lib/ynl.py +++ b/tools/net/ynl/lib/ynl.py @@ -412,7 +412,11 @@ class YnlFamily(SpecFamily): def _decode_binary(self, attr, attr_spec): if attr_spec.struct_name: - decoded = attr.as_struct(self.consts[attr_spec.struct_name]) + members = self.consts[attr_spec.struct_name] + decoded = attr.as_struct(members) + for m in members: + if m.enum: + self._decode_enum(decoded, m) elif attr_spec.sub_type: decoded = attr.as_c_array(attr_spec.sub_type) else: -- cgit v1.2.3 From 93b230b549bcb4daed82d617f3f6f9d6d118befe Mon Sep 17 00:00:00 2001 From: Donald Hunter Date: Sat, 27 May 2023 14:31:07 +0100 Subject: netlink: specs: add ynl spec for ovs_flow Add a ynl specification for ovs_flow. This spec is sufficient to dump ovs flows. Some attrs are left as binary blobs because ynl doesn't support C arrays in struct definitions yet. 
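For readers following the series, the struct-member handling added so far (a per-member ``byte-order`` plus optional ``enum`` decoding) can be sketched in a few lines of Python. This is an illustration only, not the ynl implementation: ``decode_struct``, ``FragType`` and the member table are hypothetical names, and the layout merely mirrors the ``ovs-key-ipv4`` definition in the spec added by this patch.

```python
import struct
from enum import IntEnum


class FragType(IntEnum):
    """Hypothetical stand-in for the spec's ovs-frag-type enum."""
    NONE = 0
    FIRST = 1
    LATER = 2
    ANY = 255


# (member name, struct format code, byte order or None for native, enum or None)
# Assumed layout, modelled on the ovs-key-ipv4 struct in the spec.
OVS_KEY_IPV4 = [
    ("ipv4-src", "I", "big-endian", None),
    ("ipv4-dst", "I", "big-endian", None),
    ("ipv4-proto", "B", None, None),
    ("ipv4-tos", "B", None, None),
    ("ipv4-ttl", "B", None, None),
    ("ipv4-frag", "B", None, FragType),
]

# Map the spec's byte-order strings to struct-module prefixes.
BYTE_ORDER_PREFIX = {"big-endian": ">", "little-endian": "<"}


def decode_struct(raw, members):
    """Decode scalar members one by one, honouring byte order and enums."""
    offset = 0
    decoded = {}
    for name, code, byte_order, enum in members:
        # "=" means native byte order with standard sizes and no padding.
        fmt = struct.Struct(BYTE_ORDER_PREFIX.get(byte_order, "=") + code)
        (value,) = fmt.unpack_from(raw, offset)
        offset += fmt.size
        # An enum lookup raises ValueError for values the enum does not define.
        decoded[name] = enum(value) if enum else value
    return decoded


# 10.0.0.1 -> 10.0.0.2, protocol 6 (TCP), tos 0, ttl 64, first fragment
raw = bytes([10, 0, 0, 1, 10, 0, 0, 2, 6, 0, 64, 1])
decoded = decode_struct(raw, OVS_KEY_IPV4)
```

Members whose type is not a plain scalar (the C arrays mentioned above) do not fit this per-member scheme, which is why such attributes stay as binary blobs.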
Signed-off-by: Donald Hunter Signed-off-by: Jakub Kicinski --- Documentation/netlink/specs/ovs_flow.yaml | 831 ++++++++++++++++++++++++++++++ 1 file changed, 831 insertions(+) create mode 100644 Documentation/netlink/specs/ovs_flow.yaml (limited to 'Documentation') diff --git a/Documentation/netlink/specs/ovs_flow.yaml b/Documentation/netlink/specs/ovs_flow.yaml new file mode 100644 index 000000000000..3b0624c87074 --- /dev/null +++ b/Documentation/netlink/specs/ovs_flow.yaml @@ -0,0 +1,831 @@ +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) + +name: ovs_flow +version: 1 +protocol: genetlink-legacy + +doc: + OVS flow configuration over generic netlink. + +definitions: + - + name: ovs-header + type: struct + doc: | + Header for OVS Generic Netlink messages. + members: + - + name: dp-ifindex + type: u32 + doc: | + ifindex of local port for datapath (0 to make a request not specific + to a datapath). + - + name: ovs-flow-stats + type: struct + members: + - + name: n-packets + type: u64 + doc: Number of matched packets. + - + name: n-bytes + type: u64 + doc: Number of matched bytes. + - + name: ovs-key-mpls + type: struct + members: + - + name: mpls-lse + type: u32 + byte-order: big-endian + - + name: ovs-key-ipv4 + type: struct + members: + - + name: ipv4-src + type: u32 + byte-order: big-endian + - + name: ipv4-dst + type: u32 + byte-order: big-endian + - + name: ipv4-proto + type: u8 + - + name: ipv4-tos + type: u8 + - + name: ipv4-ttl + type: u8 + - + name: ipv4-frag + type: u8 + enum: ovs-frag-type + - + name: ovs-frag-type + type: enum + entries: + - + name: none + doc: Packet is not a fragment. + - + name: first + doc: Packet is a fragment with offset 0. + - + name: later + doc: Packet is a fragment with nonzero offset. 
+ - + name: any + value: 255 + - + name: ovs-key-tcp + type: struct + members: + - + name: tcp-src + type: u16 + byte-order: big-endian + - + name: tcp-dst + type: u16 + byte-order: big-endian + - + name: ovs-key-udp + type: struct + members: + - + name: udp-src + type: u16 + byte-order: big-endian + - + name: udp-dst + type: u16 + byte-order: big-endian + - + name: ovs-key-sctp + type: struct + members: + - + name: sctp-src + type: u16 + byte-order: big-endian + - + name: sctp-dst + type: u16 + byte-order: big-endian + - + name: ovs-key-icmp + type: struct + members: + - + name: icmp-type + type: u8 + - + name: icmp-code + type: u8 + - + name: ovs-key-ct-tuple-ipv4 + type: struct + members: + - + name: ipv4-src + type: u32 + byte-order: big-endian + - + name: ipv4-dst + type: u32 + byte-order: big-endian + - + name: src-port + type: u16 + byte-order: big-endian + - + name: dst-port + type: u16 + byte-order: big-endian + - + name: ipv4-proto + type: u8 + - + name: ovs-action-push-vlan + type: struct + members: + - + name: vlan_tpid + type: u16 + byte-order: big-endian + doc: Tag protocol identifier (TPID) to push. + - + name: vlan_tci + type: u16 + byte-order: big-endian + doc: Tag control identifier (TCI) to push. + - + name: ovs-ufid-flags + type: flags + entries: + - omit-key + - omit-mask + - omit-actions + - + name: ovs-action-hash + type: struct + members: + - + name: hash-algorithm + type: u32 + doc: Algorithm used to compute hash prior to recirculation. + - + name: hash-basis + type: u32 + doc: Basis used for computing hash. + - + name: ovs-hash-alg + type: enum + doc: | + Data path hash algorithm for computing Datapath hash. The algorithm type only specifies + the fields in a flow will be used as part of the hash. Each datapath is free to use its + own hash algorithm. The hash value will be opaque to the user space daemon. 
+ entries: + - ovs-hash-alg-l4 + + - + name: ovs-action-push-mpls + type: struct + members: + - + name: lse + type: u32 + byte-order: big-endian + doc: | + MPLS label stack entry to push + - + name: ethertype + type: u32 + byte-order: big-endian + doc: | + Ethertype to set in the encapsulating ethernet frame. The only values + ethertype should ever be given are ETH_P_MPLS_UC and ETH_P_MPLS_MC, + indicating MPLS unicast or multicast. Other are rejected. + - + name: ovs-action-add-mpls + type: struct + members: + - + name: lse + type: u32 + byte-order: big-endian + doc: | + MPLS label stack entry to push + - + name: ethertype + type: u32 + byte-order: big-endian + doc: | + Ethertype to set in the encapsulating ethernet frame. The only values + ethertype should ever be given are ETH_P_MPLS_UC and ETH_P_MPLS_MC, + indicating MPLS unicast or multicast. Other are rejected. + - + name: tun-flags + type: u16 + doc: | + MPLS tunnel attributes. + - + name: ct-state-flags + type: flags + entries: + - + name: new + doc: Beginning of a new connection. + - + name: established + doc: Part of an existing connenction + - + name: related + doc: Related to an existing connection. + - + name: reply-dir + doc: Flow is in the reply direction. + - + name: invalid + doc: Could not track the connection. + - + name: tracked + doc: Conntrack has occurred. + - + name: src-nat + doc: Packet's source address/port was mangled by NAT. + - + name: dst-nat + doc: Packet's destination address/port was mangled by NAT. + +attribute-sets: + - + name: flow-attrs + attributes: + - + name: key + type: nest + nested-attributes: key-attrs + doc: | + Nested attributes specifying the flow key. Always present in + notifications. Required for all requests (except dumps). + - + name: actions + type: nest + nested-attributes: action-attrs + doc: | + Nested attributes specifying the actions to take for packets that + match the key. Always present in notifications. 
Required for + OVS_FLOW_CMD_NEW requests, optional for OVS_FLOW_CMD_SET requests. An + OVS_FLOW_CMD_SET without OVS_FLOW_ATTR_ACTIONS will not modify the + actions. To clear the actions, an OVS_FLOW_ATTR_ACTIONS without any + nested attributes must be given. + - + name: stats + type: binary + struct: ovs-flow-stats + doc: | + Statistics for this flow. Present in notifications if the stats would + be nonzero. Ignored in requests. + - + name: tcp-flags + type: u8 + doc: | + An 8-bit value giving the ORed value of all of the TCP flags seen on + packets in this flow. Only present in notifications for TCP flows, and + only if it would be nonzero. Ignored in requests. + - + name: used + type: u64 + doc: | + A 64-bit integer giving the time, in milliseconds on the system + monotonic clock, at which a packet was last processed for this + flow. Only present in notifications if a packet has been processed for + this flow. Ignored in requests. + - + name: clear + type: flag + doc: | + If present in a OVS_FLOW_CMD_SET request, clears the last-used time, + accumulated TCP flags, and statistics for this flow. Otherwise + ignored in requests. Never present in notifications. + - + name: mask + type: nest + nested-attributes: key-attrs + doc: | + Nested attributes specifying the mask bits for wildcarded flow + match. Mask bit value '1' specifies exact match with corresponding + flow key bit, while mask bit value '0' specifies a wildcarded + match. Omitting attribute is treated as wildcarding all corresponding + fields. Optional for all requests. If not present, all flow key bits + are exact match bits. + - + name: probe + type: binary + doc: | + Flow operation is a feature probe, error logging should be suppressed. + - + name: ufid + type: binary + doc: | + A value between 1-16 octets specifying a unique identifier for the + flow. Causes the flow to be indexed by this value rather than the + value of the OVS_FLOW_ATTR_KEY attribute. Optional for all + requests. 
Present in notifications if the flow was created with this + attribute. + - + name: ufid-flags + type: u32 + enum: ovs-ufid-flags + doc: | + A 32-bit value of ORed flags that provide alternative semantics for + flow installation and retrieval. Optional for all requests. + - + name: pad + type: binary + + - + name: key-attrs + attributes: + - + name: encap + type: nest + nested-attributes: key-attrs + - + name: priority + type: u32 + - + name: in-port + type: u32 + - + name: ethernet + type: binary + doc: struct ovs_key_ethernet + - + name: vlan + type: u16 + byte-order: big-endian + - + name: ethertype + type: u16 + byte-order: big-endian + - + name: ipv4 + type: binary + struct: ovs-key-ipv4 + - + name: ipv6 + type: binary + doc: struct ovs_key_ipv6 + - + name: tcp + type: binary + struct: ovs-key-tcp + - + name: udp + type: binary + struct: ovs-key-udp + - + name: icmp + type: binary + struct: ovs-key-icmp + - + name: icmpv6 + type: binary + struct: ovs-key-icmp + - + name: arp + type: binary + doc: struct ovs_key_arp + - + name: nd + type: binary + doc: struct ovs_key_nd + - + name: skb-mark + type: u32 + - + name: tunnel + type: nest + nested-attributes: tunnel-key-attrs + - + name: sctp + type: binary + struct: ovs-key-sctp + - + name: tcp-flags + type: u16 + byte-order: big-endian + - + name: dp-hash + type: u32 + doc: Value 0 indicates the hash is not computed by the datapath. 
+ - + name: recirc-id + type: u32 + - + name: mpls + type: binary + struct: ovs-key-mpls + - + name: ct-state + type: u32 + enum: ct-state-flags + enum-as-flags: true + - + name: ct-zone + type: u16 + doc: connection tracking zone + - + name: ct-mark + type: u32 + doc: connection tracking mark + - + name: ct-labels + type: binary + doc: 16-octet connection tracking label + - + name: ct-orig-tuple-ipv4 + type: binary + struct: ovs-key-ct-tuple-ipv4 + - + name: ct-orig-tuple-ipv6 + type: binary + doc: struct ovs_key_ct_tuple_ipv6 + - + name: nsh + type: nest + nested-attributes: ovs-nsh-key-attrs + - + name: packet-type + type: u32 + byte-order: big-endian + doc: Should not be sent to the kernel + - + name: nd-extensions + type: binary + doc: Should not be sent to the kernel + - + name: tunnel-info + type: binary + doc: struct ip_tunnel_info + - + name: ipv6-exthdrs + type: binary + doc: struct ovs_key_ipv6_exthdr + - + name: action-attrs + attributes: + - + name: output + type: u32 + doc: ovs port number in datapath + - + name: userspace + type: nest + nested-attributes: userspace-attrs + - + name: set + type: nest + nested-attributes: key-attrs + doc: Replaces the contents of an existing header. The single nested attribute specifies a header to modify and its value. + - + name: push-vlan + type: binary + struct: ovs-action-push-vlan + doc: Push a new outermost 802.1Q or 802.1ad header onto the packet. + - + name: pop-vlan + type: flag + doc: Pop the outermost 802.1Q or 802.1ad header from the packet. + - + name: sample + type: nest + nested-attributes: sample-attrs + doc: | + Probabilistically executes actions, as specified in the nested attributes. + - + name: recirc + type: u32 + doc: recirc id + - + name: hash + type: binary + struct: ovs-action-hash + - + name: push-mpls + type: binary + struct: ovs-action-push-mpls + doc: | + Push a new MPLS label stack entry onto the top of the packets MPLS + label stack. 
Set the ethertype of the encapsulating frame to either + ETH_P_MPLS_UC or ETH_P_MPLS_MC to indicate the new packet contents. + - + name: pop-mpls + type: u16 + byte-order: big-endian + doc: ethertype + - + name: set-masked + type: nest + nested-attributes: key-attrs + doc: | + Replaces the contents of an existing header. A nested attribute + specifies a header to modify, its value, and a mask. For every bit set + in the mask, the corresponding bit value is copied from the value to + the packet header field, rest of the bits are left unchanged. The + non-masked value bits must be passed in as zeroes. Masking is not + supported for the OVS_KEY_ATTR_TUNNEL attribute. + - + name: ct + type: nest + nested-attributes: ct-attrs + doc: | + Track the connection. Populate the conntrack-related entries + in the flow key. + - + name: trunc + type: u32 + doc: struct ovs_action_trunc is a u32 max length + - + name: push-eth + type: binary + doc: struct ovs_action_push_eth + - + name: pop-eth + type: flag + - + name: ct-clear + type: flag + - + name: push-nsh + type: nest + nested-attributes: ovs-nsh-key-attrs + doc: | + Push NSH header to the packet. + - + name: pop-nsh + type: flag + doc: | + Pop the outermost NSH header off the packet. + - + name: meter + type: u32 + doc: | + Run packet through a meter, which may drop the packet, or modify the + packet (e.g., change the DSCP field) + - + name: clone + type: nest + nested-attributes: action-attrs + doc: | + Make a copy of the packet and execute a list of actions without + affecting the original packet and key. + - + name: check-pkt-len + type: nest + nested-attributes: check-pkt-len-attrs + doc: | + Check the packet length and execute a set of actions if greater than + the specified packet length, else execute another set of actions. 
+ - + name: add-mpls + type: binary + struct: ovs-action-add-mpls + doc: | + Push a new MPLS label stack entry at the start of the packet or at the + start of the l3 header depending on the value of l3 tunnel flag in the + tun_flags field of this OVS_ACTION_ATTR_ADD_MPLS argument. + - + name: dec-ttl + type: nest + nested-attributes: dec-ttl-attrs + - + name: tunnel-key-attrs + attributes: + - + name: id + type: u64 + byte-order: big-endian + value: 0 + - + name: ipv4-src + type: u32 + byte-order: big-endian + - + name: ipv4-dst + type: u32 + byte-order: big-endian + - + name: tos + type: u8 + - + name: ttl + type: u8 + - + name: dont-fragment + type: flag + - + name: csum + type: flag + - + name: oam + type: flag + - + name: geneve-opts + type: binary + sub-type: u32 + - + name: tp-src + type: u16 + byte-order: big-endian + - + name: tp-dst + type: u16 + byte-order: big-endian + - + name: vxlan-opts + type: nest + nested-attributes: vxlan-ext-attrs + - + name: ipv6-src + type: binary + doc: | + struct in6_addr source IPv6 address + - + name: ipv6-dst + type: binary + doc: | + struct in6_addr destination IPv6 address + - + name: pad + type: binary + - + name: erspan-opts + type: binary + doc: | + struct erspan_metadata + - + name: ipv4-info-bridge + type: flag + - + name: check-pkt-len-attrs + attributes: + - + name: pkt-len + type: u16 + - + name: actions-if-greater + type: nest + nested-attributes: action-attrs + - + name: actions-if-less-equal + type: nest + nested-attributes: action-attrs + - + name: sample-attrs + attributes: + - + name: probability + type: u32 + - + name: actions + type: nest + nested-attributes: action-attrs + - + name: userspace-attrs + attributes: + - + name: pid + type: u32 + - + name: userdata + type: binary + - + name: egress-tun-port + type: u32 + - + name: actions + type: flag + - + name: ovs-nsh-key-attrs + attributes: + - + name: base + type: binary + - + name: md1 + type: binary + - + name: md2 + type: binary + - + name: ct-attrs + 
attributes: + - + name: commit + type: flag + - + name: zone + type: u16 + - + name: mark + type: binary + - + name: labels + type: binary + - + name: helper + type: string + - + name: nat + type: nest + nested-attributes: nat-attrs + - + name: force-commit + type: flag + - + name: eventmask + type: u32 + - + name: timeout + type: string + - + name: nat-attrs + attributes: + - + name: src + type: binary + - + name: dst + type: binary + - + name: ip-min + type: binary + - + name: ip-max + type: binary + - + name: proto-min + type: binary + - + name: proto-max + type: binary + - + name: persistent + type: binary + - + name: proto-hash + type: binary + - + name: proto-random + type: binary + - + name: dec-ttl-attrs + attributes: + - + name: action + type: nest + nested-attributes: action-attrs + - + name: vxlan-ext-attrs + attributes: + - + name: gbp + type: u32 + +operations: + fixed-header: ovs-header + list: + - + name: flow-get + doc: Get / dump OVS flow configuration and state + value: 3 + attribute-set: flow-attrs + do: &flow-get-op + request: + attributes: + - dp-ifindex + - key + - ufid + - ufid-flags + reply: + attributes: + - dp-ifindex + - key + - ufid + - mask + - stats + - actions + dump: *flow-get-op + +mcast-groups: + list: + - + name: ovs_flow -- cgit v1.2.3 From 9229a9483d80ba1cc7a75a552afff3e5afdf99e0 Mon Sep 17 00:00:00 2001 From: Alexis Lothoré Date: Mon, 29 May 2023 10:02:40 +0200 Subject: dt-bindings: net: dsa: marvell: add MV88E6361 switch to compatibility list MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Marvell MV88E6361 is an 8-port switch derived from the 88E6393X/88E9193X/88E6191X switches family. 
Since its functional behavior is very close to switches from this family, it can benefit from existing drivers for this family, so add it to the list of compatible switches Signed-off-by: Alexis Lothoré Reviewed-by: Andrew Lunn Acked-by: Conor Dooley Reviewed-by: Florian Fainelli Signed-off-by: Jakub Kicinski --- Documentation/devicetree/bindings/net/dsa/marvell.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/dsa/marvell.txt b/Documentation/devicetree/bindings/net/dsa/marvell.txt index 2363b412410c..33726134f5c9 100644 --- a/Documentation/devicetree/bindings/net/dsa/marvell.txt +++ b/Documentation/devicetree/bindings/net/dsa/marvell.txt @@ -20,7 +20,7 @@ which is at a different MDIO base address in different switch families. 6171, 6172, 6175, 6176, 6185, 6240, 6320, 6321, 6341, 6350, 6351, 6352 - "marvell,mv88e6190" : Switch has base address 0x00. Use with models: - 6190, 6190X, 6191, 6290, 6390, 6390X + 6163, 6190, 6190X, 6191, 6290, 6390, 6390X - "marvell,mv88e6250" : Switch has base address 0x08 or 0x18. Use with model: 6220, 6250 -- cgit v1.2.3 From 8aa2fd7b66980ecd2e45e95af61cf7eafede1211 Mon Sep 17 00:00:00 2001 From: Christian Marangi Date: Mon, 29 May 2023 18:32:33 +0200 Subject: Documentation: leds: leds-class: Document new Hardware driven LEDs APIs Document new Hardware driven LEDs APIs. Some LEDs can be programmed to be driven by hardware. This is not limited to blink but also to turn off or on autonomously. To support this feature, a LED needs to implement various additional ops and needs to declare specific support for the supported triggers. Add documentation for each required value and API to make hw control possible and implementable by both LEDs and triggers. Signed-off-by: Christian Marangi Reviewed-by: Andrew Lunn Signed-off-by: David S. 
Miller --- Documentation/leds/leds-class.rst | 81 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 81 insertions(+) (limited to 'Documentation') diff --git a/Documentation/leds/leds-class.rst b/Documentation/leds/leds-class.rst index cd155ead8703..5db620ed27aa 100644 --- a/Documentation/leds/leds-class.rst +++ b/Documentation/leds/leds-class.rst @@ -169,6 +169,87 @@ Setting the brightness to zero with brightness_set() callback function should completely turn off the LED and cancel the previously programmed hardware blinking function, if any. +Hardware driven LEDs +==================== + +Some LEDs can be programmed to be driven by hardware. This is not +limited to blink but also to turn off or on autonomously. +To support this feature, a LED needs to implement various additional +ops and needs to declare specific support for the supported triggers. + +With hw control we refer to the LED driven by hardware. + +LED driver must define the following value to support hw control: + + - hw_control_trigger: + unique trigger name supported by the LED in hw control + mode. + +LED driver must implement the following API to support hw control: + - hw_control_is_supported: + check if the flags passed by the supported trigger can + be parsed and activate hw control on the LED. + + Return 0 if the passed flags mask is supported and + can be set with hw_control_set(). + + If the passed flags mask is not supported -EOPNOTSUPP + must be returned, the LED trigger will use software + fallback in this case. + + Return a negative error in case of any other error like + device not ready or timeouts. + + - hw_control_set: + activate hw control. LED driver will use the provided + flags passed from the supported trigger, parse them to + a set of mode and setup the LED to be driven by hardware + following the requested modes. + + Set LED_OFF via the brightness_set to deactivate hw control. + + Return 0 on success, a negative error number on failing to + apply flags. 
+ + - hw_control_get: + get active modes from a LED already in hw control, parse + them and set in flags the current active flags for the + supported trigger. + + Return 0 on success, a negative error number on failing + parsing the initial mode. + Error from this function is NOT FATAL as the device may + be in a not supported initial state by the attached LED + trigger. + + - hw_control_get_device: + return the device associated with the LED driver in + hw control. A trigger might use this to match the + returned device from this function with a configured + device for the trigger as the source for blinking + events and correctly enable hw control. + (example a netdev trigger configured to blink for a + particular dev match the returned dev from get_device + to set hw control) + + Returns a pointer to a struct device or NULL if nothing + is currently attached. + +LED driver can activate additional modes by default to workaround the +impossibility of supporting each different mode on the supported trigger. +Examples are hardcoding the blink speed to a set interval, enable special +feature like bypassing blink if some requirements are not met. + +A trigger should first check if the hw control API are supported by the LED +driver and check if the trigger is supported to verify if hw control is possible, +use hw_control_is_supported to check if the flags are supported and only at +the end use hw_control_set to activate hw control. + +A trigger can use hw_control_get to check if a LED is already in hw control +and init their flags. + +When the LED is in hw control, no software blink is possible and doing so +will effectively disable hw control. Known Issues ============ -- cgit v1.2.3 From bd415f6c748ec3ca0017f9a6f23a5c02900eb6d2 Mon Sep 17 00:00:00 2001 From: Oleksij Rempel Date: Wed, 31 May 2023 12:21:12 +0200 Subject: dt-bindings: net: pse-pd: Allow -N suffix for ethernet-pse node names Extend the pattern matching for PSE-PD controller nodes to allow -N suffixes. 
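As a quick check of what the new node-name pattern accepts, here is a small Python sketch. The regular expression is copied from the pse-controller.yaml hunk of this patch; the helper name ``is_valid_node_name`` is made up for illustration, and Python's ``re`` module is assumed to behave like the schema validator for this simple pattern.

```python
import re

# Pattern copied verbatim from the updated $nodename property.
NODE_NAME = re.compile(r"^ethernet-pse(@.*|-([0-9]|[1-9][0-9]+))?$")


def is_valid_node_name(name):
    """Return True if the node name matches the updated pattern."""
    return NODE_NAME.search(name) is not None


# A few candidate node names and whether the pattern accepts them.
results = {
    name: is_valid_node_name(name)
    for name in (
        "ethernet-pse",
        "ethernet-pse@1",
        "ethernet-pse-0",
        "ethernet-pse-12",
        "ethernet-pse-01",
        "ethernet-pse-",
        "ethernet-pse1",
    )
}
```

With this pattern, ``ethernet-pse-0`` and ``ethernet-pse-12`` are accepted, while a zero-padded suffix such as ``ethernet-pse-01`` or a bare trailing hyphen is rejected.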
This enables the use of multiple "ethernet-pse" nodes without the need for a
"reg" property.

Signed-off-by: Oleksij Rempel
Reviewed-by: Krzysztof Kozlowski
Signed-off-by: Jakub Kicinski
---
 Documentation/devicetree/bindings/net/pse-pd/pse-controller.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/net/pse-pd/pse-controller.yaml b/Documentation/devicetree/bindings/net/pse-pd/pse-controller.yaml
index b110abb42597..2d382faca0e6 100644
--- a/Documentation/devicetree/bindings/net/pse-pd/pse-controller.yaml
+++ b/Documentation/devicetree/bindings/net/pse-pd/pse-controller.yaml
@@ -16,7 +16,7 @@ maintainers:

 properties:
   $nodename:
-    pattern: "^ethernet-pse(@.*)?$"
+    pattern: "^ethernet-pse(@.*|-([0-9]|[1-9][0-9]+))?$"

   "#pse-cells":
     description:
--
cgit v1.2.3

From 86878f14d71af89149a955122afd8b7af1ee9bf2 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski
Date: Mon, 5 Jun 2023 12:01:06 -0700
Subject: tools: ynl: user space helpers

Add "fixed" part of the user space Netlink Spec-based library.
This will get linked with the protocol implementations to form
a full API.
Acked-by: Willem de Bruijn
Signed-off-by: Jakub Kicinski
---
 .../userspace-api/netlink/intro-specs.rst |  79 ++
 tools/net/ynl/Makefile                    |  19 +
 tools/net/ynl/generated/Makefile          |  45 +
 tools/net/ynl/lib/Makefile                |  28 +
 tools/net/ynl/lib/ynl.c                   | 901 +++++++++++++++++++++
 tools/net/ynl/lib/ynl.h                   | 237 ++++++
 tools/net/ynl/ynl-regen.sh                |   2 +-
 7 files changed, 1310 insertions(+), 1 deletion(-)
 create mode 100644 tools/net/ynl/Makefile
 create mode 100644 tools/net/ynl/generated/Makefile
 create mode 100644 tools/net/ynl/lib/Makefile
 create mode 100644 tools/net/ynl/lib/ynl.c
 create mode 100644 tools/net/ynl/lib/ynl.h

(limited to 'Documentation')

diff --git a/Documentation/userspace-api/netlink/intro-specs.rst b/Documentation/userspace-api/netlink/intro-specs.rst
index a3b847eafff7..bada89699455 100644
--- a/Documentation/userspace-api/netlink/intro-specs.rst
+++ b/Documentation/userspace-api/netlink/intro-specs.rst
@@ -78,3 +78,82 @@ to see other examples.
 The code generation itself is performed by ``tools/net/ynl/ynl-gen-c.py``
 but it takes a few arguments so calling it directly for each file
 quickly becomes tedious.
+
+YNL lib
+=======
+
+``tools/net/ynl/lib/`` contains an implementation of a C library
+(based on libmnl) which integrates with code generated by
+``tools/net/ynl/ynl-gen-c.py`` to create easy-to-use netlink wrappers.
+
+YNL basics
+----------
+
+The YNL library consists of two parts: the generic code (functions
+prefixed with ``ynl_``) and per-family auto-generated code (prefixed
+with the name of the family).
+
+To create a YNL socket, call ynl_sock_create(), passing the family
+struct (family structs are exported by the auto-generated code).
+ynl_sock_destroy() closes the socket.
+
+YNL requests
+------------
+
+Steps for issuing YNL requests are best explained with an example.
+All the functions and types in this example come from the auto-generated
+code (for the netdev family in this case):
+
+.. code-block:: c
+
+   // 0. Request and response pointers
+   struct netdev_dev_get_req *req;
+   struct netdev_dev_get_rsp *d;
+
+   // 1. Allocate a request
+   req = netdev_dev_get_req_alloc();
+   // 2. Set request parameters (as needed)
+   netdev_dev_get_req_set_ifindex(req, ifindex);
+
+   // 3. Issue the request
+   d = netdev_dev_get(ys, req);
+   // 4. Free the request arguments
+   netdev_dev_get_req_free(req);
+   // 5. Error check (the return value from step 3)
+   if (!d) {
+       // 6. Print the YNL-generated error
+       fprintf(stderr, "YNL: %s\n", ys->err.msg);
+       return -1;
+   }
+
+   // ... do stuff with the response @d
+
+   // 7. Free response
+   netdev_dev_get_rsp_free(d);
+
+YNL dumps
+---------
+
+Performing dumps follows a similar pattern to requests.
+Dumps return a list of objects terminated by a special marker,
+or NULL on error. Use ``ynl_dump_foreach()`` to iterate over
+the result.
+
+YNL notifications
+-----------------
+
+YNL lib supports using the same socket for notifications and
+requests. If notifications arrive during processing of a request
+they are queued internally and can be retrieved at a later time.
+
+To subscribe to notifications use ``ynl_subscribe()``.
+The notifications have to be read out from the socket;
+``ynl_socket_get_fd()`` returns the underlying socket fd which can
+be plugged into an appropriate asynchronous I/O API such as ``poll``
+or ``select``.
+
+Notifications can be retrieved using ``ynl_ntf_dequeue()`` and have
+to be freed using ``ynl_ntf_free()``. Since we don't know the notification
+type upfront the notifications are returned as ``struct ynl_ntf_base_type *``
+and the user is expected to cast them to the appropriate full type based
+on the ``cmd`` member.
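The dump iteration documented in the hunk above can be tried without libmnl or any generated code. The following is a minimal sketch, not the library's implementation: the list header, the ``YNL_LIST_END`` sentinel and the two ``ynl_dump_obj_*`` helpers are copied from ``ynl.h`` in this patch, while ``struct demo_dev``, ``struct demo_dev_list``, ``demo_dump()``, ``demo_sum_ifindex()`` and ``demo_dump_free()`` are hypothetical stand-ins for what the code generator would emit for a real family. The macro is spelled with ``__typeof__`` rather than ``typeof`` so that it also builds in strict ISO C mode; that spelling is a choice made for this sketch only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Generic list header and end-of-list sentinel, as declared in ynl.h. */
struct ynl_dump_list_type {
	struct ynl_dump_list_type *next;
	unsigned char data[] __attribute__ ((aligned (8)));
};
struct ynl_dump_list_type *YNL_LIST_END = (void *)(0xb4d123);

/* Helpers copied from ynl.h: both map an object pointer back to its header. */
static inline bool ynl_dump_obj_is_last(void *obj)
{
	unsigned long uptr = (unsigned long)obj;

	uptr -= offsetof(struct ynl_dump_list_type, data);
	return uptr == (unsigned long)YNL_LIST_END;
}

static inline void *ynl_dump_obj_next(void *obj)
{
	unsigned long uptr = (unsigned long)obj;
	struct ynl_dump_list_type *list;

	uptr -= offsetof(struct ynl_dump_list_type, data);
	list = (void *)uptr;
	uptr = (unsigned long)list->next;
	uptr += offsetof(struct ynl_dump_list_type, data);

	return (void *)uptr;
}

#define ynl_dump_foreach(dump, iter)				\
	for (__typeof__(dump->obj) *iter = &dump->obj;		\
	     !ynl_dump_obj_is_last(iter);			\
	     iter = ynl_dump_obj_next(iter))

/* Hypothetical per-family types; real ones come from the code generator. */
struct demo_dev {
	int ifindex;
};

struct demo_dev_list {
	struct demo_dev_list *next;
	struct demo_dev obj __attribute__ ((aligned (8)));
};

/* The helpers only work if obj sits where data[] sits in the generic header. */
_Static_assert(offsetof(struct demo_dev_list, obj) ==
	       offsetof(struct ynl_dump_list_type, data),
	       "obj must overlay data[]");

/* Hypothetical stand-in for a generated dump call: builds a list of n
 * objects (ifindex 1..n) and terminates it with YNL_LIST_END.
 */
static struct demo_dev_list *demo_dump(int n)
{
	struct demo_dev_list *first = NULL, *last = NULL;

	for (int i = 0; i < n; i++) {
		struct demo_dev_list *obj = calloc(1, sizeof(*obj));

		assert(obj);
		obj->obj.ifindex = i + 1;
		if (!first)
			first = obj;
		if (last)
			last->next = obj;
		last = obj;
	}
	if (!first)
		return (void *)YNL_LIST_END;
	last->next = (void *)YNL_LIST_END;
	return first;
}

/* Walk the list with ynl_dump_foreach() and add up the ifindex values. */
static int demo_sum_ifindex(struct demo_dev_list *devs)
{
	int sum = 0;

	ynl_dump_foreach(devs, d)
		sum += d->ifindex;
	return sum;
}

/* Free every element; the sentinel is compared against, never dereferenced. */
static void demo_dump_free(struct demo_dev_list *devs)
{
	while (devs != (void *)YNL_LIST_END) {
		struct demo_dev_list *next = devs->next;

		free(devs);
		devs = next;
	}
}
```

A caller would build the list with ``demo_dump(3)``, pass it to ``demo_sum_ifindex()`` and release it with ``demo_dump_free()``. The point of the sentinel is that the iterator can tell "end of list" apart from a NULL return, which the documentation reserves for errors.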
diff --git a/tools/net/ynl/Makefile b/tools/net/ynl/Makefile new file mode 100644 index 000000000000..d664b36deb5b --- /dev/null +++ b/tools/net/ynl/Makefile @@ -0,0 +1,19 @@ +# SPDX-License-Identifier: GPL-2.0 + +SUBDIRS = lib generated samples + +all: $(SUBDIRS) + +$(SUBDIRS): + @if [ -f "$@/Makefile" ] ; then \ + $(MAKE) -C $@ ; \ + fi + +clean hardclean: + @for dir in $(SUBDIRS) ; do \ + if [ -f "$$dir/Makefile" ] ; then \ + $(MAKE) -C $$dir $@; \ + fi \ + done + +.PHONY: clean all $(SUBDIRS) diff --git a/tools/net/ynl/generated/Makefile b/tools/net/ynl/generated/Makefile new file mode 100644 index 000000000000..9a09e581906e --- /dev/null +++ b/tools/net/ynl/generated/Makefile @@ -0,0 +1,45 @@ +# SPDX-License-Identifier: GPL-2.0 + +CC=gcc +CFLAGS=-std=gnu11 -O2 -W -Wall -Wextra -Wno-unused-parameter -Wshadow \ + -I../lib/ +ifeq ("$(DEBUG)","1") + CFLAGS += -g -fsanitize=address -fsanitize=leak -static-libasan +endif + +TOOL:=../ynl-gen-c.py + +GENS:= +SRCS=$(patsubst %,%-user.c,${GENS}) +HDRS=$(patsubst %,%-user.h,${GENS}) +OBJS=$(patsubst %,%-user.o,${GENS}) + +all: protos.a $(HDRS) $(SRCS) $(KHDRS) $(KSRCS) $(UAPI) regen + +protos.a: $(OBJS) + @echo -e "\tAR $@" + @ar rcs $@ $(OBJS) + +%-user.h: ../../../../Documentation/netlink/specs/%.yaml $(TOOL) + @echo -e "\tGEN $@" + @$(TOOL) --mode user --header --spec $< > $@ + +%-user.c: ../../../../Documentation/netlink/specs/%.yaml $(TOOL) + @echo -e "\tGEN $@" + @$(TOOL) --mode user --source --spec $< > $@ + +%-user.o: %-user.c %-user.h + @echo -e "\tCC $@" + @$(COMPILE.c) -c -o $@ $< + +clean: + rm -f *.o + +hardclean: clean + rm -f *.c *.h *.a + +regen: + @../ynl-regen.sh + +.PHONY: all clean hardclean regen +.DEFAULT_GOAL: all diff --git a/tools/net/ynl/lib/Makefile b/tools/net/ynl/lib/Makefile new file mode 100644 index 000000000000..d2e50fd0a52d --- /dev/null +++ b/tools/net/ynl/lib/Makefile @@ -0,0 +1,28 @@ +# SPDX-License-Identifier: GPL-2.0 + +CC=gcc +CFLAGS=-std=gnu11 -O2 -W -Wall -Wextra 
-Wno-unused-parameter -Wshadow +ifeq ("$(DEBUG)","1") + CFLAGS += -g -fsanitize=address -fsanitize=leak -static-libasan +endif + +SRCS=$(wildcard *.c) +OBJS=$(patsubst %.c,%.o,${SRCS}) + +include $(wildcard *.d) + +all: ynl.a + +ynl.a: $(OBJS) + ar rcs $@ $(OBJS) +clean: + rm -f *.o *.d *~ + +hardclean: clean + rm -f *.a + +%.o: %.c + $(COMPILE.c) -MMD -c -o $@ $< + +.PHONY: all clean +.DEFAULT_GOAL=all diff --git a/tools/net/ynl/lib/ynl.c b/tools/net/ynl/lib/ynl.c new file mode 100644 index 000000000000..514e0d69e731 --- /dev/null +++ b/tools/net/ynl/lib/ynl.c @@ -0,0 +1,901 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause +#include +#include +#include +#include +#include + +#include +#include + +#include "ynl.h" + +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof(*arr)) + +#define __yerr_msg(yse, _msg...) \ + ({ \ + struct ynl_error *_yse = (yse); \ + \ + if (_yse) { \ + snprintf(_yse->msg, sizeof(_yse->msg) - 1, _msg); \ + _yse->msg[sizeof(_yse->msg) - 1] = 0; \ + } \ + }) + +#define __yerr_code(yse, _code...) \ + ({ \ + struct ynl_error *_yse = (yse); \ + \ + if (_yse) { \ + _yse->code = _code; \ + } \ + }) + +#define __yerr(yse, _code, _msg...) \ + ({ \ + __yerr_msg(yse, _msg); \ + __yerr_code(yse, _code); \ + }) + +#define __perr(yse, _msg) __yerr(yse, errno, _msg) + +#define yerr_msg(_ys, _msg...) __yerr_msg(&(_ys)->err, _msg) +#define yerr(_ys, _code, _msg...) 
__yerr(&(_ys)->err, _code, _msg) +#define perr(_ys, _msg) __yerr(&(_ys)->err, errno, _msg) + +/* -- Netlink boiler plate */ +static int +ynl_err_walk_report_one(struct ynl_policy_nest *policy, unsigned int type, + char *str, int str_sz, int *n) +{ + if (!policy) { + if (*n < str_sz) + *n += snprintf(str, str_sz, "!policy"); + return 1; + } + + if (type > policy->max_attr) { + if (*n < str_sz) + *n += snprintf(str, str_sz, "!oob"); + return 1; + } + + if (!policy->table[type].name) { + if (*n < str_sz) + *n += snprintf(str, str_sz, "!name"); + return 1; + } + + if (*n < str_sz) + *n += snprintf(str, str_sz - *n, + ".%s", policy->table[type].name); + return 0; +} + +static int +ynl_err_walk(struct ynl_sock *ys, void *start, void *end, unsigned int off, + struct ynl_policy_nest *policy, char *str, int str_sz, + struct ynl_policy_nest **nest_pol) +{ + unsigned int astart_off, aend_off; + const struct nlattr *attr; + unsigned int data_len; + unsigned int type; + bool found = false; + int n = 0; + + if (!policy) { + if (n < str_sz) + n += snprintf(str, str_sz, "!policy"); + return n; + } + + data_len = end - start; + + mnl_attr_for_each_payload(start, data_len) { + astart_off = (char *)attr - (char *)start; + aend_off = astart_off + mnl_attr_get_payload_len(attr); + if (aend_off <= off) + continue; + + found = true; + break; + } + if (!found) + return 0; + + off -= astart_off; + + type = mnl_attr_get_type(attr); + + if (ynl_err_walk_report_one(policy, type, str, str_sz, &n)) + return n; + + if (!off) { + if (nest_pol) + *nest_pol = policy->table[type].nest; + return n; + } + + if (!policy->table[type].nest) { + if (n < str_sz) + n += snprintf(str, str_sz, "!nest"); + return n; + } + + off -= sizeof(struct nlattr); + start = mnl_attr_get_payload(attr); + end = start + mnl_attr_get_payload_len(attr); + + return n + ynl_err_walk(ys, start, end, off, policy->table[type].nest, + &str[n], str_sz - n, nest_pol); +} + +#define NLMSGERR_ATTR_MISS_TYPE (NLMSGERR_ATTR_POLICY + 1) 
+#define NLMSGERR_ATTR_MISS_NEST (NLMSGERR_ATTR_POLICY + 2) +#define NLMSGERR_ATTR_MAX (NLMSGERR_ATTR_MAX + 2) + +static int +ynl_ext_ack_check(struct ynl_sock *ys, const struct nlmsghdr *nlh, + unsigned int hlen) +{ + const struct nlattr *tb[NLMSGERR_ATTR_MAX + 1] = {}; + char miss_attr[sizeof(ys->err.msg)]; + char bad_attr[sizeof(ys->err.msg)]; + const struct nlattr *attr; + const char *str = NULL; + + if (!(nlh->nlmsg_flags & NLM_F_ACK_TLVS)) + return MNL_CB_OK; + + mnl_attr_for_each(attr, nlh, hlen) { + unsigned int len, type; + + len = mnl_attr_get_payload_len(attr); + type = mnl_attr_get_type(attr); + + if (type > NLMSGERR_ATTR_MAX) + continue; + + tb[type] = attr; + + switch (type) { + case NLMSGERR_ATTR_OFFS: + case NLMSGERR_ATTR_MISS_TYPE: + case NLMSGERR_ATTR_MISS_NEST: + if (len != sizeof(__u32)) + return MNL_CB_ERROR; + break; + case NLMSGERR_ATTR_MSG: + str = mnl_attr_get_payload(attr); + if (str[len - 1]) + return MNL_CB_ERROR; + break; + default: + break; + } + } + + bad_attr[0] = '\0'; + miss_attr[0] = '\0'; + + if (tb[NLMSGERR_ATTR_OFFS]) { + unsigned int n, off; + void *start, *end; + + ys->err.attr_offs = mnl_attr_get_u32(tb[NLMSGERR_ATTR_OFFS]); + + n = snprintf(bad_attr, sizeof(bad_attr), "%sbad attribute: ", + str ? " (" : ""); + + start = mnl_nlmsg_get_payload_offset(ys->nlh, + sizeof(struct genlmsghdr)); + end = mnl_nlmsg_get_payload_tail(ys->nlh); + + off = ys->err.attr_offs; + off -= sizeof(struct nlmsghdr); + off -= sizeof(struct genlmsghdr); + + n += ynl_err_walk(ys, start, end, off, ys->req_policy, + &bad_attr[n], sizeof(bad_attr) - n, NULL); + + if (n >= sizeof(bad_attr)) + n = sizeof(bad_attr) - 1; + bad_attr[n] = '\0'; + } + if (tb[NLMSGERR_ATTR_MISS_TYPE]) { + struct ynl_policy_nest *nest_pol = NULL; + unsigned int n, off, type; + void *start, *end; + int n2; + + type = mnl_attr_get_u32(tb[NLMSGERR_ATTR_MISS_TYPE]); + + n = snprintf(miss_attr, sizeof(miss_attr), "%smissing attribute: ", + bad_attr[0] ? ", " : (str ? 
" (" : "")); + + start = mnl_nlmsg_get_payload_offset(ys->nlh, + sizeof(struct genlmsghdr)); + end = mnl_nlmsg_get_payload_tail(ys->nlh); + + nest_pol = ys->req_policy; + if (tb[NLMSGERR_ATTR_MISS_NEST]) { + off = mnl_attr_get_u32(tb[NLMSGERR_ATTR_MISS_NEST]); + off -= sizeof(struct nlmsghdr); + off -= sizeof(struct genlmsghdr); + + n += ynl_err_walk(ys, start, end, off, ys->req_policy, + &miss_attr[n], sizeof(miss_attr) - n, + &nest_pol); + } + + n2 = 0; + ynl_err_walk_report_one(nest_pol, type, &miss_attr[n], + sizeof(miss_attr) - n, &n2); + n += n2; + + if (n >= sizeof(miss_attr)) + n = sizeof(miss_attr) - 1; + miss_attr[n] = '\0'; + } + + /* Implicitly depend on ys->err.code already set */ + if (str) + yerr_msg(ys, "Kernel %s: '%s'%s%s%s", + ys->err.code ? "error" : "warning", + str, bad_attr, miss_attr, + bad_attr[0] || miss_attr[0] ? ")" : ""); + else if (bad_attr[0] || miss_attr[0]) + yerr_msg(ys, "Kernel %s: %s%s", + ys->err.code ? "error" : "warning", + bad_attr, miss_attr); + + return MNL_CB_OK; +} + +static int ynl_cb_error(const struct nlmsghdr *nlh, void *data) +{ + const struct nlmsgerr *err = mnl_nlmsg_get_payload(nlh); + struct ynl_parse_arg *yarg = data; + unsigned int hlen; + int code; + + code = err->error >= 0 ? err->error : -err->error; + yarg->ys->err.code = code; + errno = code; + + hlen = sizeof(*err); + if (!(nlh->nlmsg_flags & NLM_F_CAPPED)) + hlen += mnl_nlmsg_get_payload_len(&err->msg); + + ynl_ext_ack_check(yarg->ys, nlh, hlen); + + return code ? 
MNL_CB_ERROR : MNL_CB_STOP; +} + +static int ynl_cb_done(const struct nlmsghdr *nlh, void *data) +{ + struct ynl_parse_arg *yarg = data; + int err; + + err = *(int *)NLMSG_DATA(nlh); + if (err < 0) { + yarg->ys->err.code = -err; + errno = -err; + + ynl_ext_ack_check(yarg->ys, nlh, sizeof(int)); + + return MNL_CB_ERROR; + } + return MNL_CB_STOP; +} + +static int ynl_cb_noop(const struct nlmsghdr *nlh, void *data) +{ + return MNL_CB_OK; +} + +mnl_cb_t ynl_cb_array[NLMSG_MIN_TYPE] = { + [NLMSG_NOOP] = ynl_cb_noop, + [NLMSG_ERROR] = ynl_cb_error, + [NLMSG_DONE] = ynl_cb_done, + [NLMSG_OVERRUN] = ynl_cb_noop, +}; + +/* Attribute validation */ + +int ynl_attr_validate(struct ynl_parse_arg *yarg, const struct nlattr *attr) +{ + struct ynl_policy_attr *policy; + unsigned int type, len; + unsigned char *data; + + data = mnl_attr_get_payload(attr); + len = mnl_attr_get_payload_len(attr); + type = mnl_attr_get_type(attr); + if (type > yarg->rsp_policy->max_attr) { + yerr(yarg->ys, YNL_ERROR_INTERNAL, + "Internal error, validating unknown attribute"); + return -1; + } + + policy = &yarg->rsp_policy->table[type]; + + switch (policy->type) { + case YNL_PT_REJECT: + yerr(yarg->ys, YNL_ERROR_ATTR_INVALID, + "Rejected attribute (%s)", policy->name); + return -1; + case YNL_PT_IGNORE: + break; + case YNL_PT_U8: + if (len == sizeof(__u8)) + break; + yerr(yarg->ys, YNL_ERROR_ATTR_INVALID, + "Invalid attribute (u8 %s)", policy->name); + return -1; + case YNL_PT_U16: + if (len == sizeof(__u16)) + break; + yerr(yarg->ys, YNL_ERROR_ATTR_INVALID, + "Invalid attribute (u16 %s)", policy->name); + return -1; + case YNL_PT_U32: + if (len == sizeof(__u32)) + break; + yerr(yarg->ys, YNL_ERROR_ATTR_INVALID, + "Invalid attribute (u32 %s)", policy->name); + return -1; + case YNL_PT_U64: + if (len == sizeof(__u64)) + break; + yerr(yarg->ys, YNL_ERROR_ATTR_INVALID, + "Invalid attribute (u64 %s)", policy->name); + return -1; + case YNL_PT_FLAG: + /* Let flags grow into real attrs, why not.. 
*/ + break; + case YNL_PT_NEST: + if (!len || len >= sizeof(*attr)) + break; + yerr(yarg->ys, YNL_ERROR_ATTR_INVALID, + "Invalid attribute (nest %s)", policy->name); + return -1; + case YNL_PT_BINARY: + if (!policy->len || len == policy->len) + break; + yerr(yarg->ys, YNL_ERROR_ATTR_INVALID, + "Invalid attribute (binary %s)", policy->name); + return -1; + case YNL_PT_NUL_STR: + if ((!policy->len || len <= policy->len) && !data[len - 1]) + break; + yerr(yarg->ys, YNL_ERROR_ATTR_INVALID, + "Invalid attribute (string %s)", policy->name); + return -1; + default: + yerr(yarg->ys, YNL_ERROR_ATTR_INVALID, + "Invalid attribute (unknown %s)", policy->name); + return -1; + } + + return 0; +} + +/* Generic code */ + +static void ynl_err_reset(struct ynl_sock *ys) +{ + ys->err.code = 0; + ys->err.attr_offs = 0; + ys->err.msg[0] = 0; +} + +struct nlmsghdr *ynl_msg_start(struct ynl_sock *ys, __u32 id, __u16 flags) +{ + struct nlmsghdr *nlh; + + ynl_err_reset(ys); + + nlh = ys->nlh = mnl_nlmsg_put_header(ys->tx_buf); + nlh->nlmsg_type = id; + nlh->nlmsg_flags = flags; + nlh->nlmsg_seq = ++ys->seq; + + return nlh; +} + +struct nlmsghdr * +ynl_gemsg_start(struct ynl_sock *ys, __u32 id, __u16 flags, + __u8 cmd, __u8 version) +{ + struct genlmsghdr gehdr; + struct nlmsghdr *nlh; + void *data; + + nlh = ynl_msg_start(ys, id, flags); + + memset(&gehdr, 0, sizeof(gehdr)); + gehdr.cmd = cmd; + gehdr.version = version; + + data = mnl_nlmsg_put_extra_header(nlh, sizeof(gehdr)); + memcpy(data, &gehdr, sizeof(gehdr)); + + return nlh; +} + +void ynl_msg_start_req(struct ynl_sock *ys, __u32 id) +{ + ynl_msg_start(ys, id, NLM_F_REQUEST | NLM_F_ACK); +} + +void ynl_msg_start_dump(struct ynl_sock *ys, __u32 id) +{ + ynl_msg_start(ys, id, NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP); +} + +struct nlmsghdr * +ynl_gemsg_start_req(struct ynl_sock *ys, __u32 id, __u8 cmd, __u8 version) +{ + return ynl_gemsg_start(ys, id, NLM_F_REQUEST | NLM_F_ACK, cmd, version); +} + +struct nlmsghdr * 
+ynl_gemsg_start_dump(struct ynl_sock *ys, __u32 id, __u8 cmd, __u8 version) +{ + return ynl_gemsg_start(ys, id, NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP, + cmd, version); +} + +int ynl_recv_ack(struct ynl_sock *ys, int ret) +{ + if (!ret) { + yerr(ys, YNL_ERROR_EXPECT_ACK, + "Expecting an ACK but nothing received"); + return -1; + } + + ret = mnl_socket_recvfrom(ys->sock, ys->rx_buf, MNL_SOCKET_BUFFER_SIZE); + if (ret < 0) { + perr(ys, "Socket receive failed"); + return ret; + } + return mnl_cb_run(ys->rx_buf, ret, ys->seq, ys->portid, + ynl_cb_null, ys); +} + +int ynl_cb_null(const struct nlmsghdr *nlh, void *data) +{ + struct ynl_parse_arg *yarg = data; + + yerr(yarg->ys, YNL_ERROR_UNEXPECT_MSG, + "Received a message when none were expected"); + + return MNL_CB_ERROR; +} + +/* Init/fini and genetlink boiler plate */ +static int +ynl_get_family_info_mcast(struct ynl_sock *ys, const struct nlattr *mcasts) +{ + const struct nlattr *entry, *attr; + unsigned int i; + + mnl_attr_for_each_nested(attr, mcasts) + ys->n_mcast_groups++; + + if (!ys->n_mcast_groups) + return 0; + + ys->mcast_groups = calloc(ys->n_mcast_groups, + sizeof(*ys->mcast_groups)); + if (!ys->mcast_groups) + return MNL_CB_ERROR; + + i = 0; + mnl_attr_for_each_nested(entry, mcasts) { + mnl_attr_for_each_nested(attr, entry) { + if (mnl_attr_get_type(attr) == CTRL_ATTR_MCAST_GRP_ID) + ys->mcast_groups[i].id = mnl_attr_get_u32(attr); + if (mnl_attr_get_type(attr) == CTRL_ATTR_MCAST_GRP_NAME) { + strncpy(ys->mcast_groups[i].name, + mnl_attr_get_str(attr), + GENL_NAMSIZ - 1); + ys->mcast_groups[i].name[GENL_NAMSIZ - 1] = 0; + } + } + } + + return 0; +} + +static int ynl_get_family_info_cb(const struct nlmsghdr *nlh, void *data) +{ + struct ynl_parse_arg *yarg = data; + struct ynl_sock *ys = yarg->ys; + const struct nlattr *attr; + bool found_id = true; + + mnl_attr_for_each(attr, nlh, sizeof(struct genlmsghdr)) { + if (mnl_attr_get_type(attr) == CTRL_ATTR_MCAST_GROUPS) + if (ynl_get_family_info_mcast(ys, 
attr)) + return MNL_CB_ERROR; + + if (mnl_attr_get_type(attr) != CTRL_ATTR_FAMILY_ID) + continue; + + if (mnl_attr_get_payload_len(attr) != sizeof(__u16)) { + yerr(ys, YNL_ERROR_ATTR_INVALID, "Invalid family ID"); + return MNL_CB_ERROR; + } + + ys->family_id = mnl_attr_get_u16(attr); + found_id = true; + } + + if (!found_id) { + yerr(ys, YNL_ERROR_ATTR_MISSING, "Family ID missing"); + return MNL_CB_ERROR; + } + return MNL_CB_OK; +} + +static int ynl_sock_read_family(struct ynl_sock *ys, const char *family_name) +{ + struct ynl_parse_arg yarg = { .ys = ys, }; + struct nlmsghdr *nlh; + int err; + + nlh = ynl_gemsg_start_req(ys, GENL_ID_CTRL, CTRL_CMD_GETFAMILY, 1); + mnl_attr_put_strz(nlh, CTRL_ATTR_FAMILY_NAME, family_name); + + err = mnl_socket_sendto(ys->sock, nlh, nlh->nlmsg_len); + if (err < 0) { + perr(ys, "failed to request socket family info"); + return err; + } + + err = mnl_socket_recvfrom(ys->sock, ys->rx_buf, MNL_SOCKET_BUFFER_SIZE); + if (err <= 0) { + perr(ys, "failed to receive the socket family info"); + return err; + } + err = mnl_cb_run2(ys->rx_buf, err, ys->seq, ys->portid, + ynl_get_family_info_cb, &yarg, + ynl_cb_array, ARRAY_SIZE(ynl_cb_array)); + if (err < 0) { + free(ys->mcast_groups); + perr(ys, "failed to receive the socket family info - no such family?"); + return err; + } + + return ynl_recv_ack(ys, err); +} + +struct ynl_sock * +ynl_sock_create(const struct ynl_family *yf, struct ynl_error *yse) +{ + struct ynl_sock *ys; + int one = 1; + + ys = malloc(sizeof(*ys) + 2 * MNL_SOCKET_BUFFER_SIZE); + if (!ys) + return NULL; + memset(ys, 0, sizeof(*ys)); + + ys->family = yf; + ys->tx_buf = &ys->raw_buf[0]; + ys->rx_buf = &ys->raw_buf[MNL_SOCKET_BUFFER_SIZE]; + ys->ntf_last_next = &ys->ntf_first; + + ys->sock = mnl_socket_open(NETLINK_GENERIC); + if (!ys->sock) { + __perr(yse, "failed to create a netlink socket"); + goto err_free_sock; + } + + if (mnl_socket_setsockopt(ys->sock, NETLINK_CAP_ACK, + &one, sizeof(one))) { + __perr(yse, "failed to 
enable netlink ACK"); + goto err_close_sock; + } + if (mnl_socket_setsockopt(ys->sock, NETLINK_EXT_ACK, + &one, sizeof(one))) { + __perr(yse, "failed to enable netlink ext ACK"); + goto err_close_sock; + } + + ys->seq = random(); + ys->portid = mnl_socket_get_portid(ys->sock); + + if (ynl_sock_read_family(ys, yf->name)) { + if (yse) + memcpy(yse, &ys->err, sizeof(*yse)); + goto err_close_sock; + } + + return ys; + +err_close_sock: + mnl_socket_close(ys->sock); +err_free_sock: + free(ys); + return NULL; +} + +void ynl_sock_destroy(struct ynl_sock *ys) +{ + struct ynl_ntf_base_type *ntf; + + mnl_socket_close(ys->sock); + while ((ntf = ynl_ntf_dequeue(ys))) + ynl_ntf_free(ntf); + free(ys->mcast_groups); + free(ys); +} + +/* YNL multicast handling */ + +void ynl_ntf_free(struct ynl_ntf_base_type *ntf) +{ + ntf->free(ntf); +} + +int ynl_subscribe(struct ynl_sock *ys, const char *grp_name) +{ + unsigned int i; + int err; + + for (i = 0; i < ys->n_mcast_groups; i++) + if (!strcmp(ys->mcast_groups[i].name, grp_name)) + break; + if (i == ys->n_mcast_groups) { + yerr(ys, ENOENT, "Multicast group '%s' not found", grp_name); + return -1; + } + + err = mnl_socket_setsockopt(ys->sock, NETLINK_ADD_MEMBERSHIP, + &ys->mcast_groups[i].id, + sizeof(ys->mcast_groups[i].id)); + if (err < 0) { + perr(ys, "Subscribing to multicast group failed"); + return -1; + } + + return 0; +} + +int ynl_socket_get_fd(struct ynl_sock *ys) +{ + return mnl_socket_get_fd(ys->sock); +} + +struct ynl_ntf_base_type *ynl_ntf_dequeue(struct ynl_sock *ys) +{ + struct ynl_ntf_base_type *ntf; + + if (!ynl_has_ntf(ys)) + return NULL; + + ntf = ys->ntf_first; + ys->ntf_first = ntf->next; + if (ys->ntf_last_next == &ntf->next) + ys->ntf_last_next = &ys->ntf_first; + + return ntf; +} + +static int ynl_ntf_parse(struct ynl_sock *ys, const struct nlmsghdr *nlh) +{ + struct ynl_parse_arg yarg = { .ys = ys, }; + const struct ynl_ntf_info *info; + struct ynl_ntf_base_type *rsp; + struct genlmsghdr *gehdr; + int ret; + + 
gehdr = mnl_nlmsg_get_payload(nlh); + if (gehdr->cmd >= ys->family->ntf_info_size) + return MNL_CB_ERROR; + info = &ys->family->ntf_info[gehdr->cmd]; + if (!info->cb) + return MNL_CB_ERROR; + + rsp = calloc(1, info->alloc_sz); + rsp->free = info->free; + yarg.data = rsp->data; + yarg.rsp_policy = info->policy; + + ret = info->cb(nlh, &yarg); + if (ret <= MNL_CB_STOP) + goto err_free; + + rsp->family = nlh->nlmsg_type; + rsp->cmd = gehdr->cmd; + + *ys->ntf_last_next = rsp; + ys->ntf_last_next = &rsp->next; + + return MNL_CB_OK; + +err_free: + info->free(rsp); + return MNL_CB_ERROR; +} + +static int ynl_ntf_trampoline(const struct nlmsghdr *nlh, void *data) +{ + return ynl_ntf_parse((struct ynl_sock *)data, nlh); +} + +int ynl_ntf_check(struct ynl_sock *ys) +{ + ssize_t len; + int err; + + do { + /* libmnl doesn't let us pass flags to the recv to make + * it non-blocking so we need to poll() or peek() :| + */ + struct pollfd pfd = { }; + + pfd.fd = mnl_socket_get_fd(ys->sock); + pfd.events = POLLIN; + err = poll(&pfd, 1, 1); + if (err < 1) + return err; + + len = mnl_socket_recvfrom(ys->sock, ys->rx_buf, + MNL_SOCKET_BUFFER_SIZE); + if (len < 0) + return len; + + err = mnl_cb_run2(ys->rx_buf, len, ys->seq, ys->portid, + ynl_ntf_trampoline, ys, + ynl_cb_array, NLMSG_MIN_TYPE); + if (err < 0) + return err; + } while (err > 0); + + return 0; +} + +/* YNL specific helpers used by the auto-generated code */ + +struct ynl_dump_list_type *YNL_LIST_END = (void *)(0xb4d123); + +void ynl_error_unknown_notification(struct ynl_sock *ys, __u8 cmd) +{ + yerr(ys, YNL_ERROR_UNKNOWN_NTF, + "Unknown notification message type '%d'", cmd); +} + +int ynl_error_parse(struct ynl_parse_arg *yarg, const char *msg) +{ + yerr(yarg->ys, YNL_ERROR_INV_RESP, "Error parsing response: %s", msg); + return MNL_CB_ERROR; +} + +static int +ynl_check_alien(struct ynl_sock *ys, const struct nlmsghdr *nlh, __u32 rsp_cmd) +{ + struct genlmsghdr *gehdr; + + if (mnl_nlmsg_get_payload_len(nlh) < 
sizeof(*gehdr)) { + yerr(ys, YNL_ERROR_INV_RESP, + "Kernel responded with truncated message"); + return -1; + } + + gehdr = mnl_nlmsg_get_payload(nlh); + if (gehdr->cmd != rsp_cmd) + return ynl_ntf_parse(ys, nlh); + + return 0; +} + +static int ynl_req_trampoline(const struct nlmsghdr *nlh, void *data) +{ + struct ynl_req_state *yrs = data; + int ret; + + ret = ynl_check_alien(yrs->yarg.ys, nlh, yrs->rsp_cmd); + if (ret) + return ret < 0 ? MNL_CB_ERROR : MNL_CB_OK; + + return yrs->cb(nlh, &yrs->yarg); +} + +int ynl_exec(struct ynl_sock *ys, struct nlmsghdr *req_nlh, + struct ynl_req_state *yrs) +{ + ssize_t len; + int err; + + err = mnl_socket_sendto(ys->sock, req_nlh, req_nlh->nlmsg_len); + if (err < 0) + return err; + + do { + len = mnl_socket_recvfrom(ys->sock, ys->rx_buf, + MNL_SOCKET_BUFFER_SIZE); + if (len < 0) + return len; + + err = mnl_cb_run2(ys->rx_buf, len, ys->seq, ys->portid, + ynl_req_trampoline, yrs, + ynl_cb_array, NLMSG_MIN_TYPE); + if (err < 0) + return err; + } while (err > 0); + + return 0; +} + +static int ynl_dump_trampoline(const struct nlmsghdr *nlh, void *data) +{ + struct ynl_dump_state *ds = data; + struct ynl_dump_list_type *obj; + struct ynl_parse_arg yarg = {}; + int ret; + + ret = ynl_check_alien(ds->ys, nlh, ds->rsp_cmd); + if (ret) + return ret < 0 ? 
MNL_CB_ERROR : MNL_CB_OK; + + obj = calloc(1, ds->alloc_sz); + if (!obj) + return MNL_CB_ERROR; + + if (!ds->first) + ds->first = obj; + if (ds->last) + ds->last->next = obj; + ds->last = obj; + + yarg.ys = ds->ys; + yarg.rsp_policy = ds->rsp_policy; + yarg.data = &obj->data; + + return ds->cb(nlh, &yarg); +} + +static void *ynl_dump_end(struct ynl_dump_state *ds) +{ + if (!ds->first) + return YNL_LIST_END; + + ds->last->next = YNL_LIST_END; + return ds->first; +} + +int ynl_exec_dump(struct ynl_sock *ys, struct nlmsghdr *req_nlh, + struct ynl_dump_state *yds) +{ + ssize_t len; + int err; + + err = mnl_socket_sendto(ys->sock, req_nlh, req_nlh->nlmsg_len); + if (err < 0) + return err; + + do { + len = mnl_socket_recvfrom(ys->sock, ys->rx_buf, + MNL_SOCKET_BUFFER_SIZE); + if (len < 0) + goto err_close_list; + + err = mnl_cb_run2(ys->rx_buf, len, ys->seq, ys->portid, + ynl_dump_trampoline, yds, + ynl_cb_array, NLMSG_MIN_TYPE); + if (err < 0) + goto err_close_list; + } while (err > 0); + + yds->first = ynl_dump_end(yds); + return 0; + +err_close_list: + yds->first = ynl_dump_end(yds); + return -1; +} diff --git a/tools/net/ynl/lib/ynl.h b/tools/net/ynl/lib/ynl.h new file mode 100644 index 000000000000..9eafa3552c16 --- /dev/null +++ b/tools/net/ynl/lib/ynl.h @@ -0,0 +1,237 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause +#ifndef __YNL_C_H +#define __YNL_C_H 1 + +#include +#include +#include +#include + +struct mnl_socket; +struct nlmsghdr; + +/* + * User facing code + */ + +struct ynl_ntf_base_type; +struct ynl_ntf_info; +struct ynl_sock; + +enum ynl_error_code { + YNL_ERROR_NONE = 0, + __YNL_ERRNO_END = 4096, + YNL_ERROR_INTERNAL, + YNL_ERROR_EXPECT_ACK, + YNL_ERROR_EXPECT_MSG, + YNL_ERROR_UNEXPECT_MSG, + YNL_ERROR_ATTR_MISSING, + YNL_ERROR_ATTR_INVALID, + YNL_ERROR_UNKNOWN_NTF, + YNL_ERROR_INV_RESP, +}; + +/** + * struct ynl_error - error encountered by YNL + * @code: errno (low values) or YNL error code (enum ynl_error_code) + * @attr_offs: offset of bad 
attribute (for very advanced users) + * @msg: error message + * + * Error information for when YNL operations fail. + * Users should interact with the err member of struct ynl_sock directly. + * The main exception to that rule is ynl_sock_create(). + */ +struct ynl_error { + enum ynl_error_code code; + unsigned int attr_offs; + char msg[512]; +}; + +/** + * struct ynl_family - YNL family info + * Family description generated by codegen. Pass to ynl_sock_create(). + */ +struct ynl_family { +/* private: */ + const char *name; + const struct ynl_ntf_info *ntf_info; + unsigned int ntf_info_size; +}; + +/** + * struct ynl_sock - YNL wrapped netlink socket + * @err: YNL error descriptor, cleared on every request. + */ +struct ynl_sock { + struct ynl_error err; + +/* private: */ + const struct ynl_family *family; + struct mnl_socket *sock; + __u32 seq; + __u32 portid; + __u16 family_id; + + unsigned int n_mcast_groups; + struct { + unsigned int id; + char name[GENL_NAMSIZ]; + } *mcast_groups; + + struct ynl_ntf_base_type *ntf_first; + struct ynl_ntf_base_type **ntf_last_next; + + struct nlmsghdr *nlh; + struct ynl_policy_nest *req_policy; + unsigned char *tx_buf; + unsigned char *rx_buf; + unsigned char raw_buf[]; +}; + +struct ynl_sock * +ynl_sock_create(const struct ynl_family *yf, struct ynl_error *e); +void ynl_sock_destroy(struct ynl_sock *ys); + +#define ynl_dump_foreach(dump, iter) \ + for (typeof(dump->obj) *iter = &dump->obj; \ + !ynl_dump_obj_is_last(iter); \ + iter = ynl_dump_obj_next(iter)) + +int ynl_subscribe(struct ynl_sock *ys, const char *grp_name); +int ynl_socket_get_fd(struct ynl_sock *ys); +int ynl_ntf_check(struct ynl_sock *ys); + +/** + * ynl_has_ntf() - check if socket has *parsed* notifications + * @ys: active YNL socket + * + * Note that this does not take into account notifications sitting + * in netlink socket, just the notifications which have already been + * read and parsed (e.g. during a ynl_ntf_check() call). 
+ */ +static inline bool ynl_has_ntf(struct ynl_sock *ys) +{ + return ys->ntf_last_next != &ys->ntf_first; +} +struct ynl_ntf_base_type *ynl_ntf_dequeue(struct ynl_sock *ys); + +void ynl_ntf_free(struct ynl_ntf_base_type *ntf); + +/* + * YNL internals / low level stuff + */ + +/* Generic mnl helper code */ + +enum ynl_policy_type { + YNL_PT_REJECT = 1, + YNL_PT_IGNORE, + YNL_PT_NEST, + YNL_PT_FLAG, + YNL_PT_BINARY, + YNL_PT_U8, + YNL_PT_U16, + YNL_PT_U32, + YNL_PT_U64, + YNL_PT_NUL_STR, +}; + +struct ynl_policy_attr { + enum ynl_policy_type type; + unsigned int len; + const char *name; + struct ynl_policy_nest *nest; +}; + +struct ynl_policy_nest { + unsigned int max_attr; + struct ynl_policy_attr *table; +}; + +struct ynl_parse_arg { + struct ynl_sock *ys; + struct ynl_policy_nest *rsp_policy; + void *data; +}; + +struct ynl_dump_list_type { + struct ynl_dump_list_type *next; + unsigned char data[] __attribute__ ((aligned (8))); +}; +extern struct ynl_dump_list_type *YNL_LIST_END; + +static inline bool ynl_dump_obj_is_last(void *obj) +{ + unsigned long uptr = (unsigned long)obj; + + uptr -= offsetof(struct ynl_dump_list_type, data); + return uptr == (unsigned long)YNL_LIST_END; +} + +static inline void *ynl_dump_obj_next(void *obj) +{ + unsigned long uptr = (unsigned long)obj; + struct ynl_dump_list_type *list; + + uptr -= offsetof(struct ynl_dump_list_type, data); + list = (void *)uptr; + uptr = (unsigned long)list->next; + uptr += offsetof(struct ynl_dump_list_type, data); + + return (void *)uptr; +} + +struct ynl_ntf_base_type { + __u16 family; + __u8 cmd; + struct ynl_ntf_base_type *next; + void (*free)(struct ynl_ntf_base_type *ntf); + unsigned char data[] __attribute__ ((aligned (8))); +}; + +extern mnl_cb_t ynl_cb_array[NLMSG_MIN_TYPE]; + +struct nlmsghdr * +ynl_gemsg_start_req(struct ynl_sock *ys, __u32 id, __u8 cmd, __u8 version); +struct nlmsghdr * +ynl_gemsg_start_dump(struct ynl_sock *ys, __u32 id, __u8 cmd, __u8 version); + +int 
ynl_attr_validate(struct ynl_parse_arg *yarg, const struct nlattr *attr); + +int ynl_recv_ack(struct ynl_sock *ys, int ret); +int ynl_cb_null(const struct nlmsghdr *nlh, void *data); + +/* YNL specific helpers used by the auto-generated code */ + +struct ynl_req_state { + struct ynl_parse_arg yarg; + mnl_cb_t cb; + __u32 rsp_cmd; +}; + +struct ynl_dump_state { + struct ynl_sock *ys; + struct ynl_policy_nest *rsp_policy; + void *first; + struct ynl_dump_list_type *last; + size_t alloc_sz; + mnl_cb_t cb; + __u32 rsp_cmd; +}; + +struct ynl_ntf_info { + struct ynl_policy_nest *policy; + mnl_cb_t cb; + size_t alloc_sz; + void (*free)(struct ynl_ntf_base_type *ntf); +}; + +int ynl_exec(struct ynl_sock *ys, struct nlmsghdr *req_nlh, + struct ynl_req_state *yrs); +int ynl_exec_dump(struct ynl_sock *ys, struct nlmsghdr *req_nlh, + struct ynl_dump_state *yds); + +void ynl_error_unknown_notification(struct ynl_sock *ys, __u8 cmd); +int ynl_error_parse(struct ynl_parse_arg *yarg, const char *msg); + +#endif diff --git a/tools/net/ynl/ynl-regen.sh b/tools/net/ynl/ynl-regen.sh index 74f5de1c2399..2a4525e2aa17 100755 --- a/tools/net/ynl/ynl-regen.sh +++ b/tools/net/ynl/ynl-regen.sh @@ -14,7 +14,7 @@ done KDIR=$(dirname $(dirname $(dirname $(dirname $(realpath $0))))) -files=$(git grep --files-with-matches '^/\* YNL-GEN \(kernel\|uapi\)') +files=$(git grep --files-with-matches '^/\* YNL-GEN \(kernel\|uapi\|user\)') for f in $files; do # params: 0 1 2 3 # $YAML YNL-GEN kernel $mode -- cgit v1.2.3 From 350b7a258f20427a411a888e5af0684327d49e3a Mon Sep 17 00:00:00 2001 From: Detlev Casanova Date: Mon, 5 Jun 2023 11:40:09 -0400 Subject: dt-bindings: net: phy: Document support for external PHY clk Ethernet PHYs can have an external clock that needs to be activated before communicating with the PHY. Acked-by: Krzysztof Kozlowski Signed-off-by: Detlev Casanova Signed-off-by: David S.
Miller --- Documentation/devicetree/bindings/net/ethernet-phy.yaml | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/ethernet-phy.yaml b/Documentation/devicetree/bindings/net/ethernet-phy.yaml index 4f574532ee13..c1241c8a3b77 100644 --- a/Documentation/devicetree/bindings/net/ethernet-phy.yaml +++ b/Documentation/devicetree/bindings/net/ethernet-phy.yaml @@ -93,6 +93,12 @@ properties: the turn around line low at end of the control phase of the MDIO transaction. + clocks: + maxItems: 1 + description: + External clock connected to the PHY. If not specified it is assumed + that the PHY uses a fixed crystal or an internal oscillator. + enet-phy-lane-swap: $ref: /schemas/types.yaml#/definitions/flag description: -- cgit v1.2.3 From a33682e4e78e249155abbe5e8ee880d5760b5e28 Mon Sep 17 00:00:00 2001 From: Lama Kayal Date: Tue, 6 Jun 2023 00:12:14 -0700 Subject: net/mlx5e: Expose catastrophic steering error counters Add generated_pkt_steering_fail and handled_pkt_steering_fail to the devlink health reporter. generated_pkt_steering_fail indicates the number of packets dropped due to an illegal steering operation within the vport steering domain. handled_pkt_steering_fail indicates the number of packets dropped due to an illegal steering operation originated by the vport. Also, update the devlink reporter functionality documentation with the newly exposed counters.
Signed-off-by: Lama Kayal Reviewed-by: Rahul Rameshbabu Signed-off-by: Saeed Mahameed --- .../device_drivers/ethernet/mellanox/mlx5/devlink.rst | 7 +++++++ drivers/net/ethernet/mellanox/mlx5/core/diag/reporter_vnic.c | 10 ++++++++++ include/linux/mlx5/mlx5_ifc.h | 12 ++++++++++-- 3 files changed, 27 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst index 3354ca3608ee..a4edf908b707 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst @@ -290,6 +290,13 @@ Description of the vnic counters: - nic_receive_steering_discard number of packets that completed RX flow steering but were discarded due to a mismatch in flow table. +- generated_pkt_steering_fail + number of packets generated by the VNIC experiencing unexpected steering + failure (at any point in steering flow). +- handled_pkt_steering_fail + number of packets handled by the VNIC experiencing unexpected steering + failure (at any point in steering flow owned by the VNIC, including the FDB + for the eswitch owner). 
User commands examples: diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/reporter_vnic.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/reporter_vnic.c index 9114661cd967..b0128336ff01 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/diag/reporter_vnic.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/reporter_vnic.c @@ -76,6 +76,16 @@ int mlx5_reporter_vnic_diagnose_counters(struct mlx5_core_dev *dev, if (err) return err; + err = devlink_fmsg_u64_pair_put(fmsg, "generated_pkt_steering_fail", + VNIC_ENV_GET64(&vnic, generated_pkt_steering_fail)); + if (err) + return err; + + err = devlink_fmsg_u64_pair_put(fmsg, "handled_pkt_steering_fail", + VNIC_ENV_GET64(&vnic, handled_pkt_steering_fail)); + if (err) + return err; + err = devlink_fmsg_obj_nest_end(fmsg); if (err) return err; diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index b89778d0d326..af3a92ad2e6b 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -1755,7 +1755,9 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 reserved_at_328[0x2]; u8 relaxed_ordering_read[0x1]; u8 log_max_pd[0x5]; - u8 reserved_at_330[0x9]; + u8 reserved_at_330[0x7]; + u8 vnic_env_cnt_steering_fail[0x1]; + u8 reserved_at_338[0x1]; u8 q_counter_aggregation[0x1]; u8 q_counter_other_vport[0x1]; u8 log_max_xrcd[0x5]; @@ -3673,7 +3675,13 @@ struct mlx5_ifc_vnic_diagnostic_statistics_bits { u8 eth_wqe_too_small[0x20]; - u8 reserved_at_220[0xdc0]; + u8 reserved_at_220[0xc0]; + + u8 generated_pkt_steering_fail[0x40]; + + u8 handled_pkt_steering_fail[0x40]; + + u8 reserved_at_360[0xc80]; }; struct mlx5_ifc_traffic_counter_bits { -- cgit v1.2.3 From 8947e503737138ff92323f99637d921451fe398a Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Wed, 7 Jun 2023 13:23:53 -0700 Subject: netlink: specs: devlink: fill in some details important for C Python YNL is much more forgiving than the C code gen in terms of the spec completeness. 
Fill in a handful of devlink details to make the spec usable in C. Signed-off-by: Jakub Kicinski --- Documentation/netlink/specs/devlink.yaml | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'Documentation') diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml index 90641668232e..5d46ca966979 100644 --- a/Documentation/netlink/specs/devlink.yaml +++ b/Documentation/netlink/specs/devlink.yaml @@ -9,6 +9,7 @@ doc: Partial family for Devlink. attribute-sets: - name: devlink + name-prefix: devlink-attr- attributes: - name: bus-name @@ -95,10 +96,12 @@ attribute-sets: - name: reload-action-info type: nest + multi-attr: true nested-attributes: dl-reload-act-info - name: reload-action-stats type: nest + multi-attr: true nested-attributes: dl-reload-act-stats - name: dl-dev-stats @@ -196,3 +199,8 @@ operations: attributes: - bus-name - dev-name + - info-driver-name + - info-serial-number + - info-version-fixed + - info-version-running + - info-version-stored -- cgit v1.2.3 From e71383fb9cd15a28d6c01d2c165a96f1c0bcf418 Mon Sep 17 00:00:00 2001 From: Shay Drory Date: Wed, 3 May 2023 14:18:23 +0300 Subject: net/mlx5: Light probe local SFs In case user wants to configure the SFs, for example: to use only vdpa functionality, he needs to fully probe a SF, configure what he wants, and afterward reload the SF. In order to save the time of the reload, local SFs will probe without any auxiliary sub-device, so that the SFs can be configured prior to its full probe. The defaults of the enable_* devlink params of these SFs are set to false. 
Usage example: Create SF: $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11 $ devlink port function set pci/0000:08:00.0/32768 \ hw_addr 00:00:00:00:00:11 state active Enable ETH auxiliary device: $ devlink dev param set auxiliary/mlx5_core.sf.1 \ name enable_eth value true cmode driverinit Now, in order to fully probe the SF, use devlink reload: $ devlink dev reload auxiliary/mlx5_core.sf.1 At this point the user has an SF devlink instance with an auxiliary device for the Ethernet functionality only. Signed-off-by: Shay Drory Reviewed-by: Moshe Shemesh Signed-off-by: Saeed Mahameed --- .../ethernet/mellanox/mlx5/switchdev.rst | 20 ++++ drivers/net/ethernet/mellanox/mlx5/core/dev.c | 16 +++ drivers/net/ethernet/mellanox/mlx5/core/devlink.c | 20 +++- drivers/net/ethernet/mellanox/mlx5/core/health.c | 24 ++-- drivers/net/ethernet/mellanox/mlx5/core/main.c | 124 +++++++++++++++++++-- .../net/ethernet/mellanox/mlx5/core/mlx5_core.h | 7 ++ .../ethernet/mellanox/mlx5/core/sf/dev/driver.c | 15 ++- 7 files changed, 203 insertions(+), 23 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst index 01deedb71597..db62187eebce 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst @@ -45,6 +45,26 @@ Following bridge VLAN functions are supported by mlx5: Subfunction =========== +Subfunction which are spawned over the E-switch are created only with devlink +device, and by default all the SF auxiliary devices are disabled. +This will allow user to configure the SF before the SF have been fully probed, +which will save time.
+ +Usage example: +Create SF: +$ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11 +$ devlink port function set pci/0000:08:00.0/32768 \ + hw_addr 00:00:00:00:00:11 state active + +Enable ETH auxiliary device: +$ devlink dev param set auxiliary/mlx5_core.sf.1 \ + name enable_eth value true cmode driverinit + +Now, in order to fully probe the SF, use devlink reload: +$ devlink dev reload auxiliary/mlx5_core.sf.1 + +mlx5 supports ETH,rdma and vdpa (vnet) auxiliary devices devlink params (see :ref:`Documentation/networking/devlink/devlink-params.rst`) + mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst `) interface. A subfunction has its own function capabilities and its own resources. This diff --git a/drivers/net/ethernet/mellanox/mlx5/core/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/dev.c index 1b33533b15de..617ac7e5d75c 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/dev.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/dev.c @@ -323,6 +323,18 @@ static void del_adev(struct auxiliary_device *adev) auxiliary_device_uninit(adev); } +void mlx5_dev_set_lightweight(struct mlx5_core_dev *dev) +{ + mutex_lock(&mlx5_intf_mutex); + dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV; + mutex_unlock(&mlx5_intf_mutex); +} + +bool mlx5_dev_is_lightweight(struct mlx5_core_dev *dev) +{ + return dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV; +} + int mlx5_attach_device(struct mlx5_core_dev *dev) { struct mlx5_priv *priv = &dev->priv; @@ -457,6 +469,10 @@ static int add_drivers(struct mlx5_core_dev *dev) if (priv->adev[i]) continue; + if (mlx5_adev_devices[i].is_enabled && + !(mlx5_adev_devices[i].is_enabled(dev))) + continue; + if (mlx5_adev_devices[i].is_supported) is_supported = mlx5_adev_devices[i].is_supported(dev); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c index 27197acdb4d8..3d82ec890666 100644 --- 
a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c @@ -141,6 +141,13 @@ static int mlx5_devlink_reload_down(struct devlink *devlink, bool netns_change, bool sf_dev_allocated; int ret = 0; + if (mlx5_dev_is_lightweight(dev)) { + if (action != DEVLINK_RELOAD_ACTION_DRIVER_REINIT) + return -EOPNOTSUPP; + mlx5_unload_one_light(dev); + return 0; + } + sf_dev_allocated = mlx5_sf_dev_allocated(dev); if (sf_dev_allocated) { /* Reload results in deleting SF device which further results in @@ -193,6 +200,10 @@ static int mlx5_devlink_reload_up(struct devlink *devlink, enum devlink_reload_a *actions_performed = BIT(action); switch (action) { case DEVLINK_RELOAD_ACTION_DRIVER_REINIT: + if (mlx5_dev_is_lightweight(dev)) { + mlx5_fw_reporters_create(dev); + return mlx5_init_one_devl_locked(dev); + } ret = mlx5_load_one_devl_locked(dev, false); break; case DEVLINK_RELOAD_ACTION_FW_ACTIVATE: @@ -511,7 +522,7 @@ static void mlx5_devlink_set_params_init_values(struct devlink *devlink) struct mlx5_core_dev *dev = devlink_priv(devlink); union devlink_param_value value; - value.vbool = MLX5_CAP_GEN(dev, roce); + value.vbool = MLX5_CAP_GEN(dev, roce) && !mlx5_dev_is_lightweight(dev); devl_param_driverinit_value_set(devlink, DEVLINK_PARAM_GENERIC_ID_ENABLE_ROCE, value); @@ -561,7 +572,7 @@ static int mlx5_devlink_eth_params_register(struct devlink *devlink) if (err) return err; - value.vbool = true; + value.vbool = !mlx5_dev_is_lightweight(dev); devl_param_driverinit_value_set(devlink, DEVLINK_PARAM_GENERIC_ID_ENABLE_ETH, value); @@ -601,6 +612,7 @@ static const struct devlink_param mlx5_devlink_rdma_params[] = { static int mlx5_devlink_rdma_params_register(struct devlink *devlink) { + struct mlx5_core_dev *dev = devlink_priv(devlink); union devlink_param_value value; int err; @@ -612,7 +624,7 @@ static int mlx5_devlink_rdma_params_register(struct devlink *devlink) if (err) return err; - value.vbool = true; + value.vbool = 
!mlx5_dev_is_lightweight(dev); devl_param_driverinit_value_set(devlink, DEVLINK_PARAM_GENERIC_ID_ENABLE_RDMA, value); @@ -647,7 +659,7 @@ static int mlx5_devlink_vnet_params_register(struct devlink *devlink) if (err) return err; - value.vbool = true; + value.vbool = !mlx5_dev_is_lightweight(dev); devl_param_driverinit_value_set(devlink, DEVLINK_PARAM_GENERIC_ID_ENABLE_VNET, value); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c index 871c32dda66e..210100a4064a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/health.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c @@ -719,7 +719,7 @@ static const struct devlink_health_reporter_ops mlx5_fw_fatal_reporter_ops = { #define MLX5_FW_REPORTER_VF_GRACEFUL_PERIOD 30000 #define MLX5_FW_REPORTER_DEFAULT_GRACEFUL_PERIOD MLX5_FW_REPORTER_VF_GRACEFUL_PERIOD -static void mlx5_fw_reporters_create(struct mlx5_core_dev *dev) +void mlx5_fw_reporters_create(struct mlx5_core_dev *dev) { struct mlx5_core_health *health = &dev->priv.health; struct devlink *devlink = priv_to_devlink(dev); @@ -735,17 +735,17 @@ static void mlx5_fw_reporters_create(struct mlx5_core_dev *dev) } health->fw_reporter = - devlink_health_reporter_create(devlink, &mlx5_fw_reporter_ops, - 0, dev); + devl_health_reporter_create(devlink, &mlx5_fw_reporter_ops, + 0, dev); if (IS_ERR(health->fw_reporter)) mlx5_core_warn(dev, "Failed to create fw reporter, err = %ld\n", PTR_ERR(health->fw_reporter)); health->fw_fatal_reporter = - devlink_health_reporter_create(devlink, - &mlx5_fw_fatal_reporter_ops, - grace_period, - dev); + devl_health_reporter_create(devlink, + &mlx5_fw_fatal_reporter_ops, + grace_period, + dev); if (IS_ERR(health->fw_fatal_reporter)) mlx5_core_warn(dev, "Failed to create fw fatal reporter, err = %ld\n", PTR_ERR(health->fw_fatal_reporter)); @@ -777,7 +777,8 @@ void mlx5_trigger_health_work(struct mlx5_core_dev *dev) { struct mlx5_core_health *health = &dev->priv.health; - 
queue_work(health->wq, &health->fatal_report_work); + if (!mlx5_dev_is_lightweight(dev)) + queue_work(health->wq, &health->fatal_report_work); } #define MLX5_MSEC_PER_HOUR (MSEC_PER_SEC * 60 * 60) @@ -905,10 +906,15 @@ void mlx5_health_cleanup(struct mlx5_core_dev *dev) int mlx5_health_init(struct mlx5_core_dev *dev) { + struct devlink *devlink = priv_to_devlink(dev); struct mlx5_core_health *health; char *name; - mlx5_fw_reporters_create(dev); + if (!mlx5_dev_is_lightweight(dev)) { + devl_lock(devlink); + mlx5_fw_reporters_create(dev); + devl_unlock(devlink); + } mlx5_reporter_vnic_create(dev); health = &dev->priv.health; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index 0faae77d84e6..6fa314f8e5ee 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -1424,12 +1424,11 @@ static void mlx5_unload(struct mlx5_core_dev *dev) mlx5_put_uars_page(dev, dev->priv.uar); } -int mlx5_init_one(struct mlx5_core_dev *dev) +int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev) { - struct devlink *devlink = priv_to_devlink(dev); + bool light_probe = mlx5_dev_is_lightweight(dev); int err = 0; - devl_lock(devlink); mutex_lock(&dev->intf_state_mutex); dev->state = MLX5_DEVICE_STATE_UP; @@ -1443,9 +1442,14 @@ int mlx5_init_one(struct mlx5_core_dev *dev) goto function_teardown; } - err = mlx5_devlink_params_register(priv_to_devlink(dev)); - if (err) - goto err_devlink_params_reg; + /* In case of light_probe, mlx5_devlink is already registered. + * Hence, don't register devlink again. 
+ */ + if (!light_probe) { + err = mlx5_devlink_params_register(priv_to_devlink(dev)); + if (err) + goto err_devlink_params_reg; + } err = mlx5_load(dev); if (err) @@ -1458,14 +1462,14 @@ int mlx5_init_one(struct mlx5_core_dev *dev) goto err_register; mutex_unlock(&dev->intf_state_mutex); - devl_unlock(devlink); return 0; err_register: clear_bit(MLX5_INTERFACE_STATE_UP, &dev->intf_state); mlx5_unload(dev); err_load: - mlx5_devlink_params_unregister(priv_to_devlink(dev)); + if (!light_probe) + mlx5_devlink_params_unregister(priv_to_devlink(dev)); err_devlink_params_reg: mlx5_cleanup_once(dev); function_teardown: @@ -1473,6 +1477,16 @@ function_teardown: err_function: dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR; mutex_unlock(&dev->intf_state_mutex); + return err; +} + +int mlx5_init_one(struct mlx5_core_dev *dev) +{ + struct devlink *devlink = priv_to_devlink(dev); + int err; + + devl_lock(devlink); + err = mlx5_init_one_devl_locked(dev); devl_unlock(devlink); return err; } @@ -1590,6 +1604,100 @@ void mlx5_unload_one(struct mlx5_core_dev *dev, bool suspend) devl_unlock(devlink); } +/* In case of light probe, we don't need a full query of hca_caps, but only the bellow caps. + * A full query of hca_caps will be done when the device will reload. 
+ */ +static int mlx5_query_hca_caps_light(struct mlx5_core_dev *dev) +{ + int err; + + err = mlx5_core_get_caps(dev, MLX5_CAP_GENERAL); + if (err) + return err; + + if (MLX5_CAP_GEN(dev, eth_net_offloads)) { + err = mlx5_core_get_caps(dev, MLX5_CAP_ETHERNET_OFFLOADS); + if (err) + return err; + } + + if (MLX5_CAP_GEN(dev, nic_flow_table) || + MLX5_CAP_GEN(dev, ipoib_enhanced_offloads)) { + err = mlx5_core_get_caps(dev, MLX5_CAP_FLOW_TABLE); + if (err) + return err; + } + + if (MLX5_CAP_GEN_64(dev, general_obj_types) & + MLX5_GENERAL_OBJ_TYPES_CAP_VIRTIO_NET_Q) { + err = mlx5_core_get_caps(dev, MLX5_CAP_VDPA_EMULATION); + if (err) + return err; + } + + return 0; +} + +int mlx5_init_one_light(struct mlx5_core_dev *dev) +{ + struct devlink *devlink = priv_to_devlink(dev); + int err; + + dev->state = MLX5_DEVICE_STATE_UP; + err = mlx5_function_enable(dev, true, mlx5_tout_ms(dev, FW_PRE_INIT_TIMEOUT)); + if (err) { + mlx5_core_warn(dev, "mlx5_function_enable err=%d\n", err); + goto out; + } + + err = mlx5_query_hca_caps_light(dev); + if (err) { + mlx5_core_warn(dev, "mlx5_query_hca_caps_light err=%d\n", err); + goto query_hca_caps_err; + } + + devl_lock(devlink); + err = mlx5_devlink_params_register(priv_to_devlink(dev)); + devl_unlock(devlink); + if (err) { + mlx5_core_warn(dev, "mlx5_devlink_param_reg err = %d\n", err); + goto query_hca_caps_err; + } + + return 0; + +query_hca_caps_err: + mlx5_function_disable(dev, true); +out: + dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR; + return err; +} + +void mlx5_uninit_one_light(struct mlx5_core_dev *dev) +{ + struct devlink *devlink = priv_to_devlink(dev); + + devl_lock(devlink); + mlx5_devlink_params_unregister(priv_to_devlink(dev)); + devl_unlock(devlink); + if (dev->state != MLX5_DEVICE_STATE_UP) + return; + mlx5_function_disable(dev, true); +} + +/* xxx_light() function are used in order to configure the device without full + * init (light init). e.g.: There isn't a point in reload a device to light state. 
+ * Hence, mlx5_load_one_light() isn't needed. + */ + +void mlx5_unload_one_light(struct mlx5_core_dev *dev) +{ + if (dev->state != MLX5_DEVICE_STATE_UP) + return; + mlx5_function_disable(dev, false); + dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR; +} + static const int types[] = { MLX5_CAP_GENERAL, MLX5_CAP_GENERAL_2, diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h index 7a5f04082058..464c6885a01c 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h @@ -240,11 +240,14 @@ int mlx5_attach_device(struct mlx5_core_dev *dev); void mlx5_detach_device(struct mlx5_core_dev *dev, bool suspend); int mlx5_register_device(struct mlx5_core_dev *dev); void mlx5_unregister_device(struct mlx5_core_dev *dev); +void mlx5_dev_set_lightweight(struct mlx5_core_dev *dev); +bool mlx5_dev_is_lightweight(struct mlx5_core_dev *dev); struct mlx5_core_dev *mlx5_get_next_phys_dev_lag(struct mlx5_core_dev *dev); void mlx5_dev_list_lock(void); void mlx5_dev_list_unlock(void); int mlx5_dev_list_trylock(void); +void mlx5_fw_reporters_create(struct mlx5_core_dev *dev); int mlx5_query_mtpps(struct mlx5_core_dev *dev, u32 *mtpps, u32 mtpps_size); int mlx5_set_mtpps(struct mlx5_core_dev *mdev, u32 *mtpps, u32 mtpps_size); int mlx5_query_mtppse(struct mlx5_core_dev *mdev, u8 pin, u8 *arm, u8 *mode); @@ -319,11 +322,15 @@ static inline bool mlx5_core_is_sf(const struct mlx5_core_dev *dev) int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx); void mlx5_mdev_uninit(struct mlx5_core_dev *dev); int mlx5_init_one(struct mlx5_core_dev *dev); +int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev); void mlx5_uninit_one(struct mlx5_core_dev *dev); void mlx5_unload_one(struct mlx5_core_dev *dev, bool suspend); void mlx5_unload_one_devl_locked(struct mlx5_core_dev *dev, bool suspend); int mlx5_load_one(struct mlx5_core_dev *dev, bool recovery); int 
mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery); +int mlx5_init_one_light(struct mlx5_core_dev *dev); +void mlx5_uninit_one_light(struct mlx5_core_dev *dev); +void mlx5_unload_one_light(struct mlx5_core_dev *dev); int mlx5_vport_set_other_func_cap(struct mlx5_core_dev *dev, const void *hca_cap, u16 vport, u16 opmod); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c index 0692363cf80e..8fe82f1191bb 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c @@ -3,6 +3,7 @@ #include #include +#include #include "mlx5_core.h" #include "dev.h" #include "devlink.h" @@ -28,6 +29,10 @@ static int mlx5_sf_dev_probe(struct auxiliary_device *adev, const struct auxilia mdev->priv.adev_idx = adev->id; sf_dev->mdev = mdev; + /* Only local SFs do light probe */ + if (MLX5_ESWITCH_MANAGER(sf_dev->parent_mdev)) + mlx5_dev_set_lightweight(mdev); + err = mlx5_mdev_init(mdev, MLX5_SF_PROF); if (err) { mlx5_core_warn(mdev, "mlx5_mdev_init on err=%d\n", err); @@ -41,7 +46,10 @@ static int mlx5_sf_dev_probe(struct auxiliary_device *adev, const struct auxilia goto remap_err; } - err = mlx5_init_one(mdev); + if (MLX5_ESWITCH_MANAGER(sf_dev->parent_mdev)) + err = mlx5_init_one_light(mdev); + else + err = mlx5_init_one(mdev); if (err) { mlx5_core_warn(mdev, "mlx5_init_one err=%d\n", err); goto init_one_err; @@ -65,7 +73,10 @@ static void mlx5_sf_dev_remove(struct auxiliary_device *adev) mlx5_drain_health_wq(sf_dev->mdev); devlink_unregister(devlink); - mlx5_uninit_one(sf_dev->mdev); + if (mlx5_dev_is_lightweight(sf_dev->mdev)) + mlx5_uninit_one_light(sf_dev->mdev); + else + mlx5_uninit_one(sf_dev->mdev); iounmap(sf_dev->mdev->iseg); mlx5_mdev_uninit(sf_dev->mdev); mlx5_devlink_free(devlink); -- cgit v1.2.3 From cbb1ca6d5f9a5a4972c4466a4b61e5bed1f4690f Mon Sep 17 00:00:00 2001 From: Radhey Shyam Pandey Date: Thu, 8 Jun 2023 
13:54:58 +0530 Subject: dt-bindings: net: xlnx,axi-ethernet: convert bindings document to yaml Convert the bindings document for Xilinx AXI Ethernet Subsystem from txt to yaml. No changes to existing binding description. Signed-off-by: Radhey Shyam Pandey Signed-off-by: Sarath Babu Naidu Gaddam Signed-off-by: David S. Miller --- .../devicetree/bindings/net/xilinx_axienet.txt | 101 ------------ .../devicetree/bindings/net/xlnx,axi-ethernet.yaml | 183 +++++++++++++++++++++ MAINTAINERS | 1 + 3 files changed, 184 insertions(+), 101 deletions(-) delete mode 100644 Documentation/devicetree/bindings/net/xilinx_axienet.txt create mode 100644 Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/xilinx_axienet.txt b/Documentation/devicetree/bindings/net/xilinx_axienet.txt deleted file mode 100644 index 80e505a2fda1..000000000000 --- a/Documentation/devicetree/bindings/net/xilinx_axienet.txt +++ /dev/null @@ -1,101 +0,0 @@ -XILINX AXI ETHERNET Device Tree Bindings --------------------------------------------------------- - -Also called AXI 1G/2.5G Ethernet Subsystem, the xilinx axi ethernet IP core -provides connectivity to an external ethernet PHY supporting different -interfaces: MII, GMII, RGMII, SGMII, 1000BaseX. It also includes two -segments of memory for buffering TX and RX, as well as the capability of -offloading TX/RX checksum calculation off the processor. - -Management configuration is done through the AXI interface, while payload is -sent and received through means of an AXI DMA controller. This driver -includes the DMA driver code, so this driver is incompatible with AXI DMA -driver. - -For more details about mdio please refer phy.txt file in the same directory. 
- -Required properties: -- compatible : Must be one of "xlnx,axi-ethernet-1.00.a", - "xlnx,axi-ethernet-1.01.a", "xlnx,axi-ethernet-2.01.a" -- reg : Address and length of the IO space, as well as the address - and length of the AXI DMA controller IO space, unless - axistream-connected is specified, in which case the reg - attribute of the node referenced by it is used. -- interrupts : Should be a list of 2 or 3 interrupts: TX DMA, RX DMA, - and optionally Ethernet core. If axistream-connected is - specified, the TX/RX DMA interrupts should be on that node - instead, and only the Ethernet core interrupt is optionally - specified here. -- phy-handle : Should point to the external phy device if exists. Pointing - this to the PCS/PMA PHY is deprecated and should be avoided. - See ethernet.txt file in the same directory. -- xlnx,rxmem : Set to allocated memory buffer for Rx/Tx in the hardware - -Optional properties: -- phy-mode : See ethernet.txt -- xlnx,phy-type : Deprecated, do not use, but still accepted in preference - to phy-mode. -- xlnx,txcsum : 0 or empty for disabling TX checksum offload, - 1 to enable partial TX checksum offload, - 2 to enable full TX checksum offload -- xlnx,rxcsum : Same values as xlnx,txcsum but for RX checksum offload -- xlnx,switch-x-sgmii : Boolean to indicate the Ethernet core is configured to - support both 1000BaseX and SGMII modes. If set, the phy-mode - should be set to match the mode selected on core reset (i.e. - by the basex_or_sgmii core input line). -- clock-names: Tuple listing input clock names. Possible clocks: - s_axi_lite_clk: Clock for AXI register slave interface - axis_clk: AXI4-Stream clock for TXD RXD TXC and RXS interfaces - ref_clk: Ethernet reference clock, used by signal delay - primitives and transceivers - mgt_clk: MGT reference clock (used by optional internal - PCS/PMA PHY) - - Note that if s_axi_lite_clk is not specified by name, the - first clock of any name is used for this. 
If that is also not - specified, the clock rate is auto-detected from the CPU clock - (but only on platforms where this is possible). New device - trees should specify all applicable clocks by name - the - fallbacks to an unnamed clock or to CPU clock are only for - backward compatibility. -- clocks: Phandles to input clocks matching clock-names. Refer to common - clock bindings. -- axistream-connected: Reference to another node which contains the resources - for the AXI DMA controller used by this device. - If this is specified, the DMA-related resources from that - device (DMA registers and DMA TX/RX interrupts) rather - than this one will be used. - - mdio : Child node for MDIO bus. Must be defined if PHY access is - required through the core's MDIO interface (i.e. always, - unless the PHY is accessed through a different bus). - Non-standard MDIO bus frequency is supported via - "clock-frequency", see mdio.yaml. - - - pcs-handle: Phandle to the internal PCS/PMA PHY in SGMII or 1000Base-X - modes, where "pcs-handle" should be used to point - to the PCS/PMA PHY, and "phy-handle" should point to an - external PHY if exists. 
- -Example: - axi_ethernet_eth: ethernet@40c00000 { - compatible = "xlnx,axi-ethernet-1.00.a"; - device_type = "network"; - interrupt-parent = <&microblaze_0_axi_intc>; - interrupts = <2 0 1>; - clock-names = "s_axi_lite_clk", "axis_clk", "ref_clk", "mgt_clk"; - clocks = <&axi_clk>, <&axi_clk>, <&pl_enet_ref_clk>, <&mgt_clk>; - phy-mode = "mii"; - reg = <0x40c00000 0x40000 0x50c00000 0x40000>; - xlnx,rxcsum = <0x2>; - xlnx,rxmem = <0x800>; - xlnx,txcsum = <0x2>; - phy-handle = <&phy0>; - axi_ethernetlite_0_mdio: mdio { - #address-cells = <1>; - #size-cells = <0>; - phy0: phy@0 { - device_type = "ethernet-phy"; - reg = <1>; - }; - }; - }; diff --git a/Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml b/Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml new file mode 100644 index 000000000000..1d33d80af11c --- /dev/null +++ b/Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml @@ -0,0 +1,183 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/xlnx,axi-ethernet.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: AXI 1G/2.5G Ethernet Subsystem + +description: | + Also called AXI 1G/2.5G Ethernet Subsystem, the xilinx axi ethernet IP core + provides connectivity to an external ethernet PHY supporting different + interfaces: MII, GMII, RGMII, SGMII, 1000BaseX. It also includes two + segments of memory for buffering TX and RX, as well as the capability of + offloading TX/RX checksum calculation off the processor. + + Management configuration is done through the AXI interface, while payload is + sent and received through means of an AXI DMA controller. This driver + includes the DMA driver code, so this driver is incompatible with AXI DMA + driver.
+ +maintainers: + - Radhey Shyam Pandey + +properties: + compatible: + enum: + - xlnx,axi-ethernet-1.00.a + - xlnx,axi-ethernet-1.01.a + - xlnx,axi-ethernet-2.01.a + + reg: + description: + Address and length of the IO space, as well as the address + and length of the AXI DMA controller IO space, unless + axistream-connected is specified, in which case the reg + attribute of the node referenced by it is used. + maxItems: 2 + + interrupts: + items: + - description: Ethernet core interrupt + - description: Tx DMA interrupt + - description: Rx DMA interrupt + description: + Ethernet core interrupt is optional. If axistream-connected property is + present DMA node should contains TX/RX DMA interrupts else DMA interrupt + resources are mentioned on ethernet node. + minItems: 1 + + phy-handle: true + + xlnx,rxmem: + description: + Set to allocated memory buffer for Rx/Tx in the hardware. + $ref: /schemas/types.yaml#/definitions/uint32 + + phy-mode: + enum: + - mii + - gmii + - rgmii + - sgmii + - 1000BaseX + + xlnx,phy-type: + description: + Do not use, but still accepted in preference to phy-mode. + deprecated: true + $ref: /schemas/types.yaml#/definitions/uint32 + + xlnx,txcsum: + description: + TX checksum offload. 0 or empty for disabling TX checksum offload, + 1 to enable partial TX checksum offload and 2 to enable full TX + checksum offload. + $ref: /schemas/types.yaml#/definitions/uint32 + enum: [0, 1, 2] + + xlnx,rxcsum: + description: + RX checksum offload. 0 or empty for disabling RX checksum offload, + 1 to enable partial RX checksum offload and 2 to enable full RX + checksum offload. + $ref: /schemas/types.yaml#/definitions/uint32 + enum: [0, 1, 2] + + xlnx,switch-x-sgmii: + type: boolean + description: + Indicate the Ethernet core is configured to support both 1000BaseX and + SGMII modes. If set, the phy-mode should be set to match the mode + selected on core reset (i.e. by the basex_or_sgmii core input line). 
+ + clocks: + items: + - description: Clock for AXI register slave interface. + - description: AXI4-Stream clock for TXD RXD TXC and RXS interfaces. + - description: Ethernet reference clock, used by signal delay primitives + and transceivers. + - description: MGT reference clock (used by optional internal PCS/PMA PHY) + + clock-names: + items: + - const: s_axi_lite_clk + - const: axis_clk + - const: ref_clk + - const: mgt_clk + + axistream-connected: + $ref: /schemas/types.yaml#/definitions/phandle + description: Phandle of AXI DMA controller which contains the resources + used by this device. If this is specified, the DMA-related resources + from that device (DMA registers and DMA TX/RX interrupts) rather than + this one will be used. + + mdio: + type: object + + pcs-handle: + description: Phandle to the internal PCS/PMA PHY in SGMII or 1000Base-X + modes, where "pcs-handle" should be used to point to the PCS/PMA PHY, + and "phy-handle" should point to an external PHY if exists. + maxItems: 1 + +required: + - compatible + - interrupts + - reg + - xlnx,rxmem + - phy-handle + +allOf: + - $ref: /schemas/net/ethernet-controller.yaml# + +additionalProperties: false + +examples: + - | + axi_ethernet_eth: ethernet@40c00000 { + compatible = "xlnx,axi-ethernet-1.00.a"; + interrupts = <2 0 1>; + clock-names = "s_axi_lite_clk", "axis_clk", "ref_clk", "mgt_clk"; + clocks = <&axi_clk>, <&axi_clk>, <&pl_enet_ref_clk>, <&mgt_clk>; + phy-mode = "mii"; + reg = <0x40c00000 0x40000>,<0x50c00000 0x40000>; + xlnx,rxcsum = <0x2>; + xlnx,rxmem = <0x800>; + xlnx,txcsum = <0x2>; + phy-handle = <&phy0>; + + mdio { + #address-cells = <1>; + #size-cells = <0>; + phy0: ethernet-phy@1 { + device_type = "ethernet-phy"; + reg = <1>; + }; + }; + }; + + - | + axi_ethernet_eth1: ethernet@40000000 { + compatible = "xlnx,axi-ethernet-1.00.a"; + interrupts = <0>; + clock-names = "s_axi_lite_clk", "axis_clk", "ref_clk", "mgt_clk"; + clocks = <&axi_clk>, <&axi_clk>, <&pl_enet_ref_clk>, <&mgt_clk>; + 
phy-mode = "mii"; + reg = <0x00 0x40000000 0x00 0x40000>; + xlnx,rxcsum = <0x2>; + xlnx,rxmem = <0x800>; + xlnx,txcsum = <0x2>; + phy-handle = <&phy1>; + axistream-connected = <&dma>; + + mdio { + #address-cells = <1>; + #size-cells = <0>; + phy1: ethernet-phy@1 { + device_type = "ethernet-phy"; + reg = <1>; + }; + }; + }; diff --git a/MAINTAINERS b/MAINTAINERS index 0971854323a7..ecac69046d30 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -23144,6 +23144,7 @@ F: drivers/iio/adc/xilinx-ams.c XILINX AXI ETHERNET DRIVER M: Radhey Shyam Pandey S: Maintained +F: Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml F: drivers/net/ethernet/xilinx/xilinx_axienet* XILINX CAN DRIVER -- cgit v1.2.3 From 61ab5a060a57d7aa77cfbc860c84a25d7d1ac3cf Mon Sep 17 00:00:00 2001 From: Krzysztof Kozlowski Date: Fri, 9 Jun 2023 16:07:12 +0200 Subject: dt-bindings: net: drop unneeded quotes Cleanup bindings dropping unneeded quotes. Once all these are fixed, checking for this can be enabled in yamllint. Signed-off-by: Krzysztof Kozlowski Acked-by: Jernej Skrabec Signed-off-by: David S. 
Miller --- Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml | 2 +- Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml | 2 +- Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml | 2 +- Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml | 2 +- Documentation/devicetree/bindings/net/intel,dwmac-plat.yaml | 2 +- Documentation/devicetree/bindings/net/mediatek-dwmac.yaml | 2 +- Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml | 2 +- Documentation/devicetree/bindings/net/rockchip-dwmac.yaml | 2 +- Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml | 4 ++-- Documentation/devicetree/bindings/net/toshiba,visconti-dwmac.yaml | 2 +- 10 files changed, 11 insertions(+), 11 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml b/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml index 3bd912ed7c7e..23e92be33ac8 100644 --- a/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml +++ b/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml @@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Allwinner A20 GMAC allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# maintainers: - Chen-Yu Tsai diff --git a/Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml b/Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml index 47bc2057e629..4bfac9186886 100644 --- a/Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml +++ b/Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml @@ -63,7 +63,7 @@ required: - syscon allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# - if: properties: compatible: diff --git a/Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml b/Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml index a2c51a84efa5..ee7a65b528cd 100644 --- 
a/Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml +++ b/Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml @@ -27,7 +27,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# - if: properties: compatible: diff --git a/Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml b/Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml index 0e5e5db32faf..7c90a4390531 100644 --- a/Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml +++ b/Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml @@ -55,7 +55,7 @@ properties: patternProperties: "^mdio@[0-9a-f]+$": type: object - $ref: "brcm,unimac-mdio.yaml" + $ref: brcm,unimac-mdio.yaml description: GENET internal UniMAC MDIO bus diff --git a/Documentation/devicetree/bindings/net/intel,dwmac-plat.yaml b/Documentation/devicetree/bindings/net/intel,dwmac-plat.yaml index d23fa3771210..42a0bc94312c 100644 --- a/Documentation/devicetree/bindings/net/intel,dwmac-plat.yaml +++ b/Documentation/devicetree/bindings/net/intel,dwmac-plat.yaml @@ -19,7 +19,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# properties: compatible: diff --git a/Documentation/devicetree/bindings/net/mediatek-dwmac.yaml b/Documentation/devicetree/bindings/net/mediatek-dwmac.yaml index 0fa2132fa4f4..08d74ca0769c 100644 --- a/Documentation/devicetree/bindings/net/mediatek-dwmac.yaml +++ b/Documentation/devicetree/bindings/net/mediatek-dwmac.yaml @@ -25,7 +25,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# properties: compatible: diff --git a/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml b/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml index 63409cbff5ad..4c01cae7c93a 100644 --- a/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml +++ b/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml @@ -24,7 +24,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: 
snps,dwmac.yaml# properties: compatible: diff --git a/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml b/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml index 2a21bbe02892..176ea5f90251 100644 --- a/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml +++ b/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml @@ -32,7 +32,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# properties: compatible: diff --git a/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml b/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml index 395a4650e285..c9c25132d154 100644 --- a/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml +++ b/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml @@ -168,14 +168,14 @@ properties: patternProperties: "^mdio@[0-9a-f]+$": type: object - $ref: "ti,davinci-mdio.yaml#" + $ref: ti,davinci-mdio.yaml# description: CPSW MDIO bus. "^cpts@[0-9a-f]+": type: object - $ref: "ti,k3-am654-cpts.yaml#" + $ref: ti,k3-am654-cpts.yaml# description: CPSW Common Platform Time Sync (CPTS) module. 
diff --git a/Documentation/devicetree/bindings/net/toshiba,visconti-dwmac.yaml b/Documentation/devicetree/bindings/net/toshiba,visconti-dwmac.yaml index 474fa8bcf302..052f636158b3 100644 --- a/Documentation/devicetree/bindings/net/toshiba,visconti-dwmac.yaml +++ b/Documentation/devicetree/bindings/net/toshiba,visconti-dwmac.yaml @@ -19,7 +19,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# properties: compatible: -- cgit v1.2.3 From ed2042cc77f1cef4850a891dc93d80fb1aa6c955 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Fri, 9 Jun 2023 14:43:37 -0700 Subject: netlink: specs: support setting prefix-name per attribute Ethtool's PSE PoDL has a attr nest with different prefixes: /* Power Sourcing Equipment */ enum { ETHTOOL_A_PSE_UNSPEC, ETHTOOL_A_PSE_HEADER, /* nest - _A_HEADER_* */ ETHTOOL_A_PODL_PSE_ADMIN_STATE, /* u32 */ ETHTOOL_A_PODL_PSE_ADMIN_CONTROL, /* u32 */ ETHTOOL_A_PODL_PSE_PW_D_STATUS, /* u32 */ Header has a prefix of ETHTOOL_A_PSE_ and other attrs prefix of ETHTOOL_A_PODL_PSE_ we can't cover them uniformly. If PODL was after PSE life would be easy. Now we either need to add prefixes to attr names which is yucky or support setting prefix name per attr. Signed-off-by: Jakub Kicinski Signed-off-by: David S. Miller --- Documentation/netlink/genetlink-c.yaml | 4 ++++ Documentation/netlink/genetlink-legacy.yaml | 4 ++++ tools/net/ynl/ynl-gen-c.py | 7 +++++-- 3 files changed, 13 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/netlink/genetlink-c.yaml b/Documentation/netlink/genetlink-c.yaml index 8e8c17b0a6c6..0519c257ecf4 100644 --- a/Documentation/netlink/genetlink-c.yaml +++ b/Documentation/netlink/genetlink-c.yaml @@ -195,6 +195,10 @@ properties: description: Max length for a string or a binary attribute. 
$ref: '#/$defs/len-or-define' sub-type: *attr-type + # Start genetlink-c + name-prefix: + type: string + # End genetlink-c # Make sure name-prefix does not appear in subsets (subsets inherit naming) dependencies: diff --git a/Documentation/netlink/genetlink-legacy.yaml b/Documentation/netlink/genetlink-legacy.yaml index ac4350498f5e..b474889b49ff 100644 --- a/Documentation/netlink/genetlink-legacy.yaml +++ b/Documentation/netlink/genetlink-legacy.yaml @@ -226,6 +226,10 @@ properties: description: Max length for a string or a binary attribute. $ref: '#/$defs/len-or-define' sub-type: *attr-type + # Start genetlink-c + name-prefix: + type: string + # End genetlink-c # Start genetlink-legacy struct: description: Name of the struct type used for the attribute. diff --git a/tools/net/ynl/ynl-gen-c.py b/tools/net/ynl/ynl-gen-c.py index 89d9471e9c2b..05b49aa459a7 100755 --- a/tools/net/ynl/ynl-gen-c.py +++ b/tools/net/ynl/ynl-gen-c.py @@ -58,8 +58,11 @@ class Type(SpecAttr): delattr(self, "enum_name") def resolve(self): - self.enum_name = f"{self.attr_set.name_prefix}{self.name}" - self.enum_name = c_upper(self.enum_name) + if 'name-prefix' in self.attr: + enum_name = f"{self.attr['name-prefix']}{self.name}" + else: + enum_name = f"{self.attr_set.name_prefix}{self.name}" + self.enum_name = c_upper(enum_name) def is_multi_val(self): return None -- cgit v1.2.3 From d4813b11d679c80d4c3e20d27dafcd6d3317a69c Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Fri, 9 Jun 2023 14:43:38 -0700 Subject: netlink: specs: ethtool: add C render hints Most of the C enum names are guessed correctly, but there is a handful of corner cases we need to name explicitly. Signed-off-by: Jakub Kicinski Signed-off-by: David S. 
Miller
---
 Documentation/netlink/specs/ethtool.yaml | 5 +++++
 1 file changed, 5 insertions(+)

(limited to 'Documentation')

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 4846345bade4..b0e4147d0890 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -9,6 +9,7 @@ doc: Partial family for Ethtool Netlink.
 definitions:
   -
     name: udp-tunnel-type
+    enum-name:
     type: enum
     entries: [ vxlan, geneve, vxlan-gpe ]
 
@@ -836,12 +837,15 @@ attribute-sets:
       -
         name: admin-state
         type: u32
+        name-prefix: ethtool-a-podl-pse-
       -
         name: admin-control
         type: u32
+        name-prefix: ethtool-a-podl-pse-
       -
         name: pw-d-status
         type: u32
+        name-prefix: ethtool-a-podl-pse-
   -
     name: rss
     attributes:
@@ -895,6 +899,7 @@ attribute-sets:
 
 operations:
   enum-model: directional
+  name-prefix: ethtool-msg-
   list:
     -
       name: strset-get
-- 
cgit v1.2.3

From 180ad455273a7d3ba95ec21d28c1fee6766f166d Mon Sep 17 00:00:00 2001
From: Jakub Kicinski
Date: Fri, 9 Jun 2023 14:43:41 -0700
Subject: netlink: specs: ethtool: add empty enum stringset

C does not allow defining structures and enums with the same name.
Since enum ethtool_stringset exists in the uAPI we need to include
at least a stub of it in the spec. This will trigger name collision
avoidance in the code gen.

Signed-off-by: Jakub Kicinski
Signed-off-by: David S.
Miller
---
 Documentation/netlink/specs/ethtool.yaml | 4 ++++
 1 file changed, 4 insertions(+)

(limited to 'Documentation')

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index b0e4147d0890..d674731629c4 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -12,6 +12,10 @@ definitions:
     enum-name:
     type: enum
     entries: [ vxlan, geneve, vxlan-gpe ]
+  -
+    name: stringset
+    type: enum
+    entries: []
 
 attribute-sets:
   -
-- 
cgit v1.2.3

From 37c852222712e1968da858961709a179150acd41 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski
Date: Fri, 9 Jun 2023 14:43:42 -0700
Subject: netlink: specs: ethtool: untangle UDP tunnels and cable test a bit

UDP tunnel and cable test messages have a lot of nests, which do not
match the names of the enum entries in C uAPI. Some of the structure /
nesting also looks wrong.

Untangle this a little bit based on the names, comments and educated
guesses, I haven't actually tested the results.

Signed-off-by: Jakub Kicinski
Signed-off-by: David S.
Miller --- Documentation/netlink/specs/ethtool.yaml | 82 ++++++++++++++++++++++++-------- 1 file changed, 62 insertions(+), 20 deletions(-) (limited to 'Documentation') diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml index d674731629c4..17b7b5028e2b 100644 --- a/Documentation/netlink/specs/ethtool.yaml +++ b/Documentation/netlink/specs/ethtool.yaml @@ -582,7 +582,7 @@ attribute-sets: name: phc-index type: u32 - - name: cable-test-ntf-nest-result + name: cable-result attributes: - name: pair @@ -591,7 +591,7 @@ attribute-sets: name: code type: u8 - - name: cable-test-ntf-nest-fault-length + name: cable-fault-length attributes: - name: pair @@ -600,18 +600,25 @@ attribute-sets: name: cm type: u32 - - name: cable-test-ntf-nest + name: cable-nest attributes: - name: result type: nest - nested-attributes: cable-test-ntf-nest-result + nested-attributes: cable-result - name: fault-length type: nest - nested-attributes: cable-test-ntf-nest-fault-length + nested-attributes: cable-fault-length - name: cable-test + attributes: + - + name: header + type: nest + nested-attributes: header + - + name: cable-test-ntf attributes: - name: header @@ -623,7 +630,7 @@ attribute-sets: - name: nest type: nest - nested-attributes: cable-test-ntf-nest + nested-attributes: cable-nest - name: cable-test-tdr-cfg attributes: @@ -637,8 +644,22 @@ attribute-sets: name: step type: u32 - - name: pari + name: pair type: u8 + - + name: cable-test-tdr-ntf + attributes: + - + name: header + type: nest + nested-attributes: header + - + name: status + type: u8 + - + name: nest + type: nest + nested-attributes: cable-nest - name: cable-test-tdr attributes: @@ -651,7 +672,7 @@ attribute-sets: type: nest nested-attributes: cable-test-tdr-cfg - - name: tunnel-info-udp-entry + name: tunnel-udp-entry attributes: - name: port @@ -662,7 +683,7 @@ attribute-sets: type: u32 enum: udp-tunnel-type - - name: tunnel-info-udp-table + name: tunnel-udp-table attributes: - 
name: size @@ -672,9 +693,17 @@ attribute-sets: type: nest nested-attributes: bitset - - name: udp-ports + name: entry type: nest - nested-attributes: tunnel-info-udp-entry + multi-attr: true + nested-attributes: tunnel-udp-entry + - + name: tunnel-udp + attributes: + - + name: table + type: nest + nested-attributes: tunnel-udp-table - name: tunnel-info attributes: @@ -685,7 +714,7 @@ attribute-sets: - name: udp-ports type: nest - nested-attributes: tunnel-info-udp-table + nested-attributes: tunnel-udp - name: fec-stat attributes: @@ -1357,10 +1386,16 @@ operations: request: attributes: - header - reply: - attributes: - - header - - cable-test-ntf-nest + - + name: cable-test-ntf + doc: Cable test notification. + + attribute-set: cable-test-ntf + + event: + attributes: + - header + - status - name: cable-test-tdr-act doc: Cable test TDR. @@ -1371,10 +1406,17 @@ operations: request: attributes: - header - reply: - attributes: - - header - - cable-test-tdr-cfg + - + name: cable-test-tdr-ntf + doc: Cable test TDR notification. + + attribute-set: cable-test-tdr-ntf + + event: + attributes: + - header + - status + - nest - name: tunnel-info-get doc: Get tsinfo params. -- cgit v1.2.3 From 709d0c3b3d4c385793fd12cc57e400c8e036e744 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Fri, 9 Jun 2023 14:43:43 -0700 Subject: netlink: specs: ethtool: untangle stats-get Code gen for stats is a bit of a challenge, but from looking at the attrs I think that the format isn't quite right. Signed-off-by: Jakub Kicinski Signed-off-by: David S. 
Miller
---
 Documentation/netlink/specs/ethtool.yaml | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 17b7b5028e2b..00c1ab04b857 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -793,16 +793,29 @@ attribute-sets:
         type: u32
       -
         name: stat
-        type: nest
-        nested-attributes: u64
+        type: u64
+        type-value: [ id ]
       -
         name: hist-rx
         type: nest
-        nested-attributes: u64
+        nested-attributes: stats-grp-hist
       -
         name: hist-tx
         type: nest
-        nested-attributes: u64
+        nested-attributes: stats-grp-hist
+      -
+        name: hist-bkt-low
+        type: u32
+      -
+        name: hist-bkt-hi
+        type: u32
+      -
+        name: hist-val
+        type: u64
+  -
+    name: stats-grp-hist
+    subset-of: stats-grp
+    attributes:
       -
         name: hist-bkt-low
         type: u32
-- 
cgit v1.2.3

From 68335713d2eaf8e3e064c584b39da45fdee6e365 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski
Date: Fri, 9 Jun 2023 14:43:44 -0700
Subject: netlink: specs: ethtool: mark pads as pads

Pad is a separate type. Even though in practice they can only be
a u32 the value should be discarded.

Signed-off-by: Jakub Kicinski
Signed-off-by: David S.
Miller
---
 Documentation/netlink/specs/ethtool.yaml | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 00c1ab04b857..837b565577ca 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -502,7 +502,7 @@ attribute-sets:
     attributes:
       -
         name: pad
-        type: u32
+        type: pad
       -
         name: tx-frames
         type: u64
@@ -720,7 +720,7 @@ attribute-sets:
     attributes:
       -
         name: pad
-        type: u8
+        type: pad
       -
         name: corrected
         type: binary
@@ -784,7 +784,7 @@ attribute-sets:
     attributes:
       -
         name: pad
-        type: u32
+        type: pad
       -
         name: id
         type: u32
@@ -830,7 +830,7 @@ attribute-sets:
     attributes:
       -
         name: pad
-        type: u32
+        type: pad
       -
         name: header
         type: nest
-- 
cgit v1.2.3

From 25085b4e9251c77758964a8e8651338972353642 Mon Sep 17 00:00:00 2001
From: David Vernet
Date: Fri, 9 Jun 2023 22:50:53 -0500
Subject: bpf/docs: Update documentation for new cpumask kfuncs

We recently added the bpf_cpumask_first_and() kfunc, and changed
bpf_cpumask_any() / bpf_cpumask_any_and() to
bpf_cpumask_any_distribute() and bpf_cpumask_any_and_distribute()
respectively. This patch adds an entry for the bpf_cpumask_first_and()
kfunc, and updates the documentation for the *any* kfuncs to the new
names.

Signed-off-by: David Vernet
Acked-by: Yonghong Song
Link: https://lore.kernel.org/r/20230610035053.117605-5-void@manifault.com
Signed-off-by: Alexei Starovoitov
---
 Documentation/bpf/cpumasks.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/bpf/cpumasks.rst b/Documentation/bpf/cpumasks.rst
index 41efd8874eeb..3139c7c02e79 100644
--- a/Documentation/bpf/cpumasks.rst
+++ b/Documentation/bpf/cpumasks.rst
@@ -351,14 +351,15 @@ In addition to the above kfuncs, there is also a set of read-only kfuncs that
 can be used to query the contents of cpumasks.
 
 ..
kernel-doc:: kernel/bpf/cpumask.c - :identifiers: bpf_cpumask_first bpf_cpumask_first_zero bpf_cpumask_test_cpu + :identifiers: bpf_cpumask_first bpf_cpumask_first_zero bpf_cpumask_first_and + bpf_cpumask_test_cpu .. kernel-doc:: kernel/bpf/cpumask.c :identifiers: bpf_cpumask_equal bpf_cpumask_intersects bpf_cpumask_subset bpf_cpumask_empty bpf_cpumask_full .. kernel-doc:: kernel/bpf/cpumask.c - :identifiers: bpf_cpumask_any bpf_cpumask_any_and + :identifiers: bpf_cpumask_any_distribute bpf_cpumask_any_and_distribute ---- -- cgit v1.2.3 From 5b32c61a2dac17db3f87c65cada74c9d7f31f4fb Mon Sep 17 00:00:00 2001 From: Pranavi Somisetty Date: Mon, 12 Jun 2023 23:43:39 -0600 Subject: dt-bindings: net: cdns,macb: Add rx-watermark property watermark value is the minimum amount of packet data required to activate the forwarding process. The watermark implementation and maximum size is dependent on the device where Cadence MACB/GEM is used. Signed-off-by: Pranavi Somisetty Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/cdns,macb.yaml | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/cdns,macb.yaml b/Documentation/devicetree/bindings/net/cdns,macb.yaml index bef5e0f895be..bf8894a0257e 100644 --- a/Documentation/devicetree/bindings/net/cdns,macb.yaml +++ b/Documentation/devicetree/bindings/net/cdns,macb.yaml @@ -109,6 +109,16 @@ properties: power-domains: maxItems: 1 + cdns,rx-watermark: + $ref: /schemas/types.yaml#/definitions/uint32 + description: + When the receive partial store and forward mode is activated, + the receiver will only begin to forward the packet to the external + AHB or AXI slave when enough packet data is stored in the SRAM packet buffer. + rx-watermark corresponds to the number of SRAM buffer locations, + that need to be filled, before the forwarding process is activated. + Width of the SRAM is platform dependent, and can be 4, 8 or 16 bytes. 
+ '#address-cells': const: 1 @@ -166,6 +176,7 @@ examples: compatible = "cdns,macb"; reg = <0xfffc4000 0x4000>; interrupts = <21>; + cdns,rx-watermark = <0x44>; phy-mode = "rmii"; local-mac-address = [3a 0e 03 04 05 06]; clock-names = "pclk", "hclk", "tx_clk"; -- cgit v1.2.3 From f7d625adeb7bc6a9ec83d32d9615889969d64484 Mon Sep 17 00:00:00 2001 From: David Arinzon Date: Mon, 12 Jun 2023 12:14:48 +0000 Subject: net: ena: Add dynamic recycling mechanism for rx buffers The current implementation allocates page-sized rx buffers. As traffic may consist of different types and sizes of packets, in various cases, buffers are not fully used. This change (Dynamic RX Buffers - DRB) uses part of the allocated rx page needed for the incoming packet, and returns the rest of the unused page to be used again as an rx buffer for future packets. A threshold of 2K for unused space has been set in order to declare whether the remainder of the page can be reused again as an rx buffer. As a page may be reused, dma_sync_single_for_cpu() is added in order to sync the memory to the CPU side after it was owned by the HW. In addition, when the rx page can no longer be reused, it is being unmapped using dma_page_unmap(), which implicitly syncs and then unmaps the entire page. In case the kernel still handles the skbs pointing to the previous buffers from that rx page, it may access garbage pointers, caused by the implicit sync overwriting them. The implicit dma sync is removed by replacing dma_page_unmap() with dma_unmap_page_attrs() with DMA_ATTR_SKIP_CPU_SYNC flag. The functionality is disabled for XDP traffic to avoid handling several descriptors per packet. 
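
The reuse decision described above can be modeled in a few lines. This is a minimal sketch in Python, not the driver code: it mirrors the arithmetic of ena_try_rx_buf_page_reuse() from this patch, and the names RxBuffer, try_reuse, CACHE_LINE and TAILROOM are hypothetical. The 64-byte cache line and the 320-byte tailroom are assumed values; the real ones depend on the architecture and kernel configuration.

```python
from dataclasses import dataclass

ENA_MIN_RX_BUF_SIZE = 2048  # reuse threshold introduced by the patch
CACHE_LINE = 64             # assumption: stands in for SMP_CACHE_BYTES
TAILROOM = 320              # assumption: SKB_DATA_ALIGN(sizeof(struct skb_shared_info))


def skb_data_align(n: int) -> int:
    """Round n up to a multiple of the (assumed) cache line size."""
    return (n + CACHE_LINE - 1) & ~(CACHE_LINE - 1)


@dataclass
class RxBuffer:
    """Hypothetical stand-in for the fields of struct ena_rx_buffer used here."""
    page_offset: int    # start of the current buffer within the page
    len_left: int       # mirrors ena_buf->len, the space still offered to HW
    page_refs: int = 1  # mirrors the page reference count


def try_reuse(buf: RxBuffer, pkt_len: int, buf_offset: int = 0,
              headroom: int = 0) -> bool:
    """Return True and advance the buffer if the rest of the page is reusable."""
    pkt_offset = buf_offset - headroom
    # Space consumed by this packet: data + offset + tailroom, cache aligned.
    buf_len = skb_data_align(pkt_len + buf_offset + TAILROOM)
    # Reuse only if at least ENA_MIN_RX_BUF_SIZE bytes remain after the data.
    if skb_data_align(pkt_len + pkt_offset) + ENA_MIN_RX_BUF_SIZE <= buf.len_left:
        buf.page_refs += 1
        buf.page_offset += buf_len
        buf.len_left -= buf_len
        return True
    return False
```

With a 4KB page and no headroom, a 300-byte packet leaves enough room for reuse, while a following 1500-byte packet does not, so in the driver the page would then be unmapped and handed to the stack. Because the sketch accounts for alignment and tailroom as the patched function does, its remaining-buffer sizes are smaller than the simplified 3796-byte figure used in the documentation example.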
Signed-off-by: Arthur Kiyanovski Signed-off-by: Shay Agroskin Signed-off-by: David Arinzon Link: https://lore.kernel.org/r/20230612121448.28829-1-darinzon@amazon.com Signed-off-by: Jakub Kicinski --- .../device_drivers/ethernet/amazon/ena.rst | 32 +++++ drivers/net/ethernet/amazon/ena/ena_admin_defs.h | 6 +- drivers/net/ethernet/amazon/ena/ena_netdev.c | 136 ++++++++++++++------- drivers/net/ethernet/amazon/ena/ena_netdev.h | 4 + 4 files changed, 136 insertions(+), 42 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst index 8bcb173e0353..491492677632 100644 --- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst +++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst @@ -205,6 +205,7 @@ Adaptive coalescing can be switched on/off through `ethtool(8)`'s More information about Adaptive Interrupt Moderation (DIM) can be found in Documentation/networking/net_dim.rst +.. _`RX copybreak`: RX copybreak ============ The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK @@ -315,3 +316,34 @@ Rx - The new SKB is updated with the necessary information (protocol, checksum hw verify result, etc), and then passed to the network stack, using the NAPI interface function :code:`napi_gro_receive()`. + +Dynamic RX Buffers (DRB) +------------------------ + +Each RX descriptor in the RX ring is a single memory page (which is either 4KB +or 16KB long depending on system's configurations). +To reduce the memory allocations required when dealing with a high rate of small +packets, the driver tries to reuse the remaining RX descriptor's space if more +than 2KB of this page remain unused. + +A simple example of this mechanism is the following sequence of events: + +:: + + 1. Driver allocates page-sized RX buffer and passes it to hardware + +----------------------+ + |4KB RX Buffer | + +----------------------+ + + 2. 
A 300Bytes packet is received on this buffer + + 3. The driver increases the ref count on this page and returns it back to + HW as an RX buffer of size 4KB - 300Bytes = 3796 Bytes + +----+--------------------+ + |****|3796 Bytes RX Buffer| + +----+--------------------+ + +This mechanism isn't used when an XDP program is loaded, or when the +RX packet is less than rx_copybreak bytes (in which case the packet is +copied out of the RX buffer into the linear part of a new skb allocated +for it and the RX buffer remains the same size, see `RX copybreak`_). diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h index 466ad9470d1f..6de0d590be34 100644 --- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h +++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h @@ -869,7 +869,9 @@ struct ena_admin_host_info { * 2 : interrupt_moderation * 3 : rx_buf_mirroring * 4 : rss_configurable_function_key - * 31:5 : reserved + * 5 : reserved + * 6 : rx_page_reuse + * 31:7 : reserved */ u32 driver_supported_features; }; @@ -1184,6 +1186,8 @@ struct ena_admin_ena_mmio_req_read_less_resp { #define ENA_ADMIN_HOST_INFO_RX_BUF_MIRRORING_MASK BIT(3) #define ENA_ADMIN_HOST_INFO_RSS_CONFIGURABLE_FUNCTION_KEY_SHIFT 4 #define ENA_ADMIN_HOST_INFO_RSS_CONFIGURABLE_FUNCTION_KEY_MASK BIT(4) +#define ENA_ADMIN_HOST_INFO_RX_PAGE_REUSE_SHIFT 6 +#define ENA_ADMIN_HOST_INFO_RX_PAGE_REUSE_MASK BIT(6) /* aenq_common_desc */ #define ENA_ADMIN_AENQ_COMMON_DESC_PHASE_MASK BIT(0) diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index e6a6efaeb87c..d19593fae226 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -1023,7 +1023,7 @@ static int ena_alloc_rx_buffer(struct ena_ring *rx_ring, int tailroom; /* restore page offset value in case it has been changed by device */ - rx_info->page_offset = headroom; + rx_info->buf_offset = headroom; /* 
if previous allocated page is not used */ if (unlikely(rx_info->page)) @@ -1040,6 +1040,8 @@ static int ena_alloc_rx_buffer(struct ena_ring *rx_ring, tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); rx_info->page = page; + rx_info->dma_addr = dma; + rx_info->page_offset = 0; ena_buf = &rx_info->ena_buf; ena_buf->paddr = dma + headroom; ena_buf->len = ENA_PAGE_SIZE - headroom - tailroom; @@ -1047,14 +1049,12 @@ static int ena_alloc_rx_buffer(struct ena_ring *rx_ring, return 0; } -static void ena_unmap_rx_buff(struct ena_ring *rx_ring, - struct ena_rx_buffer *rx_info) +static void ena_unmap_rx_buff_attrs(struct ena_ring *rx_ring, + struct ena_rx_buffer *rx_info, + unsigned long attrs) { - struct ena_com_buf *ena_buf = &rx_info->ena_buf; - - dma_unmap_page(rx_ring->dev, ena_buf->paddr - rx_ring->rx_headroom, - ENA_PAGE_SIZE, - DMA_BIDIRECTIONAL); + dma_unmap_page_attrs(rx_ring->dev, rx_info->dma_addr, ENA_PAGE_SIZE, + DMA_BIDIRECTIONAL, attrs); } static void ena_free_rx_page(struct ena_ring *rx_ring, @@ -1068,7 +1068,7 @@ static void ena_free_rx_page(struct ena_ring *rx_ring, return; } - ena_unmap_rx_buff(rx_ring, rx_info); + ena_unmap_rx_buff_attrs(rx_ring, rx_info, 0); __free_page(page); rx_info->page = NULL; @@ -1406,14 +1406,14 @@ static int ena_clean_tx_irq(struct ena_ring *tx_ring, u32 budget) return tx_pkts; } -static struct sk_buff *ena_alloc_skb(struct ena_ring *rx_ring, void *first_frag) +static struct sk_buff *ena_alloc_skb(struct ena_ring *rx_ring, void *first_frag, u16 len) { struct sk_buff *skb; if (!first_frag) - skb = napi_alloc_skb(rx_ring->napi, rx_ring->rx_copybreak); + skb = napi_alloc_skb(rx_ring->napi, len); else - skb = napi_build_skb(first_frag, ENA_PAGE_SIZE); + skb = napi_build_skb(first_frag, len); if (unlikely(!skb)) { ena_increase_stat(&rx_ring->rx_stats.skb_alloc_fail, 1, @@ -1422,24 +1422,47 @@ static struct sk_buff *ena_alloc_skb(struct ena_ring *rx_ring, void *first_frag) netif_dbg(rx_ring->adapter, rx_err, rx_ring->netdev, 
"Failed to allocate skb. first_frag %s\n", first_frag ? "provided" : "not provided"); - return NULL; } return skb; } +static bool ena_try_rx_buf_page_reuse(struct ena_rx_buffer *rx_info, u16 buf_len, + u16 len, int pkt_offset) +{ + struct ena_com_buf *ena_buf = &rx_info->ena_buf; + + /* More than ENA_MIN_RX_BUF_SIZE left in the reused buffer + * for data + headroom + tailroom. + */ + if (SKB_DATA_ALIGN(len + pkt_offset) + ENA_MIN_RX_BUF_SIZE <= ena_buf->len) { + page_ref_inc(rx_info->page); + rx_info->page_offset += buf_len; + ena_buf->paddr += buf_len; + ena_buf->len -= buf_len; + return true; + } + + return false; +} + static struct sk_buff *ena_rx_skb(struct ena_ring *rx_ring, struct ena_com_rx_buf_info *ena_bufs, u32 descs, u16 *next_to_clean) { + int tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + bool is_xdp_loaded = ena_xdp_present_ring(rx_ring); struct ena_rx_buffer *rx_info; struct ena_adapter *adapter; + int page_offset, pkt_offset; + dma_addr_t pre_reuse_paddr; u16 len, req_id, buf = 0; + bool reuse_rx_buf_page; struct sk_buff *skb; - void *page_addr; - u32 page_offset; - void *data_addr; + void *buf_addr; + int buf_offset; + u16 buf_len; len = ena_bufs[buf].len; req_id = ena_bufs[buf].req_id; @@ -1459,34 +1482,30 @@ static struct sk_buff *ena_rx_skb(struct ena_ring *rx_ring, "rx_info %p page %p\n", rx_info, rx_info->page); - /* save virt address of first buffer */ - page_addr = page_address(rx_info->page); + buf_offset = rx_info->buf_offset; + pkt_offset = buf_offset - rx_ring->rx_headroom; page_offset = rx_info->page_offset; - data_addr = page_addr + page_offset; - - prefetch(data_addr); + buf_addr = page_address(rx_info->page) + page_offset; if (len <= rx_ring->rx_copybreak) { - skb = ena_alloc_skb(rx_ring, NULL); + skb = ena_alloc_skb(rx_ring, NULL, len); if (unlikely(!skb)) return NULL; - netif_dbg(rx_ring->adapter, rx_status, rx_ring->netdev, - "RX allocated small packet. len %d. 
data_len %d\n", - skb->len, skb->data_len); - /* sync this buffer for CPU use */ dma_sync_single_for_cpu(rx_ring->dev, - dma_unmap_addr(&rx_info->ena_buf, paddr), + dma_unmap_addr(&rx_info->ena_buf, paddr) + pkt_offset, len, DMA_FROM_DEVICE); - skb_copy_to_linear_data(skb, data_addr, len); + skb_copy_to_linear_data(skb, buf_addr + buf_offset, len); dma_sync_single_for_device(rx_ring->dev, - dma_unmap_addr(&rx_info->ena_buf, paddr), + dma_unmap_addr(&rx_info->ena_buf, paddr) + pkt_offset, len, DMA_FROM_DEVICE); skb_put(skb, len); + netif_dbg(rx_ring->adapter, rx_status, rx_ring->netdev, + "RX allocated small packet. len %d.\n", skb->len); skb->protocol = eth_type_trans(skb, rx_ring->netdev); rx_ring->free_ids[*next_to_clean] = req_id; *next_to_clean = ENA_RX_RING_IDX_ADD(*next_to_clean, descs, @@ -1494,14 +1513,28 @@ static struct sk_buff *ena_rx_skb(struct ena_ring *rx_ring, return skb; } - ena_unmap_rx_buff(rx_ring, rx_info); + buf_len = SKB_DATA_ALIGN(len + buf_offset + tailroom); + + pre_reuse_paddr = dma_unmap_addr(&rx_info->ena_buf, paddr); + + /* If XDP isn't loaded try to reuse part of the RX buffer */ + reuse_rx_buf_page = !is_xdp_loaded && + ena_try_rx_buf_page_reuse(rx_info, buf_len, len, pkt_offset); - skb = ena_alloc_skb(rx_ring, page_addr); + dma_sync_single_for_cpu(rx_ring->dev, + pre_reuse_paddr + pkt_offset, + len, + DMA_FROM_DEVICE); + + if (!reuse_rx_buf_page) + ena_unmap_rx_buff_attrs(rx_ring, rx_info, DMA_ATTR_SKIP_CPU_SYNC); + + skb = ena_alloc_skb(rx_ring, buf_addr, buf_len); if (unlikely(!skb)) return NULL; /* Populate skb's linear part */ - skb_reserve(skb, page_offset); + skb_reserve(skb, buf_offset); skb_put(skb, len); skb->protocol = eth_type_trans(skb, rx_ring->netdev); @@ -1510,7 +1543,8 @@ static struct sk_buff *ena_rx_skb(struct ena_ring *rx_ring, "RX skb updated. len %d. 
data_len %d\n", skb->len, skb->data_len); - rx_info->page = NULL; + if (!reuse_rx_buf_page) + rx_info->page = NULL; rx_ring->free_ids[*next_to_clean] = req_id; *next_to_clean = @@ -1525,10 +1559,28 @@ static struct sk_buff *ena_rx_skb(struct ena_ring *rx_ring, rx_info = &rx_ring->rx_buffer_info[req_id]; - ena_unmap_rx_buff(rx_ring, rx_info); + /* rx_info->buf_offset includes rx_ring->rx_headroom */ + buf_offset = rx_info->buf_offset; + pkt_offset = buf_offset - rx_ring->rx_headroom; + buf_len = SKB_DATA_ALIGN(len + buf_offset + tailroom); + page_offset = rx_info->page_offset; + + pre_reuse_paddr = dma_unmap_addr(&rx_info->ena_buf, paddr); + + reuse_rx_buf_page = !is_xdp_loaded && + ena_try_rx_buf_page_reuse(rx_info, buf_len, len, pkt_offset); + + dma_sync_single_for_cpu(rx_ring->dev, + pre_reuse_paddr + pkt_offset, + len, + DMA_FROM_DEVICE); + + if (!reuse_rx_buf_page) + ena_unmap_rx_buff_attrs(rx_ring, rx_info, + DMA_ATTR_SKIP_CPU_SYNC); skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_info->page, - rx_info->page_offset, len, ENA_PAGE_SIZE); + page_offset + buf_offset, len, buf_len); } while (1); @@ -1626,7 +1678,7 @@ static int ena_xdp_handle_buff(struct ena_ring *rx_ring, struct xdp_buff *xdp) rx_info = &rx_ring->rx_buffer_info[rx_ring->ena_bufs[0].req_id]; xdp_prepare_buff(xdp, page_address(rx_info->page), - rx_info->page_offset, + rx_info->buf_offset, rx_ring->ena_bufs[0].len, false); /* If for some reason we received a bigger packet than * we expect, then we simply drop it @@ -1638,7 +1690,7 @@ static int ena_xdp_handle_buff(struct ena_ring *rx_ring, struct xdp_buff *xdp) /* The xdp program might expand the headers */ if (ret == ENA_XDP_PASS) { - rx_info->page_offset = xdp->data - xdp->data_hard_start; + rx_info->buf_offset = xdp->data - xdp->data_hard_start; rx_ring->ena_bufs[0].len = xdp->data_end - xdp->data; } @@ -1693,7 +1745,7 @@ static int ena_clean_rx_irq(struct ena_ring *rx_ring, struct napi_struct *napi, /* First descriptor might have an offset 
set by the device */ rx_info = &rx_ring->rx_buffer_info[rx_ring->ena_bufs[0].req_id]; - rx_info->page_offset += ena_rx_ctx.pkt_offset; + rx_info->buf_offset += ena_rx_ctx.pkt_offset; netif_dbg(rx_ring->adapter, rx_status, rx_ring->netdev, "rx_poll: q %d got packet from ena. descs #: %d l3 proto %d l4 proto %d hash: %x\n", @@ -1723,8 +1775,9 @@ static int ena_clean_rx_irq(struct ena_ring *rx_ring, struct napi_struct *napi, * from RX side. */ if (xdp_verdict & ENA_XDP_FORWARDED) { - ena_unmap_rx_buff(rx_ring, - &rx_ring->rx_buffer_info[req_id]); + ena_unmap_rx_buff_attrs(rx_ring, + &rx_ring->rx_buffer_info[req_id], + 0); rx_ring->rx_buffer_info[req_id].page = NULL; } } @@ -3233,7 +3286,8 @@ static void ena_config_host_info(struct ena_com_dev *ena_dev, struct pci_dev *pd ENA_ADMIN_HOST_INFO_RX_OFFSET_MASK | ENA_ADMIN_HOST_INFO_INTERRUPT_MODERATION_MASK | ENA_ADMIN_HOST_INFO_RX_BUF_MIRRORING_MASK | - ENA_ADMIN_HOST_INFO_RSS_CONFIGURABLE_FUNCTION_KEY_MASK; + ENA_ADMIN_HOST_INFO_RSS_CONFIGURABLE_FUNCTION_KEY_MASK | + ENA_ADMIN_HOST_INFO_RX_PAGE_REUSE_MASK; rc = ena_com_set_host_attributes(ena_dev); if (rc) { diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h index 5a0d4ee76172..248b715b4d68 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.h +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h @@ -51,6 +51,8 @@ #define ENA_DEFAULT_RING_SIZE (1024) #define ENA_MIN_RING_SIZE (256) +#define ENA_MIN_RX_BUF_SIZE (2048) + #define ENA_MIN_NUM_IO_QUEUES (1) #define ENA_TX_WAKEUP_THRESH (MAX_SKB_FRAGS + 2) @@ -175,7 +177,9 @@ struct ena_tx_buffer { struct ena_rx_buffer { struct sk_buff *skb; struct page *page; + dma_addr_t dma_addr; u32 page_offset; + u32 buf_offset; struct ena_com_buf ena_buf; } ____cacheline_aligned; -- cgit v1.2.3 From 7f6ee56ca0df0484338d12cd142fb5ebde8875a9 Mon Sep 17 00:00:00 2001 From: Christian Lamparter Date: Thu, 15 Jun 2023 14:41:53 +0300 Subject: dt-bindings: net: wireless: ath10k: add 
ieee80211-freq-limit property This is an existing optional property that ieee80211.yaml/cfg80211 provides. It's useful to further restrict supported frequencies for a specified device through device-tree. For testing the addition, I added the ieee80211-freq-limit property with values from an OpenMesh A62 device. This is because the OpenMesh A62 has "special filters in front of the RX+TX paths to the 5GHz PHYs. These filtered channel can in theory still be used by the hardware but the signal strength is reduced so much that it makes no sense." The driver supported this since ~2018 by commit 34d5629d2ca8 ("ath10k: limit available channels via DT ieee80211-freq-limit") Link: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=e3b8ae2b09e137ce2eae33551923daf302293a0c Signed-off-by: Christian Lamparter Acked-by: Krzysztof Kozlowski Signed-off-by: Kalle Valo Link: https://lore.kernel.org/r/c33c928b7c6c9bb4e7abe84eb8df9f440add275b.1686486464.git.chunkeey@gmail.com --- Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml index c85ed330426d..7758a55dd328 100644 --- a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml +++ b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml @@ -84,6 +84,8 @@ properties: required: - iommus + ieee80211-freq-limit: true + qcom,ath10k-calibration-data: $ref: /schemas/types.yaml#/definitions/uint8-array description: @@ -164,6 +166,7 @@ required: additionalProperties: false allOf: + - $ref: ieee80211.yaml# - if: properties: compatible: @@ -355,4 +358,5 @@ examples: "msi14", "msi15", "legacy"; + ieee80211-freq-limit = <5470000 5875000>; }; -- cgit v1.2.3 From c8013a1f714f6d9f2d8d673177a824c6b9653218 Mon Sep 17 00:00:00 2001 From: Or Har-Toov Date: Thu, 23 Mar 2023 18:11:50 +0200 Subject: 
net/mlx5e: Add local loopback counter to vport stats Add counter for number of unicast, multicast and broadcast packets/ octets that were loopback. Signed-off-by: Or Har-Toov Reviewed-by: Avihai Horon Reviewed-by: Leon Romanovsky Signed-off-by: Saeed Mahameed --- .../ethernet/mellanox/mlx5/counters.rst | 10 ++++++++++ drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 23 +++++++++++++++++++++- 2 files changed, 32 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst index 6b2d1fe74ecf..a395df9c2751 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst @@ -797,6 +797,16 @@ Counters on the NIC port that is connected to a eSwitch. RoCE/UD/RC traffic) [#accel]_. - Acceleration + * - `vport_loopback_packets` + - Unicast, multicast and broadcast packets that were loop-back (received + and transmitted), IB/Eth [#accel]_. + - Acceleration + + * - `vport_loopback_bytes` + - Unicast, multicast and broadcast bytes that were loop-back (received + and transmitted), IB/Eth [#accel]_. + - Acceleration + * - `rx_steer_missed_packets` - Number of packets that was received by the NIC, however was discarded because it did not match any flow in the NIC flow table. 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c index f1d9596905c6..25a6c596300d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c @@ -748,11 +748,22 @@ static const struct counter_desc vport_stats_desc[] = { VPORT_COUNTER_OFF(transmitted_ib_multicast.octets) }, }; +static const struct counter_desc vport_loopback_stats_desc[] = { + { "vport_loopback_packets", + VPORT_COUNTER_OFF(local_loopback.packets) }, + { "vport_loopback_bytes", + VPORT_COUNTER_OFF(local_loopback.octets) }, +}; + #define NUM_VPORT_COUNTERS ARRAY_SIZE(vport_stats_desc) +#define NUM_VPORT_LOOPBACK_COUNTERS(dev) \ + (MLX5_CAP_GEN(dev, vport_counter_local_loopback) ? \ + ARRAY_SIZE(vport_loopback_stats_desc) : 0) static MLX5E_DECLARE_STATS_GRP_OP_NUM_STATS(vport) { - return NUM_VPORT_COUNTERS; + return NUM_VPORT_COUNTERS + + NUM_VPORT_LOOPBACK_COUNTERS(priv->mdev); } static MLX5E_DECLARE_STATS_GRP_OP_FILL_STRS(vport) @@ -761,6 +772,11 @@ static MLX5E_DECLARE_STATS_GRP_OP_FILL_STRS(vport) for (i = 0; i < NUM_VPORT_COUNTERS; i++) strcpy(data + (idx++) * ETH_GSTRING_LEN, vport_stats_desc[i].format); + + for (i = 0; i < NUM_VPORT_LOOPBACK_COUNTERS(priv->mdev); i++) + strcpy(data + (idx++) * ETH_GSTRING_LEN, + vport_loopback_stats_desc[i].format); + return idx; } @@ -771,6 +787,11 @@ static MLX5E_DECLARE_STATS_GRP_OP_FILL_STATS(vport) for (i = 0; i < NUM_VPORT_COUNTERS; i++) data[idx++] = MLX5E_READ_CTR64_BE(priv->stats.vport.query_vport_out, vport_stats_desc, i); + + for (i = 0; i < NUM_VPORT_LOOPBACK_COUNTERS(priv->mdev); i++) + data[idx++] = MLX5E_READ_CTR64_BE(priv->stats.vport.query_vport_out, + vport_loopback_stats_desc, i); + return idx; } -- cgit v1.2.3 From 6907217a8054b8afc47b3944afc7d77ad5caf824 Mon Sep 17 00:00:00 2001 From: Donald Hunter Date: Thu, 15 Jun 2023 16:14:05 +0100 Subject: netlink: specs: fixup openvswitch specs for code generation Refine the 
ovs_* specs to align exactly with the ovs netlink UAPI definitions to enable code generation. Signed-off-by: Donald Hunter Link: https://lore.kernel.org/r/20230615151405.77649-1-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski --- Documentation/netlink/specs/ovs_datapath.yaml | 30 ++++++++---- Documentation/netlink/specs/ovs_flow.yaml | 68 ++++++++++++++++++++++----- Documentation/netlink/specs/ovs_vport.yaml | 13 ++++- 3 files changed, 87 insertions(+), 24 deletions(-) (limited to 'Documentation') diff --git a/Documentation/netlink/specs/ovs_datapath.yaml b/Documentation/netlink/specs/ovs_datapath.yaml index 6d71db8c4416..f709c26c3e92 100644 --- a/Documentation/netlink/specs/ovs_datapath.yaml +++ b/Documentation/netlink/specs/ovs_datapath.yaml @@ -3,6 +3,7 @@ name: ovs_datapath version: 2 protocol: genetlink-legacy +uapi-header: linux/openvswitch.h doc: OVS datapath configuration over generic netlink. @@ -18,6 +19,7 @@ definitions: - name: user-features type: flags + name-prefix: ovs-dp-f- entries: - name: unaligned @@ -33,35 +35,37 @@ definitions: doc: Allow per-cpu dispatch of upcalls - name: datapath-stats + enum-name: ovs-dp-stats type: struct members: - - name: hit + name: n-hit type: u64 - - name: missed + name: n-missed type: u64 - - name: lost + name: n-lost type: u64 - - name: flows + name: n-flows type: u64 - name: megaflow-stats + enum-name: ovs-dp-megaflow-stats type: struct members: - - name: mask-hit + name: n-mask-hit type: u64 - - name: masks + name: n-masks type: u32 - name: padding type: u32 - - name: cache-hits + name: n-cache-hit type: u64 - name: pad1 @@ -70,6 +74,8 @@ definitions: attribute-sets: - name: datapath + name-prefix: ovs-dp-attr- + enum-name: ovs-datapath-attrs attributes: - name: name @@ -101,12 +107,16 @@ attribute-sets: name: per-cpu-pids type: binary sub-type: u32 + - + name: ifindex + type: u32 operations: fixed-header: ovs-header + name-prefix: ovs-dp-cmd- list: - - name: dp-get + name: get doc: Get / dump OVS data path 
configuration and state value: 3 attribute-set: datapath @@ -125,7 +135,7 @@ operations: - per-cpu-pids dump: *dp-get-op - - name: dp-new + name: new doc: Create new OVS data path value: 1 attribute-set: datapath @@ -137,7 +147,7 @@ operations: - upcall-pid - user-features - - name: dp-del + name: del doc: Delete existing OVS data path value: 2 attribute-set: datapath diff --git a/Documentation/netlink/specs/ovs_flow.yaml b/Documentation/netlink/specs/ovs_flow.yaml index 3b0624c87074..1ecbcd117385 100644 --- a/Documentation/netlink/specs/ovs_flow.yaml +++ b/Documentation/netlink/specs/ovs_flow.yaml @@ -3,6 +3,7 @@ name: ovs_flow version: 1 protocol: genetlink-legacy +uapi-header: linux/openvswitch.h doc: OVS flow configuration over generic netlink. @@ -67,6 +68,7 @@ definitions: enum: ovs-frag-type - name: ovs-frag-type + name-prefix: ovs-frag-type- type: enum entries: - @@ -166,6 +168,7 @@ definitions: doc: Tag control identifier (TCI) to push. - name: ovs-ufid-flags + name-prefix: ovs-ufid-f- type: flags entries: - omit-key @@ -176,7 +179,7 @@ definitions: type: struct members: - - name: hash-algorithm + name: hash-alg type: u32 doc: Algorithm used to compute hash prior to recirculation. 
- @@ -198,13 +201,13 @@ definitions: type: struct members: - - name: lse + name: mpls-lse type: u32 byte-order: big-endian doc: | MPLS label stack entry to push - - name: ethertype + name: mpls-ethertype type: u32 byte-order: big-endian doc: | @@ -216,13 +219,13 @@ definitions: type: struct members: - - name: lse + name: mpls-lse type: u32 byte-order: big-endian doc: | MPLS label stack entry to push - - name: ethertype + name: mpls-ethertype type: u32 byte-order: big-endian doc: | @@ -237,6 +240,7 @@ definitions: - name: ct-state-flags type: flags + name-prefix: ovs-cs-f- entries: - name: new @@ -266,6 +270,8 @@ definitions: attribute-sets: - name: flow-attrs + enum-name: ovs-flow-attr + name-prefix: ovs-flow-attr- attributes: - name: key @@ -352,6 +358,8 @@ attribute-sets: - name: key-attrs + enum-name: ovs-key-attr + name-prefix: ovs-key-attr- attributes: - name: encap @@ -481,6 +489,8 @@ attribute-sets: doc: struct ovs_key_ipv6_exthdr - name: action-attrs + enum-name: ovs-action-attr + name-prefix: ovs-action-attr- attributes: - name: output @@ -608,6 +618,8 @@ attribute-sets: nested-attributes: dec-ttl-attrs - name: tunnel-key-attrs + enum-name: ovs-tunnel-key-attr + name-prefix: ovs-tunnel-key-attr- attributes: - name: id @@ -676,6 +688,8 @@ attribute-sets: type: flag - name: check-pkt-len-attrs + enum-name: ovs-check-pkt-len-attr + name-prefix: ovs-check-pkt-len-attr- attributes: - name: pkt-len @@ -690,6 +704,8 @@ attribute-sets: nested-attributes: action-attrs - name: sample-attrs + enum-name: ovs-sample-attr + name-prefix: ovs-sample-attr- attributes: - name: probability @@ -700,6 +716,8 @@ attribute-sets: nested-attributes: action-attrs - name: userspace-attrs + enum-name: ovs-userspace-attr + name-prefix: ovs-userspace-attr- attributes: - name: pid @@ -715,6 +733,8 @@ attribute-sets: type: flag - name: ovs-nsh-key-attrs + enum-name: ovs-nsh-key-attr + name-prefix: ovs-nsh-key-attr- attributes: - name: base @@ -727,6 +747,8 @@ attribute-sets: type: binary 
- name: ct-attrs + enum-name: ovs-ct-attr + name-prefix: ovs-ct-attr- attributes: - name: commit @@ -758,13 +780,15 @@ attribute-sets: type: string - name: nat-attrs + enum-name: ovs-nat-attr + name-prefix: ovs-nat-attr- attributes: - name: src - type: binary + type: flag - name: dst - type: binary + type: flag - name: ip-min type: binary @@ -773,21 +797,23 @@ attribute-sets: type: binary - name: proto-min - type: binary + type: u16 - name: proto-max - type: binary + type: u16 - name: persistent - type: binary + type: flag - name: proto-hash - type: binary + type: flag - name: proto-random - type: binary + type: flag - name: dec-ttl-attrs + enum-name: ovs-dec-ttl-attr + name-prefix: ovs-dec-ttl-attr- attributes: - name: action @@ -795,16 +821,19 @@ attribute-sets: nested-attributes: action-attrs - name: vxlan-ext-attrs + enum-name: ovs-vxlan-ext- + name-prefix: ovs-vxlan-ext- attributes: - name: gbp type: u32 operations: + name-prefix: ovs-flow-cmd- fixed-header: ovs-header list: - - name: flow-get + name: get doc: Get / dump OVS flow configuration and state value: 3 attribute-set: flow-attrs @@ -824,6 +853,19 @@ operations: - stats - actions dump: *flow-get-op + - + name: new + doc: Create OVS flow configuration in a data path + value: 1 + attribute-set: flow-attrs + do: + request: + attributes: + - dp-ifindex + - key + - ufid + - mask + - actions mcast-groups: list: diff --git a/Documentation/netlink/specs/ovs_vport.yaml b/Documentation/netlink/specs/ovs_vport.yaml index 8e55622ddf11..17336455bec1 100644 --- a/Documentation/netlink/specs/ovs_vport.yaml +++ b/Documentation/netlink/specs/ovs_vport.yaml @@ -3,6 +3,7 @@ name: ovs_vport version: 2 protocol: genetlink-legacy +uapi-header: linux/openvswitch.h doc: OVS vport configuration over generic netlink. 
@@ -18,10 +19,13 @@ definitions: - name: vport-type type: enum + enum-name: ovs-vport-type + name-prefix: ovs-vport-type- entries: [ unspec, netdev, internal, gre, vxlan, geneve ] - name: vport-stats type: struct + enum-name: ovs-vport-stats members: - name: rx-packets @@ -51,6 +55,8 @@ definitions: attribute-sets: - name: vport-options + enum-name: ovs-vport-options + name-prefix: ovs-tunnel-attr- attributes: - name: dst-port @@ -60,6 +66,8 @@ attribute-sets: type: u32 - name: upcall-stats + enum-name: ovs-vport-upcall-attr + name-prefix: ovs-vport-upcall-attr- attributes: - name: success @@ -70,6 +78,8 @@ attribute-sets: type: u64 - name: vport + name-prefix: ovs-vport-attr- + enum-name: ovs-vport-attr attributes: - name: port-no @@ -108,9 +118,10 @@ attribute-sets: nested-attributes: upcall-stats operations: + name-prefix: ovs-vport-cmd- list: - - name: vport-get + name: get doc: Get / dump OVS vport configuration and state value: 3 attribute-set: vport -- cgit v1.2.3 From b650d953cd391595e536153ce30b4aab385643ac Mon Sep 17 00:00:00 2001 From: "mfreemon@cloudflare.com" Date: Sun, 11 Jun 2023 22:05:24 -0500 Subject: tcp: enforce receive buffer memory limits by allowing the tcp window to shrink Under certain circumstances, the tcp receive buffer memory limit set by autotuning (sk_rcvbuf) is increased due to incoming data packets as a result of the window not closing when it should be. This can result in the receive buffer growing all the way up to tcp_rmem[2], even for tcp sessions with a low BDP. To reproduce: Connect a TCP session with the receiver doing nothing and the sender sending small packets (an infinite loop of socket send() with 4 bytes of payload with a sleep of 1 ms in between each send()). This will cause the tcp receive buffer to grow all the way up to tcp_rmem[2]. 
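The sender side of the reproduction steps above can be sketched roughly as follows. This is a minimal illustration, not code from the patch: `send_small_packets` is a hypothetical helper, the socket is assumed to be connected already, and the loop is bounded by `count` rather than infinite so that it can be exercised in isolation.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/*
 * Hypothetical helper (not part of the patch): send `count` payloads of
 * 4 bytes each on an already connected socket, sleeping roughly 1 ms
 * between sends. Returns the total number of bytes sent, or -1 on the
 * first send() error.
 */
static inline long send_small_packets(int fd, size_t count)
{
	static const char payload[4] = { 'p', 'i', 'n', 'g' };
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 };
	long total = 0;

	for (size_t i = 0; i < count; i++) {
		ssize_t n = send(fd, payload, sizeof(payload), 0);

		if (n < 0)
			return -1;
		total += n;
		/* ~1 ms gap so each payload tends to be ACKed on its own */
		nanosleep(&pause, NULL);
	}

	return total;
}
```

To approximate the scenario in the commit message, the peer would accept the connection and never call recv(), and `count` would be made very large.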
As a result, a host can have individual tcp sessions with receive buffers of size tcp_rmem[2], and the host itself can reach tcp_mem limits, causing the host to go into tcp memory pressure mode. The fundamental issue is the relationship between the granularity of the window scaling factor and the number of bytes ACKed back to the sender. This problem has previously been identified in RFC 7323, appendix F [1]. The Linux kernel currently adheres to a policy of never shrinking the window. In addition to the overallocation of memory mentioned above, the current behavior is functionally incorrect, because once tcp_rmem[2] is reached and no remediations remain (i.e. tcp collapse fails to free up any more memory and there are no packets to prune from the out-of-order queue), the receiver will drop in-window packets resulting in retransmissions and an eventual timeout of the tcp session. A receive buffer full condition should instead result in a zero window and an indefinite wait. In practice, this problem is largely hidden for most flows. It is not applicable to mice flows. Elephant flows can send data fast enough to "overrun" the sk_rcvbuf limit (in a single ACK), triggering a zero window. But this problem does show up for other types of flows. Examples are websockets and other types of flows that send small amounts of data spaced apart slightly in time. In these cases, we directly encounter the problem described in [1]. RFC 7323, section 2.4 [2], says there are instances when a retracted window can be offered, and that TCP implementations MUST ensure that they handle a shrinking window, as specified in RFC 1122, section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window management have made clear that the sender must accept a shrunk window from the receiver, including RFC 793 [4] and RFC 1323 [5]. This patch implements the functionality to shrink the tcp window when necessary to keep the right edge within the memory limit set by autotuning (sk_rcvbuf).
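The granularity issue described above can be illustrated with a small user-space sketch. It is not kernel code: `align_up_pow2`, `round_down_pow2` and `offered_window` are hypothetical names, and the logic only mirrors the ALIGN() and round_down() rounding and the sysctl/scale check that this patch touches in tcp_select_window() and __tcp_select_window().

```c
#include <stdint.h>

/* Round x up to a multiple of (1 << wscale), like ALIGN() in the kernel. */
static inline uint32_t align_up_pow2(uint32_t x, unsigned int wscale)
{
	uint32_t unit = (uint32_t)1 << wscale;

	return (x + unit - 1) & ~(unit - 1);
}

/* Round x down to a multiple of (1 << wscale), like round_down(). */
static inline uint32_t round_down_pow2(uint32_t x, unsigned int wscale)
{
	uint32_t unit = (uint32_t)1 << wscale;

	return x & ~(unit - 1);
}

/*
 * Hypothetical model of the decision: cur_win is the window currently
 * offered to the peer, new_win is the window computed from free memory.
 * Shrinking is only honoured when it is enabled and a non-zero scaling
 * factor is in effect; otherwise the old window is kept and rounded up.
 */
static inline uint32_t offered_window(uint32_t cur_win, uint32_t new_win,
				      unsigned int wscale, int shrink_allowed)
{
	if (new_win < cur_win && !(shrink_allowed && wscale))
		return align_up_pow2(cur_win, wscale);

	return new_win;
}
```

With a scaling factor of 7 (128-byte granularity), a current window of 996 bytes and a memory-based window of 512 bytes, the never-shrink path offers 1024 bytes while the shrink path offers 512. Rounding 996 down to the scaling granularity gives 896.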
This new functionality is enabled with the new sysctl: net.ipv4.tcp_shrink_window Additional information can be found at: https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/ [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4 [3] https://www.rfc-editor.org/rfc/rfc1122#page-91 [4] https://www.rfc-editor.org/rfc/rfc793 [5] https://www.rfc-editor.org/rfc/rfc1323 Signed-off-by: Mike Freemon Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.rst | 15 +++++++++ include/net/netns/ipv4.h | 1 + net/ipv4/sysctl_net_ipv4.c | 9 +++++ net/ipv4/tcp_ipv4.c | 2 ++ net/ipv4/tcp_output.c | 60 +++++++++++++++++++++++++++++----- 5 files changed, 78 insertions(+), 9 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 366e2a5097d9..4a010a7cde7f 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -981,6 +981,21 @@ tcp_tw_reuse - INTEGER tcp_window_scaling - BOOLEAN Enable window scaling as defined in RFC1323. +tcp_shrink_window - BOOLEAN + This changes how the TCP receive window is calculated. + + RFC 7323, section 2.4, says there are instances when a retracted + window can be offered, and that TCP implementations MUST ensure + that they handle a shrinking window, as specified in RFC 1122. + + - 0 - Disabled. The window is never shrunk. + - 1 - Enabled. The window is shrunk when necessary to remain within + the memory limit set by autotuning (sk_rcvbuf). + This only occurs if a non-zero receive window + scaling factor is also in effect. + + Default: 0 + tcp_wmem - vector of 3 INTEGERs: min, default, max min: Amount of memory reserved for send buffers for TCP sockets. Each TCP socket has rights to use it due to fact of its birth. 
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index a4efb7a2796c..f00374718159 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -65,6 +65,7 @@ struct netns_ipv4 { #endif bool fib_has_custom_local_routes; bool fib_offload_disabled; + u8 sysctl_tcp_shrink_window; #ifdef CONFIG_IP_ROUTE_CLASSID atomic_t fib_num_tclassid_users; #endif diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 356afe54951c..2afb0870648b 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -1480,6 +1480,15 @@ static struct ctl_table ipv4_net_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = &tcp_syn_linear_timeouts_max, }, + { + .procname = "tcp_shrink_window", + .data = &init_net.ipv4.sysctl_tcp_shrink_window, + .maxlen = sizeof(u8), + .mode = 0644, + .proc_handler = proc_dou8vec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + }, { } }; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 84a5d557dc1a..9213804b034f 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -3281,6 +3281,8 @@ static int __net_init tcp_sk_init(struct net *net) net->ipv4.tcp_congestion_control = &tcp_reno; net->ipv4.sysctl_tcp_syn_linear_timeouts = 4; + net->ipv4.sysctl_tcp_shrink_window = 0; + return 0; } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 660eac4bf2a7..2cb39b6dad02 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -260,8 +260,8 @@ static u16 tcp_select_window(struct sock *sk) u32 old_win = tp->rcv_wnd; u32 cur_win = tcp_receive_window(tp); u32 new_win = __tcp_select_window(sk); + struct net *net = sock_net(sk); - /* Never shrink the offered window */ if (new_win < cur_win) { /* Danger Will Robinson! * Don't update rcv_wup/rcv_wnd here or else @@ -270,11 +270,14 @@ static u16 tcp_select_window(struct sock *sk) * * Relax Will Robinson. 
*/ - if (new_win == 0) - NET_INC_STATS(sock_net(sk), - LINUX_MIB_TCPWANTZEROWINDOWADV); - new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale); + if (!READ_ONCE(net->ipv4.sysctl_tcp_shrink_window) || !tp->rx_opt.rcv_wscale) { + /* Never shrink the offered window */ + if (new_win == 0) + NET_INC_STATS(net, LINUX_MIB_TCPWANTZEROWINDOWADV); + new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale); + } } + tp->rcv_wnd = new_win; tp->rcv_wup = tp->rcv_nxt; @@ -282,7 +285,7 @@ static u16 tcp_select_window(struct sock *sk) * scaled window. */ if (!tp->rx_opt.rcv_wscale && - READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)) + READ_ONCE(net->ipv4.sysctl_tcp_workaround_signed_windows)) new_win = min(new_win, MAX_TCP_WINDOW); else new_win = min(new_win, (65535U << tp->rx_opt.rcv_wscale)); @@ -294,10 +297,9 @@ static u16 tcp_select_window(struct sock *sk) if (new_win == 0) { tp->pred_flags = 0; if (old_win) - NET_INC_STATS(sock_net(sk), - LINUX_MIB_TCPTOZEROWINDOWADV); + NET_INC_STATS(net, LINUX_MIB_TCPTOZEROWINDOWADV); } else if (old_win == 0) { - NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFROMZEROWINDOWADV); + NET_INC_STATS(net, LINUX_MIB_TCPFROMZEROWINDOWADV); } return new_win; @@ -2987,6 +2989,7 @@ u32 __tcp_select_window(struct sock *sk) { struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); + struct net *net = sock_net(sk); /* MSS for the peer's data. Previous versions used mss_clamp * here. I don't know if the value based on our guesses * of peer's MSS is better for the performance. It's more correct @@ -3008,6 +3011,15 @@ u32 __tcp_select_window(struct sock *sk) if (mss <= 0) return 0; } + + /* Only allow window shrink if the sysctl is enabled and we have + * a non-zero scaling factor in effect. 
+ */ + if (READ_ONCE(net->ipv4.sysctl_tcp_shrink_window) && tp->rx_opt.rcv_wscale) + goto shrink_window_allowed; + + /* do not allow window to shrink */ + if (free_space < (full_space >> 1)) { icsk->icsk_ack.quick = 0; @@ -3062,6 +3074,36 @@ u32 __tcp_select_window(struct sock *sk) } return window; + +shrink_window_allowed: + /* new window should always be an exact multiple of scaling factor */ + free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale); + + if (free_space < (full_space >> 1)) { + icsk->icsk_ack.quick = 0; + + if (tcp_under_memory_pressure(sk)) + tcp_adjust_rcv_ssthresh(sk); + + /* if free space is too low, return a zero window */ + if (free_space < (allowed_space >> 4) || free_space < mss || + free_space < (1 << tp->rx_opt.rcv_wscale)) + return 0; + } + + if (free_space > tp->rcv_ssthresh) { + free_space = tp->rcv_ssthresh; + /* new window should always be an exact multiple of scaling factor + * + * For this case, we ALIGN "up" (increase free_space) because + * we know free_space is not zero here, it has been reduced from + * the memory-based limit, and rcv_ssthresh is not a hard limit + * (unlike sk_rcvbuf). + */ + free_space = ALIGN(free_space, (1 << tp->rx_opt.rcv_wscale)); + } + + return free_space; } void tcp_skb_collapse_tstamp(struct sk_buff *skb, -- cgit v1.2.3 From 264879fdbea0c3093057d48f3dcc7afeea433fb7 Mon Sep 17 00:00:00 2001 From: Michael Walle Date: Fri, 16 Jun 2023 12:45:57 +0200 Subject: dt-bindings: net: phy: gpy2xx: more precise description Mention that the interrupt line is just asserted for a random period of time, not the entire time. Suggested-by: Rob Herring Signed-off-by: Michael Walle Reviewed-by: Andrew Lunn Signed-off-by: David S. 
Miller --- Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml b/Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml index d71fa9de2b64..8a3713abd1ca 100644 --- a/Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml +++ b/Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml @@ -17,11 +17,12 @@ properties: maxlinear,use-broken-interrupts: description: | Interrupts are broken on some GPY2xx PHYs in that they keep the - interrupt line asserted even after the interrupt status register is - cleared. Thus it is blocking the interrupt line which is usually bad - for shared lines. By default interrupts are disabled for this PHY and - polling mode is used. If one can live with the consequences, this - property can be used to enable interrupt handling. + interrupt line asserted for a random amount of time even after the + interrupt status register is cleared. Thus it is blocking the + interrupt line which is usually bad for shared lines. By default, + interrupts are disabled for this PHY and polling mode is used. If one + can live with the consequences, this property can be used to enable + interrupt handling. Affected PHYs (as far as known) are GPY215B and GPY215C. type: boolean -- cgit v1.2.3 From a05d070a6164bd0578991e42181a52b9c7cf630c Mon Sep 17 00:00:00 2001 From: Rahul Rameshbabu Date: Mon, 12 Jun 2023 14:14:52 -0700 Subject: ptp: Clarify ptp_clock_info .adjphase expects an internal servo to be used .adjphase expects a PHC to use an internal servo algorithm to correct the provided phase offset target in the callback. Implementations of the internal servo algorithm are defined by the individual devices. Cc: Jakub Kicinski Cc: Richard Cochran Signed-off-by: Rahul Rameshbabu Acked-by: Richard Cochran Signed-off-by: David S. 
Miller --- Documentation/driver-api/ptp.rst | 16 ++++++++++++++++ include/linux/ptp_clock_kernel.h | 6 ++++-- 2 files changed, 20 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/driver-api/ptp.rst b/Documentation/driver-api/ptp.rst index 664838ae7776..4552a1f20488 100644 --- a/Documentation/driver-api/ptp.rst +++ b/Documentation/driver-api/ptp.rst @@ -73,6 +73,22 @@ Writing clock drivers class driver, since the lock may also be needed by the clock driver's interrupt service routine. +PTP hardware clock requirements for '.adjphase' +----------------------------------------------- + + The 'struct ptp_clock_info' interface has a '.adjphase' function. + This function has a set of requirements from the PHC in order to be + implemented. + + * The PHC implements a servo algorithm internally that is used to + correct the offset passed in the '.adjphase' call. + * When other PTP adjustment functions are called, the PHC servo + algorithm is disabled. + + **NOTE:** '.adjphase' is not a simple time adjustment functionality + that 'jumps' the PHC clock time based on the provided offset. It + should correct the offset provided using an internal algorithm. + Supported hardware ================== diff --git a/include/linux/ptp_clock_kernel.h b/include/linux/ptp_clock_kernel.h index fdffa6a98d79..f8e8443a8b35 100644 --- a/include/linux/ptp_clock_kernel.h +++ b/include/linux/ptp_clock_kernel.h @@ -77,8 +77,10 @@ struct ptp_system_timestamp { * nominal frequency in parts per million, but with a * 16 bit binary fractional field. * - * @adjphase: Adjusts the phase offset of the hardware clock. - * parameter delta: Desired change in nanoseconds. + * @adjphase: Indicates that the PHC should use an internal servo + * algorithm to correct the provided phase offset. + * parameter delta: PHC servo phase adjustment target + * in nanoseconds. * * @adjtime: Shifts the time of the hardware clock. * parameter delta: Desired change in nanoseconds. 
-- cgit v1.2.3 From fe3834cd0cf74b06847a1001ac44b7e2c035b5bc Mon Sep 17 00:00:00 2001 From: Rahul Rameshbabu Date: Mon, 12 Jun 2023 14:14:53 -0700 Subject: docs: ptp.rst: Add information about NVIDIA Mellanox devices The mlx5_core driver has implemented ptp clock driver functionality but lacked documentation about the PTP devices. This patch adds information about the Mellanox device family. Signed-off-by: Rahul Rameshbabu Acked-by: Richard Cochran Signed-off-by: David S. Miller --- Documentation/driver-api/ptp.rst | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'Documentation') diff --git a/Documentation/driver-api/ptp.rst b/Documentation/driver-api/ptp.rst index 4552a1f20488..5e033c3b11b3 100644 --- a/Documentation/driver-api/ptp.rst +++ b/Documentation/driver-api/ptp.rst @@ -122,3 +122,16 @@ Supported hardware - LPF settings (bandwidth, phase limiting, automatic holdover, physical layer assist (per ITU-T G.8273.2)) - Programmable output PTP clocks, any frequency up to 1GHz (to other PHY/MAC time stampers, refclk to ASSPs/SoCs/FPGAs) - Lock to GNSS input, automatic switching between GNSS and user-space PHC control (optional) + + * NVIDIA Mellanox + + - GPIO + - Certain variants of ConnectX-6 Dx and later products support one + GPIO which can time stamp external triggers and one GPIO to produce + periodic signals. + - Certain variants of ConnectX-5 and older products support one GPIO, + configured to either time stamp external triggers or produce + periodic signals. + - PHC instances + - All ConnectX devices have a free-running counter + - ConnectX-6 Dx and later devices have a UTC format counter -- cgit v1.2.3 From d0e3d29f8771068a6f8df2a148080c449bc5b046 Mon Sep 17 00:00:00 2001 From: Bartosz Golaszewski Date: Mon, 19 Jun 2023 11:24:01 +0200 Subject: dt-bindings: net: qcom,ethqos: add description for sa8775p Add the compatible for the MAC controller on sa8775p platforms. 
This MAC works with a single interrupt so add minItems to the interrupts property. The fourth clock's name is different here so change it. Enable relevant PHY properties. Add the relevant compatibles to the binding document for snps,dwmac as well. Signed-off-by: Bartosz Golaszewski Reviewed-by: Krzysztof Kozlowski Signed-off-by: Jakub Kicinski --- Documentation/devicetree/bindings/net/qcom,ethqos.yaml | 12 +++++++++++- Documentation/devicetree/bindings/net/snps,dwmac.yaml | 3 +++ 2 files changed, 14 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/qcom,ethqos.yaml b/Documentation/devicetree/bindings/net/qcom,ethqos.yaml index 60a38044fb19..7bdb412a0185 100644 --- a/Documentation/devicetree/bindings/net/qcom,ethqos.yaml +++ b/Documentation/devicetree/bindings/net/qcom,ethqos.yaml @@ -20,6 +20,7 @@ properties: compatible: enum: - qcom,qcs404-ethqos + - qcom,sa8775p-ethqos - qcom,sc8280xp-ethqos - qcom,sm8150-ethqos @@ -32,11 +33,13 @@ properties: - const: rgmii interrupts: + minItems: 1 items: - description: Combined signal for various interrupt events - description: The interrupt that occurs when Rx exits the LPI state interrupt-names: + minItems: 1 items: - const: macirq - const: eth_lpi @@ -49,11 +52,18 @@ properties: - const: stmmaceth - const: pclk - const: ptp_ref - - const: rgmii + - enum: + - rgmii + - phyaux iommus: maxItems: 1 + phys: true + + phy-names: + const: serdes + required: - compatible - clocks diff --git a/Documentation/devicetree/bindings/net/snps,dwmac.yaml b/Documentation/devicetree/bindings/net/snps,dwmac.yaml index 363b3e3ea3a6..ddf9522a5dc2 100644 --- a/Documentation/devicetree/bindings/net/snps,dwmac.yaml +++ b/Documentation/devicetree/bindings/net/snps,dwmac.yaml @@ -67,6 +67,7 @@ properties: - loongson,ls2k-dwmac - loongson,ls7a-dwmac - qcom,qcs404-ethqos + - qcom,sa8775p-ethqos - qcom,sc8280xp-ethqos - qcom,sm8150-ethqos - renesas,r9a06g032-gmac @@ -582,6 +583,7 @@ allOf: - 
ingenic,x1600-mac - ingenic,x1830-mac - ingenic,x2000-mac + - qcom,sa8775p-ethqos - qcom,sc8280xp-ethqos - snps,dwmac-3.50a - snps,dwmac-4.10a @@ -638,6 +640,7 @@ allOf: - ingenic,x1830-mac - ingenic,x2000-mac - qcom,qcs404-ethqos + - qcom,sa8775p-ethqos - qcom,sc8280xp-ethqos - qcom,sm8150-ethqos - snps,dwmac-4.00 -- cgit v1.2.3 From 6a0a6dd8df9b6e9c0ed8d99bbfb4e5e2f8c9834f Mon Sep 17 00:00:00 2001 From: Krzysztof Kozlowski Date: Sat, 17 Jun 2023 18:57:16 +0200 Subject: dt-bindings: net: bluetooth: qualcomm: document VDD_CH1 WCN3990 comes with two chains - CH0 and CH1 - where each takes VDD regulator. It seems VDD_CH1 is optional (Linux driver does not care about it), so document it to fix dtbs_check warnings like: sdm850-lenovo-yoga-c630.dtb: bluetooth: 'vddch1-supply' does not match any of the regexes: 'pinctrl-[0-9]+' Signed-off-by: Krzysztof Kozlowski Acked-by: Conor Dooley Link: https://lore.kernel.org/r/20230617165716.279857-1-krzysztof.kozlowski@linaro.org Signed-off-by: Jakub Kicinski --- .../devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml | 3 +++ 1 file changed, 3 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml b/Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml index 68f78b90d23a..604985c8068e 100644 --- a/Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml +++ b/Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml @@ -50,6 +50,9 @@ properties: vddch0-supply: description: VDD_CH0 supply regulator handle + vddch1-supply: + description: VDD_CH1 supply regulator handle + vddaon-supply: description: VDD_AON supply regulator handle -- cgit v1.2.3 From 1ca09f5746edd5e483d144118497f622af9dbe60 Mon Sep 17 00:00:00 2001 From: Krzysztof Kozlowski Date: Mon, 19 Jun 2023 19:01:34 +0200 Subject: dt-bindings: net: micrel,ks8851: allow SPI device properties The Micrel KS8851 can be attached to SPI or parallel bus and the 
difference is expressed in compatibles. Allow common SPI properties when this is a SPI variant and narrow the parallel memory bus properties to the second case. This fixes dtbs_check warning: qcom-msm8960-cdp.dtb: ethernet@0: Unevaluated properties are not allowed ('spi-max-frequency' was unexpected) Signed-off-by: Krzysztof Kozlowski Reviewed-by: Conor Dooley Link: https://lore.kernel.org/r/20230619170134.65395-1-krzysztof.kozlowski@linaro.org Signed-off-by: Jakub Kicinski --- Documentation/devicetree/bindings/net/micrel,ks8851.yaml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/micrel,ks8851.yaml b/Documentation/devicetree/bindings/net/micrel,ks8851.yaml index b44d83554ef5..b726c6e14633 100644 --- a/Documentation/devicetree/bindings/net/micrel,ks8851.yaml +++ b/Documentation/devicetree/bindings/net/micrel,ks8851.yaml @@ -44,13 +44,13 @@ required: allOf: - $ref: ethernet-controller.yaml# - - $ref: /schemas/memory-controllers/mc-peripheral-props.yaml# - if: properties: compatible: contains: const: micrel,ks8851 then: + $ref: /schemas/spi/spi-peripheral-props.yaml# properties: reg: maxItems: 1 @@ -60,6 +60,7 @@ allOf: contains: const: micrel,ks8851-mll then: + $ref: /schemas/memory-controllers/mc-peripheral-props.yaml# properties: reg: minItems: 2 -- cgit v1.2.3 From 5dfbbaa208f5429a02ccb410ae3515222bbe64ef Mon Sep 17 00:00:00 2001 From: David Arinzon Date: Tue, 20 Jun 2023 13:35:44 +0000 Subject: net: ena: Fix rst format issues in readme This patch fixes a warning in the ena documentation file identified by the kernel automatic tools. The patch also adds a missing newline between sections. 
Signed-off-by: David Arinzon Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202306171804.U7E92zoE-lkp@intel.com/ Reviewed-by: Simon Horman Link: https://lore.kernel.org/r/20230620133544.32584-1-darinzon@amazon.com Signed-off-by: Jakub Kicinski --- Documentation/networking/device_drivers/ethernet/amazon/ena.rst | 2 ++ 1 file changed, 2 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst index 491492677632..5eaa3ab6c73e 100644 --- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst +++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst @@ -38,6 +38,7 @@ debug logs. Some of the ENA devices support a working mode called Low-latency Queue (LLQ), which saves several more microseconds. + ENA Source Code Directory Structure =================================== @@ -206,6 +207,7 @@ More information about Adaptive Interrupt Moderation (DIM) can be found in Documentation/networking/net_dim.rst .. _`RX copybreak`: + RX copybreak ============ The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK -- cgit v1.2.3 From 2404dd01b53430e4ab78fc9ca069e9e93fd22059 Mon Sep 17 00:00:00 2001 From: Anton Protopopov Date: Thu, 22 Jun 2023 09:54:07 +0000 Subject: bpf, docs: BPF Iterator Document Fix the description of the seq_info field of the bpf_iter_reg structure which was wrong due to an accidental copy/paste of the previous field's description. 
Fixes: 8972e18a439d ("bpf, docs: BPF Iterator Document") Signed-off-by: Anton Protopopov Signed-off-by: Daniel Borkmann Acked-by: Yonghong Song Link: https://lore.kernel.org/bpf/20230622095407.1024053-1-aspsk@isovalent.com --- Documentation/bpf/bpf_iterators.rst | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/bpf/bpf_iterators.rst b/Documentation/bpf/bpf_iterators.rst index 6d7770793fab..07433915aa41 100644 --- a/Documentation/bpf/bpf_iterators.rst +++ b/Documentation/bpf/bpf_iterators.rst @@ -238,11 +238,8 @@ The following is the breakdown for each field in struct ``bpf_iter_reg``. that the kernel function cond_resched() is called to avoid other kernel subsystem (e.g., rcu) misbehaving. * - seq_info - - Specifies certain action requests in the kernel BPF iterator - infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means - that the kernel function cond_resched() is called to avoid other kernel - subsystem (e.g., rcu) misbehaving. - + - Specifies the set of seq operations for the BPF iterator and helpers to + initialize/free the private data for the corresponding ``seq_file``. `Click here `_ -- cgit v1.2.3 From fbc5669de62a452fb3a26a4560668637d5c9e7b5 Mon Sep 17 00:00:00 2001 From: Anton Protopopov Date: Thu, 22 Jun 2023 09:54:24 +0000 Subject: bpf, docs: Document existing macros instead of deprecated The BTF_TYPE_SAFE_NESTED macro was replaced by the BTF_TYPE_SAFE_TRUSTED, BTF_TYPE_SAFE_RCU, and BTF_TYPE_SAFE_RCU_OR_NULL macros. Fix the docs correspondingly. 
Fixes: 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the verifier.") Signed-off-by: Anton Protopopov Signed-off-by: Daniel Borkmann Acked-by: Yonghong Song Link: https://lore.kernel.org/bpf/20230622095424.1024244-1-aspsk@isovalent.com --- Documentation/bpf/kfuncs.rst | 38 ++++++++++++++++++++++++++++++++------ 1 file changed, 32 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst index 7a3d9de5f315..0d2647fb358d 100644 --- a/Documentation/bpf/kfuncs.rst +++ b/Documentation/bpf/kfuncs.rst @@ -227,23 +227,49 @@ absolutely no ABI stability guarantees. As mentioned above, a nested pointer obtained from walking a trusted pointer is no longer trusted, with one exception. If a struct type has a field that is -guaranteed to be valid as long as its parent pointer is trusted, the -``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as -follows: +guaranteed to be valid (trusted or rcu, as in KF_RCU description below) as long +as its parent pointer is valid, the following macros can be used to express +that to the verifier: + +* ``BTF_TYPE_SAFE_TRUSTED`` +* ``BTF_TYPE_SAFE_RCU`` +* ``BTF_TYPE_SAFE_RCU_OR_NULL`` + +For example, + +.. code-block:: c + + BTF_TYPE_SAFE_TRUSTED(struct socket) { + struct sock *sk; + }; + +or .. code-block:: c - BTF_TYPE_SAFE_NESTED(struct task_struct) { + BTF_TYPE_SAFE_RCU(struct task_struct) { const cpumask_t *cpus_ptr; + struct css_set __rcu *cgroups; + struct task_struct __rcu *real_parent; + struct task_struct *group_leader; }; In other words, you must: -1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro. +1. Wrap the valid pointer type in a ``BTF_TYPE_SAFE_*`` macro. -2. Specify the type and name of the trusted nested field. This field must match +2. Specify the type and name of the valid nested field. This field must match the field in the original type definition exactly. 
+A new type declared by a ``BTF_TYPE_SAFE_*`` macro also needs to be emitted so +that it appears in BTF. For example, ``BTF_TYPE_SAFE_TRUSTED(struct socket)`` +is emitted in the ``type_is_trusted()`` function as follows: + +.. code-block:: c + + BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct socket)); + + 2.4.5 KF_SLEEPABLE flag ----------------------- -- cgit v1.2.3 From 2ffb8d02a9b60d9190a871cb8466cd0721bc0a49 Mon Sep 17 00:00:00 2001 From: Christian Marangi Date: Wed, 21 Jun 2023 11:26:53 +0200 Subject: docs: ABI: sysfs-class-led-trigger-netdev: add new modes and entry Document newly introduced modes and entry for the LED netdev trigger. Add documentation for new modes: - link_10 - link_100 - link_1000 - half_duplex - full_duplex Add documentation for new entry: - hw_control Also add additional info for the interval entry and the tx/rx modes with the special case of hw_control ON. Signed-off-by: Christian Marangi Reviewed-by: Andrew Lunn Link: https://lore.kernel.org/r/20230621092653.23172-1-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski --- .../ABI/testing/sysfs-class-led-trigger-netdev | 89 ++++++++++++++++++++++ 1 file changed, 89 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-class-led-trigger-netdev b/Documentation/ABI/testing/sysfs-class-led-trigger-netdev index 646540950e38..78b62a23b14a 100644 --- a/Documentation/ABI/testing/sysfs-class-led-trigger-netdev +++ b/Documentation/ABI/testing/sysfs-class-led-trigger-netdev @@ -13,6 +13,11 @@ Description: Specifies the duration of the LED blink in milliseconds. Defaults to 50 ms. + With hw_control ON, the interval value MUST be set to the + default value and cannot be changed. + Trying to set any value in this specific mode will return + an EINVAL error. + What: /sys/class/leds//link Date: Dec 2017 KernelVersion: 4.16 @@ -39,6 +44,9 @@ Description: If set to 1, the LED will blink for the milliseconds specified in interval to signal transmission. 
+ With hw_control ON, the blink interval is controlled by hardware + and won't reflect the value set in interval. + What: /sys/class/leds//rx Date: Dec 2017 KernelVersion: 4.16 @@ -50,3 +58,84 @@ Description: If set to 1, the LED will blink for the milliseconds specified in interval to signal reception. + + With hw_control ON, the blink interval is controlled by hardware + and won't reflect the value set in interval. + +What: /sys/class/leds//hw_control +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Communicate whether the LED trigger modes are driven by hardware + or software fallback is used. + + If 0, the LED is using software fallback to blink. + + If 1, the LED is using hardware control to blink and signal the + requested modes. + +What: /sys/class/leds//link_10 +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Signal the link speed state of 10Mbps of the named network device. + + If set to 0 (default), the LED's normal state is off. + + If set to 1, the LED's normal state reflects the link state + speed of 10MBps of the named network device. + Setting this value also immediately changes the LED state. + +What: /sys/class/leds//link_100 +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Signal the link speed state of 100Mbps of the named network device. + + If set to 0 (default), the LED's normal state is off. + + If set to 1, the LED's normal state reflects the link state + speed of 100Mbps of the named network device. + Setting this value also immediately changes the LED state. + +What: /sys/class/leds//link_1000 +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Signal the link speed state of 1000Mbps of the named network device. + + If set to 0 (default), the LED's normal state is off. + + If set to 1, the LED's normal state reflects the link state + speed of 1000Mbps of the named network device. 
+ Setting this value also immediately changes the LED state. + +What: /sys/class/leds//half_duplex +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Signal the link half duplex state of the named network device. + + If set to 0 (default), the LED's normal state is off. + + If set to 1, the LED's normal state reflects the link half + duplex state of the named network device. + Setting this value also immediately changes the LED state. + +What: /sys/class/leds//full_duplex +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Signal the link full duplex state of the named network device. + + If set to 0 (default), the LED's normal state is off. + + If set to 1, the LED's normal state reflects the link full + duplex state of the named network device. + Setting this value also immediately changes the LED state. -- cgit v1.2.3 From faaa5fd30344f9a7b3816ae7a6b58ccd5a34998f Mon Sep 17 00:00:00 2001 From: Rob Herring Date: Wed, 21 Jun 2023 17:10:12 -0600 Subject: dt-bindings: net: altr,tse: Fix error in "compatible" conditional schema The conditional if/then schema has an error as the "enum" values have "const" in them. Drop the "const". 
Signed-off-by: Rob Herring Reviewed-by: Conor Dooley Link: https://lore.kernel.org/r/20230621231012.3816139-1-robh@kernel.org Signed-off-by: Paolo Abeni --- Documentation/devicetree/bindings/net/altr,tse.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/altr,tse.yaml b/Documentation/devicetree/bindings/net/altr,tse.yaml index 9d02af468906..f5d3b70af07a 100644 --- a/Documentation/devicetree/bindings/net/altr,tse.yaml +++ b/Documentation/devicetree/bindings/net/altr,tse.yaml @@ -72,8 +72,8 @@ allOf: compatible: contains: enum: - - const: altr,tse-1.0 - - const: ALTR,tse-1.0 + - altr,tse-1.0 + - ALTR,tse-1.0 then: properties: reg: -- cgit v1.2.3 From 25c24801d7da8f9cb088bc0ad5947d2e84035411 Mon Sep 17 00:00:00 2001 From: Shay Drory Date: Mon, 19 Jun 2023 07:42:07 +0300 Subject: net/mlx5: Fix SFs kernel documentation error Indent SFs probe code example in order to fix the below error: Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst:57: ERROR: Unexpected indentation. Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst:61: ERROR: Unexpected indentation. 
Fixes: e71383fb9cd1 ("net/mlx5: Light probe local SFs") Signed-off-by: Shay Drory Reviewed-by: Rahul Rameshbabu Signed-off-by: Saeed Mahameed Reviewed-by: Moshe Shemesh Reviewed-by: Automatic Verification Reviewed-by: Gal Pressman --- .../ethernet/mellanox/mlx5/switchdev.rst | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst index db62187eebce..6e3f5ee8b0d0 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst @@ -51,19 +51,21 @@ This will allow user to configure the SF before the SF have been fully probed, which will save time. Usage example: -Create SF: -$ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11 -$ devlink port function set pci/0000:08:00.0/32768 \ - hw_addr 00:00:00:00:00:11 state active -Enable ETH auxiliary device: -$ devlink dev param set auxiliary/mlx5_core.sf.1 \ - name enable_eth value true cmode driverinit +- Create SF:: -Now, in order to fully probe the SF, use devlink reload: -$ devlink dev reload auxiliary/mlx5_core.sf.1 + $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11 + $ devlink port function set pci/0000:08:00.0/32768 hw_addr 00:00:00:00:00:11 state active -mlx5 supports ETH,rdma and vdpa (vnet) auxiliary devices devlink params (see :ref:`Documentation/networking/devlink/devlink-params.rst`) +- Enable ETH auxiliary device:: + + $ devlink dev param set auxiliary/mlx5_core.sf.1 name enable_eth value true cmode driverinit + +- Now, in order to fully probe the SF, use devlink reload:: + + $ devlink dev reload auxiliary/mlx5_core.sf.1 + +mlx5 supports ETH,rdma and vdpa (vnet) auxiliary devices devlink params (see :ref:`Documentation/networking/devlink/devlink-params.rst 
`). mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst `) interface. -- cgit v1.2.3 From 737eab775d367b0ba6575d8aa2073c607a9bcf9e Mon Sep 17 00:00:00 2001 From: Donald Hunter Date: Fri, 23 Jun 2023 21:19:26 +0100 Subject: netlink: specs: add display-hint to schema definitions Add a display-hint property to the netlink schema that is for providing optional hints to generic netlink clients about how to display attribute values. A display-hint on an attribute definition is intended for letting a client such as ynl know that, for example, a u32 should be rendered as an ipv4 address. The display-hint enumeration includes a small number of networking domain-specific value types. Signed-off-by: Donald Hunter Link: https://lore.kernel.org/r/20230623201928.14275-2-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski --- Documentation/netlink/genetlink-c.yaml | 6 ++++++ Documentation/netlink/genetlink-legacy.yaml | 11 ++++++++++- Documentation/netlink/genetlink.yaml | 6 ++++++ 3 files changed, 22 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/netlink/genetlink-c.yaml b/Documentation/netlink/genetlink-c.yaml index 0519c257ecf4..57d1c1c4918f 100644 --- a/Documentation/netlink/genetlink-c.yaml +++ b/Documentation/netlink/genetlink-c.yaml @@ -195,6 +195,12 @@ properties: description: Max length for a string or a binary attribute. $ref: '#/$defs/len-or-define' sub-type: *attr-type + display-hint: &display-hint + description: | + Optional format indicator that is intended only for choosing + the right formatting mechanism when displaying values of this + type. 
+ enum: [ hex, mac, fddi, ipv4, ipv6, uuid ] # Start genetlink-c name-prefix: type: string diff --git a/Documentation/netlink/genetlink-legacy.yaml b/Documentation/netlink/genetlink-legacy.yaml index b474889b49ff..43b769c98fb2 100644 --- a/Documentation/netlink/genetlink-legacy.yaml +++ b/Documentation/netlink/genetlink-legacy.yaml @@ -119,7 +119,8 @@ properties: name: type: string type: - enum: [ u8, u16, u32, u64, s8, s16, s32, s64, string ] + description: The netlink attribute type + enum: [ u8, u16, u32, u64, s8, s16, s32, s64, string, binary ] len: $ref: '#/$defs/len-or-define' byte-order: @@ -130,6 +131,12 @@ properties: enum: description: Name of the enum type used for the attribute. type: string + display-hint: &display-hint + description: | + Optional format indicator that is intended only for choosing + the right formatting mechanism when displaying values of this + type. + enum: [ hex, mac, fddi, ipv4, ipv6, uuid ] # End genetlink-legacy attribute-sets: @@ -179,6 +186,7 @@ properties: name: type: string type: &attr-type + description: The netlink attribute type enum: [ unused, pad, flag, binary, u8, u16, u32, u64, s32, s64, string, nest, array-nest, nest-type-value ] doc: @@ -226,6 +234,7 @@ properties: description: Max length for a string or a binary attribute. $ref: '#/$defs/len-or-define' sub-type: *attr-type + display-hint: *display-hint # Start genetlink-c name-prefix: type: string diff --git a/Documentation/netlink/genetlink.yaml b/Documentation/netlink/genetlink.yaml index d8b2cdeba058..1cbb448d2f1c 100644 --- a/Documentation/netlink/genetlink.yaml +++ b/Documentation/netlink/genetlink.yaml @@ -168,6 +168,12 @@ properties: description: Max length for a string or a binary attribute. $ref: '#/$defs/len-or-define' sub-type: *attr-type + display-hint: &display-hint + description: | + Optional format indicator that is intended only for choosing + the right formatting mechanism when displaying values of this + type. 
+ enum: [ hex, mac, fddi, ipv4, ipv6, uuid ] # Make sure name-prefix does not appear in subsets (subsets inherit naming) dependencies: -- cgit v1.2.3 From 334f39ce17eff255f243ab5998af27ae40f9f04c Mon Sep 17 00:00:00 2001 From: Donald Hunter Date: Fri, 23 Jun 2023 21:19:28 +0100 Subject: netlink: specs: add display hints to ovs_flow Add display hints for mac, ipv4, ipv6, hex and uuid to the ovs_flow schema. Signed-off-by: Donald Hunter Link: https://lore.kernel.org/r/20230623201928.14275-4-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski --- Documentation/netlink/specs/ovs_flow.yaml | 107 ++++++++++++++++++++++++++++++ 1 file changed, 107 insertions(+) (limited to 'Documentation') diff --git a/Documentation/netlink/specs/ovs_flow.yaml b/Documentation/netlink/specs/ovs_flow.yaml index 1ecbcd117385..109ca1f57b6c 100644 --- a/Documentation/netlink/specs/ovs_flow.yaml +++ b/Documentation/netlink/specs/ovs_flow.yaml @@ -33,6 +33,20 @@ definitions: name: n-bytes type: u64 doc: Number of matched bytes. 
+ - + name: ovs-key-ethernet + type: struct + members: + - + name: eth-src + type: binary + len: 6 + display-hint: mac + - + name: eth-dst + type: binary + len: 6 + display-hint: mac - name: ovs-key-mpls type: struct @@ -49,10 +63,12 @@ definitions: name: ipv4-src type: u32 byte-order: big-endian + display-hint: ipv4 - name: ipv4-dst type: u32 byte-order: big-endian + display-hint: ipv4 - name: ipv4-proto type: u8 @@ -66,6 +82,45 @@ definitions: name: ipv4-frag type: u8 enum: ovs-frag-type + - + name: ovs-key-ipv6 + type: struct + members: + - + name: ipv6-src + type: binary + len: 16 + byte-order: big-endian + display-hint: ipv6 + - + name: ipv6-dst + type: binary + len: 16 + byte-order: big-endian + display-hint: ipv6 + - + name: ipv6-label + type: u32 + byte-order: big-endian + - + name: ipv6-proto + type: u8 + - + name: ipv6-tclass + type: u8 + - + name: ipv6-hlimit + type: u8 + - + name: ipv6-frag + type: u8 + - + name: ovs-key-ipv6-exthdrs + type: struct + members: + - + name: hdrs + type: u16 - name: ovs-frag-type name-prefix: ovs-frag-type- @@ -129,6 +184,51 @@ definitions: - name: icmp-code type: u8 + - + name: ovs-key-arp + type: struct + members: + - + name: arp-sip + type: u32 + byte-order: big-endian + - + name: arp-tip + type: u32 + byte-order: big-endian + - + name: arp-op + type: u16 + byte-order: big-endian + - + name: arp-sha + type: binary + len: 6 + display-hint: mac + - + name: arp-tha + type: binary + len: 6 + display-hint: mac + - + name: ovs-key-nd + type: struct + members: + - + name: nd_target + type: binary + len: 16 + byte-order: big-endian + - + name: nd-sll + type: binary + len: 6 + display-hint: mac + - + name: nd-tll + type: binary + len: 6 + display-hint: mac - name: ovs-key-ct-tuple-ipv4 type: struct @@ -345,6 +445,7 @@ attribute-sets: value of the OVS_FLOW_ATTR_KEY attribute. Optional for all requests. Present in notifications if the flow was created with this attribute. 
+ display-hint: uuid - name: ufid-flags type: u32 @@ -374,6 +475,7 @@ attribute-sets: - name: ethernet type: binary + struct: ovs-key-ethernet doc: struct ovs_key_ethernet - name: vlan @@ -390,6 +492,7 @@ attribute-sets: - name: ipv6 type: binary + struct: ovs-key-ipv6 doc: struct ovs_key_ipv6 - name: tcp @@ -410,10 +513,12 @@ attribute-sets: - name: arp type: binary + struct: ovs-key-arp doc: struct ovs_key_arp - name: nd type: binary + struct: ovs-key-nd doc: struct ovs_key_nd - name: skb-mark @@ -457,6 +562,7 @@ attribute-sets: - name: ct-labels type: binary + display-hint: hex doc: 16-octet connection tracking label - name: ct-orig-tuple-ipv4 @@ -486,6 +592,7 @@ attribute-sets: - name: ipv6-exthdrs type: binary + struct: ovs-key-ipv6-exthdrs doc: struct ovs_key_ipv6_exthdr - name: action-attrs -- cgit v1.2.3 From dc97391e661009eab46783030d2404c9b6e6f2e7 Mon Sep 17 00:00:00 2001 From: David Howells Date: Fri, 23 Jun 2023 23:55:12 +0100 Subject: sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES) Remove ->sendpage() and ->sendpage_locked(). sendmsg() with MSG_SPLICE_PAGES should be used instead. This allows multiple pages and multipage folios to be passed through. 
Signed-off-by: David Howells Acked-by: Marc Kleine-Budde # for net/can cc: Jens Axboe cc: Matthew Wilcox cc: linux-afs@lists.infradead.org cc: mptcp@lists.linux.dev cc: rds-devel@oss.oracle.com cc: tipc-discussion@lists.sourceforge.net cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- Documentation/bpf/map_sockmap.rst | 10 ++--- Documentation/filesystems/locking.rst | 2 - Documentation/filesystems/vfs.rst | 1 - Documentation/networking/scaling.rst | 4 +- crypto/af_alg.c | 28 ------------- crypto/algif_aead.c | 22 ++-------- crypto/algif_rng.c | 2 - crypto/algif_skcipher.c | 14 ------- .../ethernet/chelsio/inline_crypto/chtls/chtls.h | 2 - .../chelsio/inline_crypto/chtls/chtls_io.c | 14 ------- .../chelsio/inline_crypto/chtls/chtls_main.c | 1 - fs/nfsd/vfs.c | 2 +- include/crypto/if_alg.h | 2 - include/linux/net.h | 8 ---- include/net/inet_common.h | 2 - include/net/sock.h | 6 --- include/net/tcp.h | 4 -- net/appletalk/ddp.c | 1 - net/atm/pvc.c | 1 - net/atm/svc.c | 1 - net/ax25/af_ax25.c | 1 - net/caif/caif_socket.c | 2 - net/can/bcm.c | 1 - net/can/isotp.c | 1 - net/can/j1939/socket.c | 1 - net/can/raw.c | 1 - net/core/sock.c | 35 +--------------- net/dccp/ipv4.c | 1 - net/dccp/ipv6.c | 1 - net/ieee802154/socket.c | 2 - net/ipv4/af_inet.c | 21 ---------- net/ipv4/tcp.c | 43 ++----------------- net/ipv4/tcp_bpf.c | 23 +---------- net/ipv4/tcp_ipv4.c | 1 - net/ipv4/udp.c | 15 ------- net/ipv4/udp_impl.h | 2 - net/ipv4/udplite.c | 1 - net/ipv6/af_inet6.c | 3 -- net/ipv6/raw.c | 1 - net/ipv6/tcp_ipv6.c | 1 - net/kcm/kcmsock.c | 20 --------- net/key/af_key.c | 1 - net/l2tp/l2tp_ip.c | 1 - net/l2tp/l2tp_ip6.c | 1 - net/llc/af_llc.c | 1 - net/mctp/af_mctp.c | 1 - net/mptcp/protocol.c | 2 - net/netlink/af_netlink.c | 1 - net/netrom/af_netrom.c | 1 - net/packet/af_packet.c | 2 - net/phonet/socket.c | 2 - net/qrtr/af_qrtr.c | 1 - net/rds/af_rds.c | 1 - 
net/rose/af_rose.c | 1 - net/rxrpc/af_rxrpc.c | 1 - net/sctp/protocol.c | 1 - net/socket.c | 48 ---------------------- net/tipc/socket.c | 3 -- net/tls/tls.h | 6 --- net/tls/tls_device.c | 17 -------- net/tls/tls_main.c | 7 ---- net/tls/tls_sw.c | 35 ---------------- net/unix/af_unix.c | 19 --------- net/vmw_vsock/af_vsock.c | 3 -- net/x25/af_x25.c | 1 - net/xdp/xsk.c | 1 - 66 files changed, 20 insertions(+), 442 deletions(-) (limited to 'Documentation') diff --git a/Documentation/bpf/map_sockmap.rst b/Documentation/bpf/map_sockmap.rst index cc92047c6630..2d630686a00b 100644 --- a/Documentation/bpf/map_sockmap.rst +++ b/Documentation/bpf/map_sockmap.rst @@ -240,11 +240,11 @@ offsets into ``msg``, respectively. If a program of type ``BPF_PROG_TYPE_SK_MSG`` is run on a ``msg`` it can only parse data that the (``data``, ``data_end``) pointers have already consumed. For ``sendmsg()`` hooks this is likely the first scatterlist element. But for -calls relying on the ``sendpage`` handler (e.g., ``sendfile()``) this will be -the range (**0**, **0**) because the data is shared with user space and by -default the objective is to avoid allowing user space to modify data while (or -after) BPF verdict is being decided. This helper can be used to pull in data -and to set the start and end pointers to given values. Data will be copied if +calls relying on MSG_SPLICE_PAGES (e.g., ``sendfile()``) this will be the +range (**0**, **0**) because the data is shared with user space and by default +the objective is to avoid allowing user space to modify data while (or after) +BPF verdict is being decided. This helper can be used to pull in data and to +set the start and end pointers to given values. Data will be copied if necessary (i.e., if data was not linear and if start and end pointers do not point to the same chunk). 
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index aa1a233b0fa8..ed148919e11a 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -521,8 +521,6 @@ prototypes:: int (*fsync) (struct file *, loff_t start, loff_t end, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); - ssize_t (*sendpage) (struct file *, struct page *, int, size_t, - loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); int (*check_flags)(int); diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 769be5230210..cb2a97e49872 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -1086,7 +1086,6 @@ This describes how the VFS can manipulate an open file. As of kernel int (*fsync) (struct file *, loff_t, loff_t, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); - ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); int (*check_flags)(int); int (*flock) (struct file *, int, struct file_lock *); diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst index 3d435caa3ef2..92c9fb46d6a2 100644 --- a/Documentation/networking/scaling.rst +++ b/Documentation/networking/scaling.rst @@ -269,8 +269,8 @@ a single application thread handles flows with many different flow hashes. rps_sock_flow_table is a global flow table that contains the *desired* CPU for flows: the CPU that is currently processing the flow in userspace. Each table value is a CPU index that is updated during calls to recvmsg -and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage() -and tcp_splice_read()). 
+and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and +tcp_splice_read()). When the scheduler moves a thread to a new CPU while it has outstanding receive packets on the old CPU, packets may arrive out of order. To diff --git a/crypto/af_alg.c b/crypto/af_alg.c index cdb1dcc5dd1a..6218c773d71c 100644 --- a/crypto/af_alg.c +++ b/crypto/af_alg.c @@ -482,7 +482,6 @@ static const struct proto_ops alg_proto_ops = { .listen = sock_no_listen, .shutdown = sock_no_shutdown, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, .sendmsg = sock_no_sendmsg, .recvmsg = sock_no_recvmsg, @@ -1106,33 +1105,6 @@ unlock: } EXPORT_SYMBOL_GPL(af_alg_sendmsg); -/** - * af_alg_sendpage - sendpage system call handler - * @sock: socket of connection to user space to write to - * @page: data to send - * @offset: offset into page to begin sending - * @size: length of data - * @flags: message send/receive flags - * - * This is a generic implementation of sendpage to fill ctx->tsgl_list. - */ -ssize_t af_alg_sendpage(struct socket *sock, struct page *page, - int offset, size_t size, int flags) -{ - struct bio_vec bvec; - struct msghdr msg = { - .msg_flags = flags | MSG_SPLICE_PAGES, - }; - - if (flags & MSG_SENDPAGE_NOTLAST) - msg.msg_flags |= MSG_MORE; - - bvec_set_page(&bvec, page, size, offset); - iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size); - return sock_sendmsg(sock, &msg); -} -EXPORT_SYMBOL_GPL(af_alg_sendpage); - /** * af_alg_free_resources - release resources required for crypto request * @areq: Request holding the TX and RX SGL diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c index 35bfa283748d..7d58cbbce4af 100644 --- a/crypto/algif_aead.c +++ b/crypto/algif_aead.c @@ -9,10 +9,10 @@ * The following concept of the memory management is used: * * The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is - * filled by user space with the data submitted via sendpage. 
Filling up - * the TX SGL does not cause a crypto operation -- the data will only be - * tracked by the kernel. Upon receipt of one recvmsg call, the caller must - * provide a buffer which is tracked with the RX SGL. + * filled by user space with the data submitted via sendmsg (maybe with + * MSG_SPLICE_PAGES). Filling up the TX SGL does not cause a crypto operation + * -- the data will only be tracked by the kernel. Upon receipt of one recvmsg + * call, the caller must provide a buffer which is tracked with the RX SGL. * * During the processing of the recvmsg operation, the cipher request is * allocated and prepared. As part of the recvmsg operation, the processed @@ -370,7 +370,6 @@ static struct proto_ops algif_aead_ops = { .release = af_alg_release, .sendmsg = aead_sendmsg, - .sendpage = af_alg_sendpage, .recvmsg = aead_recvmsg, .poll = af_alg_poll, }; @@ -422,18 +421,6 @@ static int aead_sendmsg_nokey(struct socket *sock, struct msghdr *msg, return aead_sendmsg(sock, msg, size); } -static ssize_t aead_sendpage_nokey(struct socket *sock, struct page *page, - int offset, size_t size, int flags) -{ - int err; - - err = aead_check_key(sock); - if (err) - return err; - - return af_alg_sendpage(sock, page, offset, size, flags); -} - static int aead_recvmsg_nokey(struct socket *sock, struct msghdr *msg, size_t ignored, int flags) { @@ -461,7 +448,6 @@ static struct proto_ops algif_aead_ops_nokey = { .release = af_alg_release, .sendmsg = aead_sendmsg_nokey, - .sendpage = aead_sendpage_nokey, .recvmsg = aead_recvmsg_nokey, .poll = af_alg_poll, }; diff --git a/crypto/algif_rng.c b/crypto/algif_rng.c index 407408c43730..10c41adac3b1 100644 --- a/crypto/algif_rng.c +++ b/crypto/algif_rng.c @@ -174,7 +174,6 @@ static struct proto_ops algif_rng_ops = { .bind = sock_no_bind, .accept = sock_no_accept, .sendmsg = sock_no_sendmsg, - .sendpage = sock_no_sendpage, .release = af_alg_release, .recvmsg = rng_recvmsg, @@ -192,7 +191,6 @@ static struct proto_ops __maybe_unused 
algif_rng_test_ops = { .mmap = sock_no_mmap, .bind = sock_no_bind, .accept = sock_no_accept, - .sendpage = sock_no_sendpage, .release = af_alg_release, .recvmsg = rng_test_recvmsg, diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c index b1f321b9f846..9ada9b741af8 100644 --- a/crypto/algif_skcipher.c +++ b/crypto/algif_skcipher.c @@ -194,7 +194,6 @@ static struct proto_ops algif_skcipher_ops = { .release = af_alg_release, .sendmsg = skcipher_sendmsg, - .sendpage = af_alg_sendpage, .recvmsg = skcipher_recvmsg, .poll = af_alg_poll, }; @@ -246,18 +245,6 @@ static int skcipher_sendmsg_nokey(struct socket *sock, struct msghdr *msg, return skcipher_sendmsg(sock, msg, size); } -static ssize_t skcipher_sendpage_nokey(struct socket *sock, struct page *page, - int offset, size_t size, int flags) -{ - int err; - - err = skcipher_check_key(sock); - if (err) - return err; - - return af_alg_sendpage(sock, page, offset, size, flags); -} - static int skcipher_recvmsg_nokey(struct socket *sock, struct msghdr *msg, size_t ignored, int flags) { @@ -285,7 +272,6 @@ static struct proto_ops algif_skcipher_ops_nokey = { .release = af_alg_release, .sendmsg = skcipher_sendmsg_nokey, - .sendpage = skcipher_sendpage_nokey, .recvmsg = skcipher_recvmsg_nokey, .poll = af_alg_poll, }; diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h index da4818d2c856..68562a82d036 100644 --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h @@ -569,8 +569,6 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size); int chtls_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags, int *addr_len); void chtls_splice_eof(struct socket *sock); -int chtls_sendpage(struct sock *sk, struct page *page, - int offset, size_t size, int flags); int send_tx_flowc_wr(struct sock *sk, int compl, u32 snd_nxt, u32 rcv_nxt); void 
chtls_tcp_push(struct sock *sk, int flags); diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c index e08ac960c967..5fc64e47568a 100644 --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c @@ -1246,20 +1246,6 @@ void chtls_splice_eof(struct socket *sock) release_sock(sk); } -int chtls_sendpage(struct sock *sk, struct page *page, - int offset, size_t size, int flags) -{ - struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, }; - struct bio_vec bvec; - - if (flags & MSG_SENDPAGE_NOTLAST) - msg.msg_flags |= MSG_MORE; - - bvec_set_page(&bvec, page, size, offset); - iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size); - return chtls_sendmsg(sk, &msg, size); -} - static void chtls_select_window(struct sock *sk) { struct chtls_sock *csk = rcu_dereference_sk_user_data(sk); diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c index 6b6787eafd2f..455a54708be4 100644 --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c @@ -607,7 +607,6 @@ static void __init chtls_init_ulp_ops(void) chtls_cpl_prot.shutdown = chtls_shutdown; chtls_cpl_prot.sendmsg = chtls_sendmsg; chtls_cpl_prot.splice_eof = chtls_splice_eof; - chtls_cpl_prot.sendpage = chtls_sendpage; chtls_cpl_prot.recvmsg = chtls_recvmsg; chtls_cpl_prot.setsockopt = chtls_setsockopt; chtls_cpl_prot.getsockopt = chtls_getsockopt; diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index db67f8e19344..8879e207ff5a 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -936,7 +936,7 @@ nfsd_open_verified(struct svc_rqst *rqstp, struct svc_fh *fhp, int may_flags, /* * Grab and keep cached pages associated with a file in the svc_rqst - * so that they can be passed to the network sendmsg/sendpage routines 
+ * so that they can be passed to the network sendmsg routines * directly. They will be released after the sending has completed. * * Return values: Number of bytes consumed, or -EIO if there are no diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h index 34224e77f5a2..ef8ce86b1f78 100644 --- a/include/crypto/if_alg.h +++ b/include/crypto/if_alg.h @@ -229,8 +229,6 @@ void af_alg_wmem_wakeup(struct sock *sk); int af_alg_wait_for_data(struct sock *sk, unsigned flags, unsigned min); int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size, unsigned int ivsize); -ssize_t af_alg_sendpage(struct socket *sock, struct page *page, - int offset, size_t size, int flags); void af_alg_free_resources(struct af_alg_async_req *areq); void af_alg_async_cb(void *data, int err); __poll_t af_alg_poll(struct file *file, struct socket *sock, diff --git a/include/linux/net.h b/include/linux/net.h index 23324e9a2b3d..41c608c1b02c 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -207,8 +207,6 @@ struct proto_ops { size_t total_len, int flags); int (*mmap) (struct file *file, struct socket *sock, struct vm_area_struct * vma); - ssize_t (*sendpage) (struct socket *sock, struct page *page, - int offset, size_t size, int flags); ssize_t (*splice_read)(struct socket *sock, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags); void (*splice_eof)(struct socket *sock); @@ -222,8 +220,6 @@ struct proto_ops { sk_read_actor_t recv_actor); /* This is different from read_sock(), it reads an entire skb at a time. 
*/ int (*read_skb)(struct sock *sk, skb_read_actor_t recv_actor); - int (*sendpage_locked)(struct sock *sk, struct page *page, - int offset, size_t size, int flags); int (*sendmsg_locked)(struct sock *sk, struct msghdr *msg, size_t size); int (*set_rcvlowat)(struct sock *sk, int val); @@ -341,10 +337,6 @@ int kernel_connect(struct socket *sock, struct sockaddr *addr, int addrlen, int flags); int kernel_getsockname(struct socket *sock, struct sockaddr *addr); int kernel_getpeername(struct socket *sock, struct sockaddr *addr); -int kernel_sendpage(struct socket *sock, struct page *page, int offset, - size_t size, int flags); -int kernel_sendpage_locked(struct sock *sk, struct page *page, int offset, - size_t size, int flags); int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how); /* Routine returns the IP overhead imposed by a (caller-protected) socket. */ diff --git a/include/net/inet_common.h b/include/net/inet_common.h index a75333342c4e..b86b8e21de7f 100644 --- a/include/net/inet_common.h +++ b/include/net/inet_common.h @@ -36,8 +36,6 @@ void __inet_accept(struct socket *sock, struct socket *newsock, int inet_send_prepare(struct sock *sk); int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size); void inet_splice_eof(struct socket *sock); -ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset, - size_t size, int flags); int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int flags); int inet_shutdown(struct socket *sock, int how); diff --git a/include/net/sock.h b/include/net/sock.h index 62a1b99da349..121284f455a8 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1277,8 +1277,6 @@ struct proto { size_t len); int (*recvmsg)(struct sock *sk, struct msghdr *msg, size_t len, int flags, int *addr_len); - int (*sendpage)(struct sock *sk, struct page *page, - int offset, size_t size, int flags); void (*splice_eof)(struct socket *sock); int (*bind)(struct sock *sk, struct sockaddr *addr, 
int addr_len); @@ -1919,10 +1917,6 @@ int sock_no_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len); int sock_no_recvmsg(struct socket *, struct msghdr *, size_t, int); int sock_no_mmap(struct file *file, struct socket *sock, struct vm_area_struct *vma); -ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, - size_t size, int flags); -ssize_t sock_no_sendpage_locked(struct sock *sk, struct page *page, - int offset, size_t size, int flags); /* * Functions to fill in entries in struct proto_ops when a protocol diff --git a/include/net/tcp.h b/include/net/tcp.h index 31b534370787..226bce6d1e8c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -329,10 +329,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size); int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied, size_t size, struct ubuf_info *uarg); void tcp_splice_eof(struct socket *sock); -int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size, - int flags); -int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset, - size_t size, int flags); int tcp_send_mss(struct sock *sk, int *size_goal, int flags); int tcp_wmem_schedule(struct sock *sk, int copy); void tcp_push(struct sock *sk, int flags, int mss_now, int nonagle, diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c index a06f4d4a6f47..8978fb6212ff 100644 --- a/net/appletalk/ddp.c +++ b/net/appletalk/ddp.c @@ -1929,7 +1929,6 @@ static const struct proto_ops atalk_dgram_ops = { .sendmsg = atalk_sendmsg, .recvmsg = atalk_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; static struct notifier_block ddp_notifier = { diff --git a/net/atm/pvc.c b/net/atm/pvc.c index 53e7d3f39e26..66d9a9bd5896 100644 --- a/net/atm/pvc.c +++ b/net/atm/pvc.c @@ -126,7 +126,6 @@ static const struct proto_ops pvc_proto_ops = { .sendmsg = vcc_sendmsg, .recvmsg = vcc_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; diff --git 
a/net/atm/svc.c b/net/atm/svc.c index d83556d8beb9..36a814f1fbd1 100644 --- a/net/atm/svc.c +++ b/net/atm/svc.c @@ -654,7 +654,6 @@ static const struct proto_ops svc_proto_ops = { .sendmsg = vcc_sendmsg, .recvmsg = vcc_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c index d8da400cb4de..5db805d5f74d 100644 --- a/net/ax25/af_ax25.c +++ b/net/ax25/af_ax25.c @@ -2022,7 +2022,6 @@ static const struct proto_ops ax25_proto_ops = { .sendmsg = ax25_sendmsg, .recvmsg = ax25_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; /* diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c index 4eebcc66c19a..9c82698da4f5 100644 --- a/net/caif/caif_socket.c +++ b/net/caif/caif_socket.c @@ -976,7 +976,6 @@ static const struct proto_ops caif_seqpacket_ops = { .sendmsg = caif_seqpkt_sendmsg, .recvmsg = caif_seqpkt_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; static const struct proto_ops caif_stream_ops = { @@ -996,7 +995,6 @@ static const struct proto_ops caif_stream_ops = { .sendmsg = caif_stream_sendmsg, .recvmsg = caif_stream_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; /* This function is called when a socket is finally destroyed. 
*/ diff --git a/net/can/bcm.c b/net/can/bcm.c index a962ec2b8ba5..9ba35685b043 100644 --- a/net/can/bcm.c +++ b/net/can/bcm.c @@ -1703,7 +1703,6 @@ static const struct proto_ops bcm_ops = { .sendmsg = bcm_sendmsg, .recvmsg = bcm_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; static struct proto bcm_proto __read_mostly = { diff --git a/net/can/isotp.c b/net/can/isotp.c index 84f9aba02901..1f25b45868cf 100644 --- a/net/can/isotp.c +++ b/net/can/isotp.c @@ -1699,7 +1699,6 @@ static const struct proto_ops isotp_ops = { .sendmsg = isotp_sendmsg, .recvmsg = isotp_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; static struct proto isotp_proto __read_mostly = { diff --git a/net/can/j1939/socket.c b/net/can/j1939/socket.c index 35970c25496a..feaec4ad6d16 100644 --- a/net/can/j1939/socket.c +++ b/net/can/j1939/socket.c @@ -1306,7 +1306,6 @@ static const struct proto_ops j1939_ops = { .sendmsg = j1939_sk_sendmsg, .recvmsg = j1939_sk_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; static struct proto j1939_proto __read_mostly = { diff --git a/net/can/raw.c b/net/can/raw.c index f64469b98260..15c79b079184 100644 --- a/net/can/raw.c +++ b/net/can/raw.c @@ -962,7 +962,6 @@ static const struct proto_ops raw_ops = { .sendmsg = raw_sendmsg, .recvmsg = raw_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; static struct proto raw_proto __read_mostly = { diff --git a/net/core/sock.c b/net/core/sock.c index 5f1747c12004..de719094b804 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -3261,36 +3261,6 @@ void __receive_sock(struct file *file) } } -ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags) -{ - ssize_t res; - struct msghdr msg = {.msg_flags = flags}; - struct kvec iov; - char *kaddr = kmap(page); - iov.iov_base = kaddr + offset; - iov.iov_len = size; - res = kernel_sendmsg(sock, &msg, &iov, 1, size); - kunmap(page); - return res; -} 
-EXPORT_SYMBOL(sock_no_sendpage); - -ssize_t sock_no_sendpage_locked(struct sock *sk, struct page *page, - int offset, size_t size, int flags) -{ - ssize_t res; - struct msghdr msg = {.msg_flags = flags}; - struct kvec iov; - char *kaddr = kmap(page); - - iov.iov_base = kaddr + offset; - iov.iov_len = size; - res = kernel_sendmsg_locked(sk, &msg, &iov, 1, size); - kunmap(page); - return res; -} -EXPORT_SYMBOL(sock_no_sendpage_locked); - /* * Default Socket Callbacks */ @@ -4046,7 +4016,7 @@ static void proto_seq_printf(struct seq_file *seq, struct proto *proto) { seq_printf(seq, "%-9s %4u %6d %6ld %-3s %6u %-3s %-10s " - "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n", + "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n", proto->name, proto->obj_size, sock_prot_inuse_get(seq_file_net(seq), proto), @@ -4067,7 +4037,6 @@ static void proto_seq_printf(struct seq_file *seq, struct proto *proto) proto_method_implemented(proto->getsockopt), proto_method_implemented(proto->sendmsg), proto_method_implemented(proto->recvmsg), - proto_method_implemented(proto->sendpage), proto_method_implemented(proto->bind), proto_method_implemented(proto->backlog_rcv), proto_method_implemented(proto->hash), @@ -4088,7 +4057,7 @@ static int proto_seq_show(struct seq_file *seq, void *v) "maxhdr", "slab", "module", - "cl co di ac io in de sh ss gs se re sp bi br ha uh gp em\n"); + "cl co di ac io in de sh ss gs se re bi br ha uh gp em\n"); else proto_seq_printf(seq, list_entry(v, struct proto, node)); return 0; diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c index 3ab68415d121..fa8079303cb0 100644 --- a/net/dccp/ipv4.c +++ b/net/dccp/ipv4.c @@ -1010,7 +1010,6 @@ static const struct proto_ops inet_dccp_ops = { .sendmsg = inet_sendmsg, .recvmsg = sock_common_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; static struct inet_protosw dccp_v4_protosw = { diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c index 
93c98990d726..7249ef218178 100644 --- a/net/dccp/ipv6.c +++ b/net/dccp/ipv6.c @@ -1087,7 +1087,6 @@ static const struct proto_ops inet6_dccp_ops = { .sendmsg = inet_sendmsg, .recvmsg = sock_common_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, #ifdef CONFIG_COMPAT .compat_ioctl = inet6_compat_ioctl, #endif diff --git a/net/ieee802154/socket.c b/net/ieee802154/socket.c index 9c124705120d..00302e8b9615 100644 --- a/net/ieee802154/socket.c +++ b/net/ieee802154/socket.c @@ -426,7 +426,6 @@ static const struct proto_ops ieee802154_raw_ops = { .sendmsg = ieee802154_sock_sendmsg, .recvmsg = sock_common_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; /* DGRAM Sockets (802.15.4 dataframes) */ @@ -989,7 +988,6 @@ static const struct proto_ops ieee802154_dgram_ops = { .sendmsg = ieee802154_sock_sendmsg, .recvmsg = sock_common_recvmsg, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, }; static void ieee802154_sock_destruct(struct sock *sk) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 38e649fb4474..9b2ca2fcc5a1 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -847,23 +847,6 @@ void inet_splice_eof(struct socket *sock) } EXPORT_SYMBOL_GPL(inet_splice_eof); -ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset, - size_t size, int flags) -{ - struct sock *sk = sock->sk; - const struct proto *prot; - - if (unlikely(inet_send_prepare(sk))) - return -EAGAIN; - - /* IPV6_ADDRFORM can change sk->sk_prot under us. 
*/ - prot = READ_ONCE(sk->sk_prot); - if (prot->sendpage) - return prot->sendpage(sk, page, offset, size, flags); - return sock_no_sendpage(sock, page, offset, size, flags); -} -EXPORT_SYMBOL(inet_sendpage); - INDIRECT_CALLABLE_DECLARE(int udp_recvmsg(struct sock *, struct msghdr *, size_t, int, int *)); int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, @@ -1067,12 +1050,10 @@ const struct proto_ops inet_stream_ops = { .mmap = tcp_mmap, #endif .splice_eof = inet_splice_eof, - .sendpage = inet_sendpage, .splice_read = tcp_splice_read, .read_sock = tcp_read_sock, .read_skb = tcp_read_skb, .sendmsg_locked = tcp_sendmsg_locked, - .sendpage_locked = tcp_sendpage_locked, .peek_len = tcp_peek_len, #ifdef CONFIG_COMPAT .compat_ioctl = inet_compat_ioctl, @@ -1102,7 +1083,6 @@ const struct proto_ops inet_dgram_ops = { .recvmsg = inet_recvmsg, .mmap = sock_no_mmap, .splice_eof = inet_splice_eof, - .sendpage = inet_sendpage, .set_peek_off = sk_set_peek_off, #ifdef CONFIG_COMPAT .compat_ioctl = inet_compat_ioctl, @@ -1134,7 +1114,6 @@ static const struct proto_ops inet_sockraw_ops = { .recvmsg = inet_recvmsg, .mmap = sock_no_mmap, .splice_eof = inet_splice_eof, - .sendpage = inet_sendpage, #ifdef CONFIG_COMPAT .compat_ioctl = inet_compat_ioctl, #endif diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index d56edc2c885f..e03e08745308 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -923,11 +923,10 @@ int tcp_send_mss(struct sock *sk, int *size_goal, int flags) return mss_now; } -/* In some cases, both sendpage() and sendmsg() could have added - * an skb to the write queue, but failed adding payload on it. - * We need to remove it to consume less memory, but more - * importantly be able to generate EPOLLOUT for Edge Trigger epoll() - * users. +/* In some cases, both sendmsg() could have added an skb to the write queue, + * but failed adding payload on it. 
We need to remove it to consume less + * memory, but more importantly be able to generate EPOLLOUT for Edge Trigger + * epoll() users. */ void tcp_remove_empty_skb(struct sock *sk) { @@ -975,40 +974,6 @@ int tcp_wmem_schedule(struct sock *sk, int copy) return min(copy, sk->sk_forward_alloc); } -int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset, - size_t size, int flags) -{ - struct bio_vec bvec; - struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, }; - - if (!(sk->sk_route_caps & NETIF_F_SG)) - return sock_no_sendpage_locked(sk, page, offset, size, flags); - - tcp_rate_check_app_limited(sk); /* is sending application-limited? */ - - bvec_set_page(&bvec, page, size, offset); - iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size); - - if (flags & MSG_SENDPAGE_NOTLAST) - msg.msg_flags |= MSG_MORE; - - return tcp_sendmsg_locked(sk, &msg, size); -} -EXPORT_SYMBOL_GPL(tcp_sendpage_locked); - -int tcp_sendpage(struct sock *sk, struct page *page, int offset, - size_t size, int flags) -{ - int ret; - - lock_sock(sk); - ret = tcp_sendpage_locked(sk, page, offset, size, flags); - release_sock(sk); - - return ret; -} -EXPORT_SYMBOL(tcp_sendpage); - void tcp_free_fastopen_req(struct tcp_sock *tp) { if (tp->fastopen_req) { diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index 31d6005cea9b..81f0dff69e0b 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -486,7 +486,7 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) long timeo; int flags; - /* Don't let internal sendpage flags through */ + /* Don't let internal flags through */ flags = (msg->msg_flags & ~MSG_SENDPAGE_DECRYPTED); flags |= MSG_NO_SHARED_FRAGS; @@ -566,23 +566,6 @@ out_err: return copied ? 
copied : err; } -static int tcp_bpf_sendpage(struct sock *sk, struct page *page, int offset, - size_t size, int flags) -{ - struct bio_vec bvec; - struct msghdr msg = { - .msg_flags = flags | MSG_SPLICE_PAGES, - }; - - bvec_set_page(&bvec, page, size, offset); - iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size); - - if (flags & MSG_SENDPAGE_NOTLAST) - msg.msg_flags |= MSG_MORE; - - return tcp_bpf_sendmsg(sk, &msg, size); -} - enum { TCP_BPF_IPV4, TCP_BPF_IPV6, @@ -612,7 +595,6 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS], prot[TCP_BPF_TX] = prot[TCP_BPF_BASE]; prot[TCP_BPF_TX].sendmsg = tcp_bpf_sendmsg; - prot[TCP_BPF_TX].sendpage = tcp_bpf_sendpage; prot[TCP_BPF_RX] = prot[TCP_BPF_BASE]; prot[TCP_BPF_RX].recvmsg = tcp_bpf_recvmsg_parser; @@ -647,8 +629,7 @@ static int tcp_bpf_assert_proto_ops(struct proto *ops) * indeed valid assumptions. */ return ops->recvmsg == tcp_recvmsg && - ops->sendmsg == tcp_sendmsg && - ops->sendpage == tcp_sendpage ? 0 : -ENOTSUPP; + ops->sendmsg == tcp_sendmsg ? 
0 : -ENOTSUPP; } int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 9213804b034f..fd365de4d5ff 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -3117,7 +3117,6 @@ struct proto tcp_prot = { .recvmsg = tcp_recvmsg, .sendmsg = tcp_sendmsg, .splice_eof = tcp_splice_eof, - .sendpage = tcp_sendpage, .backlog_rcv = tcp_v4_do_rcv, .release_cb = tcp_release_cb, .hash = inet_hash, diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 48fdcd3cad9c..42a96b3547c9 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1340,20 +1340,6 @@ void udp_splice_eof(struct socket *sock) } EXPORT_SYMBOL_GPL(udp_splice_eof); -int udp_sendpage(struct sock *sk, struct page *page, int offset, - size_t size, int flags) -{ - struct bio_vec bvec; - struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES }; - - if (flags & MSG_SENDPAGE_NOTLAST) - msg.msg_flags |= MSG_MORE; - - bvec_set_page(&bvec, page, size, offset); - iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size); - return udp_sendmsg(sk, &msg, size); -} - #define UDP_SKB_IS_STATELESS 0x80000000 /* all head states (dst, sk, nf conntrack) except skb extensions are @@ -2933,7 +2919,6 @@ struct proto udp_prot = { .sendmsg = udp_sendmsg, .recvmsg = udp_recvmsg, .splice_eof = udp_splice_eof, - .sendpage = udp_sendpage, .release_cb = ip4_datagram_release_cb, .hash = udp_lib_hash, .unhash = udp_lib_unhash, diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h index 4ba7a88a1b1d..e1ff3a375996 100644 --- a/net/ipv4/udp_impl.h +++ b/net/ipv4/udp_impl.h @@ -19,8 +19,6 @@ int udp_getsockopt(struct sock *sk, int level, int optname, int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags, int *addr_len); -int udp_sendpage(struct sock *sk, struct page *page, int offset, size_t size, - int flags); void udp_destroy_sock(struct sock *sk); #ifdef CONFIG_PROC_FS diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c index 
143f93a12f25..39ecdad1b50c 100644 --- a/net/ipv4/udplite.c +++ b/net/ipv4/udplite.c @@ -56,7 +56,6 @@ struct proto udplite_prot = { .getsockopt = udp_getsockopt, .sendmsg = udp_sendmsg, .recvmsg = udp_recvmsg, - .sendpage = udp_sendpage, .hash = udp_lib_hash, .unhash = udp_lib_unhash, .rehash = udp_v4_rehash, diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index b3451cf47d29..5d593ddc0347 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -696,9 +696,7 @@ const struct proto_ops inet6_stream_ops = { .mmap = tcp_mmap, #endif .splice_eof = inet_splice_eof, - .sendpage = inet_sendpage, .sendmsg_locked = tcp_sendmsg_locked, - .sendpage_locked = tcp_sendpage_locked, .splice_read = tcp_splice_read, .read_sock = tcp_read_sock, .read_skb = tcp_read_skb, @@ -729,7 +727,6 @@ const struct proto_ops inet6_dgram_ops = { .recvmsg = inet6_recvmsg, /* retpoline's sake */ .read_skb = udp_read_skb, .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, .set_peek_off = sk_set_peek_off, #ifdef CONFIG_COMPAT .compat_ioctl = inet6_compat_ioctl, diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index c9caeb5a43ed..ac1cef094c5f 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -1296,7 +1296,6 @@ const struct proto_ops inet6_sockraw_ops = { .sendmsg = inet_sendmsg, /* ok */ .recvmsg = sock_common_recvmsg, /* ok */ .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage, #ifdef CONFIG_COMPAT .compat_ioctl = inet6_compat_ioctl, #endif diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index c17c8ff94b79..40dd92a2f480 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -2151,7 +2151,6 @@ struct proto tcpv6_prot = { .recvmsg = tcp_recvmsg, .sendmsg = tcp_sendmsg, .splice_eof = tcp_splice_eof, - .sendpage = tcp_sendpage, .backlog_rcv = tcp_v6_do_rcv, .release_cb = tcp_release_cb, .hash = inet6_hash, diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c index d0537c1c8cd7..393f01b2a7e6 100644 --- a/net/kcm/kcmsock.c +++ b/net/kcm/kcmsock.c @@ -963,24 +963,6 @@ static void 
kcm_splice_eof(struct socket *sock)
 	release_sock(sk);
 }
 
-static ssize_t kcm_sendpage(struct socket *sock, struct page *page,
-			    int offset, size_t size, int flags)
-
-{
-	struct bio_vec bvec;
-	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
-
-	if (flags & MSG_SENDPAGE_NOTLAST)
-		msg.msg_flags |= MSG_MORE;
-
-	if (flags & MSG_OOB)
-		return -EOPNOTSUPP;
-
-	bvec_set_page(&bvec, page, size, offset);
-	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-	return kcm_sendmsg(sock, &msg, size);
-}
-
 static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 		       size_t len, int flags)
 {
@@ -1769,7 +1751,6 @@ static const struct proto_ops kcm_dgram_ops = {
 	.recvmsg = kcm_recvmsg,
 	.mmap = sock_no_mmap,
 	.splice_eof = kcm_splice_eof,
-	.sendpage = kcm_sendpage,
 };
 
 static const struct proto_ops kcm_seqpacket_ops = {
@@ -1791,7 +1772,6 @@ static const struct proto_ops kcm_seqpacket_ops = {
 	.recvmsg = kcm_recvmsg,
 	.mmap = sock_no_mmap,
 	.splice_eof = kcm_splice_eof,
-	.sendpage = kcm_sendpage,
 	.splice_read = kcm_splice_read,
 };
 
diff --git a/net/key/af_key.c b/net/key/af_key.c
index 31ab12fd720a..ede3c6a60353 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -3761,7 +3761,6 @@ static const struct proto_ops pfkey_ops = {
 	.listen = sock_no_listen,
 	.shutdown = sock_no_shutdown,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 
 	/* Now the operations that really occur. */
 	.release = pfkey_release,
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 2b795c1064f5..f9073bc7281f 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -624,7 +624,6 @@ static const struct proto_ops l2tp_ip_ops = {
 	.sendmsg = inet_sendmsg,
 	.recvmsg = sock_common_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static struct inet_protosw l2tp_ip_protosw = {
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 5137ea1861ce..b1623f9c4f92 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -751,7 +751,6 @@ static const struct proto_ops l2tp_ip6_ops = {
 	.sendmsg = inet_sendmsg,
 	.recvmsg = sock_common_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = inet6_compat_ioctl,
 #endif
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 9ffbc667be6c..57c35c960b2c 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -1232,7 +1232,6 @@ static const struct proto_ops llc_ui_ops = {
 	.sendmsg = llc_ui_sendmsg,
 	.recvmsg = llc_ui_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static const char llc_proc_err_msg[] __initconst =
diff --git a/net/mctp/af_mctp.c b/net/mctp/af_mctp.c
index bb4bd0b6a4f7..f6be58b68c6f 100644
--- a/net/mctp/af_mctp.c
+++ b/net/mctp/af_mctp.c
@@ -485,7 +485,6 @@ static const struct proto_ops mctp_dgram_ops = {
 	.sendmsg = mctp_sendmsg,
 	.recvmsg = mctp_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = mctp_compat_ioctl,
 #endif
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index bd023debedc8..e892673deb73 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3866,7 +3866,6 @@ static const struct proto_ops mptcp_stream_ops = {
 	.sendmsg = inet_sendmsg,
 	.recvmsg = inet_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = inet_sendpage,
 };
 
 static struct inet_protosw mptcp_protosw = {
@@ -3961,7 +3960,6 @@ static const struct proto_ops mptcp_v6_stream_ops = {
 	.sendmsg = inet6_sendmsg,
 	.recvmsg = inet6_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = inet_sendpage,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = inet6_compat_ioctl,
 #endif
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index cbd9aa7ee24a..39cfb778ebc5 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2815,7 +2815,6 @@ static const struct proto_ops netlink_ops = {
 	.sendmsg = netlink_sendmsg,
 	.recvmsg = netlink_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static const struct net_proto_family netlink_family_ops = {
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index 5a4cb796150f..eb8ccbd58df7 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -1364,7 +1364,6 @@ static const struct proto_ops nr_proto_ops = {
 	.sendmsg = nr_sendmsg,
 	.recvmsg = nr_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static struct notifier_block nr_dev_notifier = {
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index a2dbeb264f26..85ff90a03b0c 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4621,7 +4621,6 @@ static const struct proto_ops packet_ops_spkt = {
 	.sendmsg = packet_sendmsg_spkt,
 	.recvmsg = packet_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static const struct proto_ops packet_ops = {
@@ -4643,7 +4642,6 @@ static const struct proto_ops packet_ops = {
 	.sendmsg = packet_sendmsg,
 	.recvmsg = packet_recvmsg,
 	.mmap = packet_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static const struct net_proto_family packet_family_ops = {
diff --git a/net/phonet/socket.c b/net/phonet/socket.c
index 967f9b4dc026..1018340d89a7 100644
--- a/net/phonet/socket.c
+++ b/net/phonet/socket.c
@@ -441,7 +441,6 @@ const struct proto_ops phonet_dgram_ops = {
 	.sendmsg = pn_socket_sendmsg,
 	.recvmsg = sock_common_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 const struct proto_ops phonet_stream_ops = {
@@ -462,7 +461,6 @@ const struct proto_ops phonet_stream_ops = {
 	.sendmsg = pn_socket_sendmsg,
 	.recvmsg = sock_common_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 EXPORT_SYMBOL(phonet_stream_ops);
 
diff --git a/net/qrtr/af_qrtr.c b/net/qrtr/af_qrtr.c
index 76f0434d3d06..78beb74146e7 100644
--- a/net/qrtr/af_qrtr.c
+++ b/net/qrtr/af_qrtr.c
@@ -1244,7 +1244,6 @@ static const struct proto_ops qrtr_proto_ops = {
 	.shutdown = sock_no_shutdown,
 	.release = qrtr_release,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static struct proto qrtr_proto = {
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 3ff6995244e5..01c4cdfef45d 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -653,7 +653,6 @@ static const struct proto_ops rds_proto_ops = {
 	.sendmsg = rds_sendmsg,
 	.recvmsg = rds_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static void rds_sock_destruct(struct sock *sk)
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index ca2b17f32670..49dafe9ac72f 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -1496,7 +1496,6 @@ static const struct proto_ops rose_proto_ops = {
 	.sendmsg = rose_sendmsg,
 	.recvmsg = rose_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static struct notifier_block rose_dev_notifier = {
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index da0b3b5157d5..f2cf4aa99db2 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -954,7 +954,6 @@ static const struct proto_ops rxrpc_rpc_ops = {
 	.sendmsg = rxrpc_sendmsg,
 	.recvmsg = rxrpc_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static struct proto rxrpc_proto = {
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 664d1f2e9121..274d07bd774f 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1133,7 +1133,6 @@ static const struct proto_ops inet_seqpacket_ops = {
 	.sendmsg = inet_sendmsg,
 	.recvmsg = inet_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 /* Registration with AF_INET family. */
diff --git a/net/socket.c b/net/socket.c
index b778fc03c6e0..8c3c8b29995a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3552,54 +3552,6 @@ int kernel_getpeername(struct socket *sock, struct sockaddr *addr)
 }
 EXPORT_SYMBOL(kernel_getpeername);
 
-/**
- *	kernel_sendpage - send a &page through a socket (kernel space)
- *	@sock: socket
- *	@page: page
- *	@offset: page offset
- *	@size: total size in bytes
- *	@flags: flags (MSG_DONTWAIT, ...)
- *
- *	Returns the total amount sent in bytes or an error.
- */
-
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
-		    size_t size, int flags)
-{
-	if (sock->ops->sendpage) {
-		/* Warn in case the improper page to zero-copy send */
-		WARN_ONCE(!sendpage_ok(page), "improper page for zero-copy send");
-		return sock->ops->sendpage(sock, page, offset, size, flags);
-	}
-	return sock_no_sendpage(sock, page, offset, size, flags);
-}
-EXPORT_SYMBOL(kernel_sendpage);
-
-/**
- *	kernel_sendpage_locked - send a &page through the locked sock (kernel space)
- *	@sk: sock
- *	@page: page
- *	@offset: page offset
- *	@size: total size in bytes
- *	@flags: flags (MSG_DONTWAIT, ...)
- *
- *	Returns the total amount sent in bytes or an error.
- *	Caller must hold @sk.
- */
-
-int kernel_sendpage_locked(struct sock *sk, struct page *page, int offset,
-			   size_t size, int flags)
-{
-	struct socket *sock = sk->sk_socket;
-
-	if (sock->ops->sendpage_locked)
-		return sock->ops->sendpage_locked(sk, page, offset, size,
-						  flags);
-
-	return sock_no_sendpage_locked(sk, page, offset, size, flags);
-}
-EXPORT_SYMBOL(kernel_sendpage_locked);
-
 /**
  *	kernel_sock_shutdown - shut down part of a full-duplex connection (kernel space)
  *	@sock: socket
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index dd73d71c02a9..ef8e5139a873 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -3375,7 +3375,6 @@ static const struct proto_ops msg_ops = {
 	.sendmsg = tipc_sendmsg,
 	.recvmsg = tipc_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage
 };
 
 static const struct proto_ops packet_ops = {
@@ -3396,7 +3395,6 @@ static const struct proto_ops packet_ops = {
 	.sendmsg = tipc_send_packet,
 	.recvmsg = tipc_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage
 };
 
 static const struct proto_ops stream_ops = {
@@ -3417,7 +3415,6 @@ static const struct proto_ops stream_ops = {
 	.sendmsg = tipc_sendstream,
 	.recvmsg = tipc_recvstream,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage
 };
 
 static const struct net_proto_family tipc_family_ops = {
diff --git a/net/tls/tls.h b/net/tls/tls.h
index d002c3af1966..86cef1c68e03 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -98,10 +98,6 @@ void tls_sw_strparser_arm(struct sock *sk, struct tls_context *ctx);
 void tls_sw_strparser_done(struct tls_context *tls_ctx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 void tls_sw_splice_eof(struct socket *sock);
-int tls_sw_sendpage_locked(struct sock *sk, struct page *page,
-			   int offset, size_t size, int flags);
-int tls_sw_sendpage(struct sock *sk, struct page *page,
-		    int offset, size_t size, int flags);
 void tls_sw_cancel_work_tx(struct tls_context *tls_ctx);
 void tls_sw_release_resources_tx(struct sock *sk);
 void tls_sw_free_ctx_tx(struct tls_context *tls_ctx);
@@ -117,8 +113,6 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
 
 int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 void tls_device_splice_eof(struct socket *sock);
-int tls_device_sendpage(struct sock *sk, struct page *page,
-			int offset, size_t size, int flags);
 int tls_tx_records(struct sock *sk, int flags);
 
 void tls_sw_write_space(struct sock *sk, struct tls_context *ctx);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 975299d7213b..840ee06f1708 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -621,23 +621,6 @@ void tls_device_splice_eof(struct socket *sock)
 	mutex_unlock(&tls_ctx->tx_lock);
 }
 
-int tls_device_sendpage(struct sock *sk, struct page *page,
-			int offset, size_t size, int flags)
-{
-	struct bio_vec bvec;
-	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
-
-	if (flags & MSG_SENDPAGE_NOTLAST)
-		msg.msg_flags |= MSG_MORE;
-
-	if (flags & MSG_OOB)
-		return -EOPNOTSUPP;
-
-	bvec_set_page(&bvec, page, size, offset);
-	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-	return tls_device_sendmsg(sk, &msg, size);
-}
-
 struct tls_record_info *tls_get_record(struct tls_offload_context_tx *context,
 				       u32 seq, u64 *p_record_sn)
 {
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 7b9c83dd7de2..d5ed4d47b16e 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -958,7 +958,6 @@ static void build_proto_ops(struct proto_ops ops[TLS_NUM_CONFIG][TLS_NUM_CONFIG]
 
 	ops[TLS_SW ][TLS_BASE] = ops[TLS_BASE][TLS_BASE];
 	ops[TLS_SW ][TLS_BASE].splice_eof = tls_sw_splice_eof;
-	ops[TLS_SW ][TLS_BASE].sendpage_locked = tls_sw_sendpage_locked;
 
 	ops[TLS_BASE][TLS_SW ] = ops[TLS_BASE][TLS_BASE];
 	ops[TLS_BASE][TLS_SW ].splice_read = tls_sw_splice_read;
@@ -970,17 +969,14 @@ static void build_proto_ops(struct proto_ops ops[TLS_NUM_CONFIG][TLS_NUM_CONFIG]
 
 #ifdef CONFIG_TLS_DEVICE
 	ops[TLS_HW ][TLS_BASE] = ops[TLS_BASE][TLS_BASE];
-	ops[TLS_HW ][TLS_BASE].sendpage_locked = NULL;
 
 	ops[TLS_HW ][TLS_SW ] = ops[TLS_BASE][TLS_SW ];
-	ops[TLS_HW ][TLS_SW ].sendpage_locked = NULL;
 
 	ops[TLS_BASE][TLS_HW ] = ops[TLS_BASE][TLS_SW ];
 
 	ops[TLS_SW ][TLS_HW ] = ops[TLS_SW ][TLS_SW ];
 
 	ops[TLS_HW ][TLS_HW ] = ops[TLS_HW ][TLS_SW ];
-	ops[TLS_HW ][TLS_HW ].sendpage_locked = NULL;
 #endif
 #ifdef CONFIG_TLS_TOE
 	ops[TLS_HW_RECORD][TLS_HW_RECORD] = *base;
@@ -1029,7 +1025,6 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
 	prot[TLS_SW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
 	prot[TLS_SW][TLS_BASE].sendmsg = tls_sw_sendmsg;
 	prot[TLS_SW][TLS_BASE].splice_eof = tls_sw_splice_eof;
-	prot[TLS_SW][TLS_BASE].sendpage = tls_sw_sendpage;
 
 	prot[TLS_BASE][TLS_SW] = prot[TLS_BASE][TLS_BASE];
 	prot[TLS_BASE][TLS_SW].recvmsg = tls_sw_recvmsg;
@@ -1045,12 +1040,10 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
 	prot[TLS_HW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
 	prot[TLS_HW][TLS_BASE].sendmsg = tls_device_sendmsg;
 	prot[TLS_HW][TLS_BASE].splice_eof = tls_device_splice_eof;
-	prot[TLS_HW][TLS_BASE].sendpage = tls_device_sendpage;
 
 	prot[TLS_HW][TLS_SW] = prot[TLS_BASE][TLS_SW];
 	prot[TLS_HW][TLS_SW].sendmsg = tls_device_sendmsg;
 	prot[TLS_HW][TLS_SW].splice_eof = tls_device_splice_eof;
-	prot[TLS_HW][TLS_SW].sendpage = tls_device_sendpage;
 
 	prot[TLS_BASE][TLS_HW] = prot[TLS_BASE][TLS_SW];
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 319f61590d2c..9b3aa89a4292 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1281,41 +1281,6 @@ unlock:
 	mutex_unlock(&tls_ctx->tx_lock);
 }
 
-int tls_sw_sendpage_locked(struct sock *sk, struct page *page,
-			   int offset, size_t size, int flags)
-{
-	struct bio_vec bvec;
-	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
-
-	if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
-		      MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY |
-		      MSG_NO_SHARED_FRAGS))
-		return -EOPNOTSUPP;
-	if (flags & MSG_SENDPAGE_NOTLAST)
-		msg.msg_flags |= MSG_MORE;
-
-	bvec_set_page(&bvec, page, size, offset);
-	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-	return tls_sw_sendmsg_locked(sk, &msg, size);
-}
-
-int tls_sw_sendpage(struct sock *sk, struct page *page,
-		    int offset, size_t size, int flags)
-{
-	struct bio_vec bvec;
-	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
-
-	if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
-		      MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY))
-		return -EOPNOTSUPP;
-	if (flags & MSG_SENDPAGE_NOTLAST)
-		msg.msg_flags |= MSG_MORE;
-
-	bvec_set_page(&bvec, page, size, offset);
-	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-	return tls_sw_sendmsg(sk, &msg, size);
-}
-
 static int
 tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
 		bool released)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f9d196439b49..f2f234f0b92c 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -758,8 +758,6 @@ static int unix_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned lon
 static int unix_shutdown(struct socket *, int);
 static int unix_stream_sendmsg(struct socket *, struct msghdr *, size_t);
 static int unix_stream_recvmsg(struct socket *, struct msghdr *, size_t, int);
-static ssize_t unix_stream_sendpage(struct socket *, struct page *, int offset,
-				    size_t size, int flags);
 static ssize_t unix_stream_splice_read(struct socket *, loff_t *ppos,
 				       struct pipe_inode_info *, size_t size,
 				       unsigned int flags);
@@ -852,7 +850,6 @@ static const struct proto_ops unix_stream_ops = {
 	.recvmsg = unix_stream_recvmsg,
 	.read_skb = unix_stream_read_skb,
 	.mmap = sock_no_mmap,
-	.sendpage = unix_stream_sendpage,
 	.splice_read = unix_stream_splice_read,
 	.set_peek_off = unix_set_peek_off,
 	.show_fdinfo = unix_show_fdinfo,
@@ -878,7 +875,6 @@ static const struct proto_ops unix_dgram_ops = {
 	.read_skb = unix_read_skb,
 	.recvmsg = unix_dgram_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 	.set_peek_off = unix_set_peek_off,
 	.show_fdinfo = unix_show_fdinfo,
 };
@@ -902,7 +898,6 @@ static const struct proto_ops unix_seqpacket_ops = {
 	.sendmsg = unix_seqpacket_sendmsg,
 	.recvmsg = unix_seqpacket_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 	.set_peek_off = unix_set_peek_off,
 	.show_fdinfo = unix_show_fdinfo,
 };
@@ -2294,20 +2289,6 @@ out_err:
 	return sent ? : err;
 }
 
-static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
-				    int offset, size_t size, int flags)
-{
-	struct bio_vec bvec;
-	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES };
-
-	if (flags & MSG_SENDPAGE_NOTLAST)
-		msg.msg_flags |= MSG_MORE;
-
-	bvec_set_page(&bvec, page, size, offset);
-	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-	return unix_stream_sendmsg(socket, &msg, size);
-}
-
 static int unix_seqpacket_sendmsg(struct socket *sock, struct msghdr *msg,
 				  size_t len)
 {
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index efb8a0937a13..020cf17ab7e4 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1306,7 +1306,6 @@ static const struct proto_ops vsock_dgram_ops = {
 	.sendmsg = vsock_dgram_sendmsg,
 	.recvmsg = vsock_dgram_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 	.read_skb = vsock_read_skb,
 };
 
@@ -2234,7 +2233,6 @@ static const struct proto_ops vsock_stream_ops = {
 	.sendmsg = vsock_connectible_sendmsg,
 	.recvmsg = vsock_connectible_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 	.set_rcvlowat = vsock_set_rcvlowat,
 	.read_skb = vsock_read_skb,
 };
@@ -2257,7 +2255,6 @@ static const struct proto_ops vsock_seqpacket_ops = {
 	.sendmsg = vsock_connectible_sendmsg,
 	.recvmsg = vsock_connectible_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 	.read_skb = vsock_read_skb,
 };
 
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 5c7ad301d742..0fb5143bec7a 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -1757,7 +1757,6 @@ static const struct proto_ops x25_proto_ops = {
 	.sendmsg = x25_sendmsg,
 	.recvmsg = x25_recvmsg,
 	.mmap = sock_no_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static struct packet_type x25_packet_type __read_mostly = {
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cc1e7f15fa73..5a8c0dd250af 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1389,7 +1389,6 @@ static const struct proto_ops xsk_proto_ops = {
 	.sendmsg = xsk_sendmsg,
 	.recvmsg = xsk_recvmsg,
 	.mmap = xsk_mmap,
-	.sendpage = sock_no_sendpage,
 };
 
 static void xsk_destruct(struct sock *sk)
--
cgit v1.2.3