From 4d496be9ca05d22dbbbba78368ac4bb05bcc67f2 Mon Sep 17 00:00:00 2001 From: David Vernet Date: Mon, 10 Jul 2023 13:30:27 -0500 Subject: bpf,docs: Create new standardization subdirectory The BPF standardization effort is actively underway with the IETF. As described in the BPF Working Group (WG) charter in [0], there are a number of proposed documents, some informational and some proposed standards, that will be drafted as part of the standardization effort. [0]: https://datatracker.ietf.org/wg/bpf/about/ Though the specific documents that will formally be standardized will exist as Internet Drafts (I-D) and WG documents in the BPF WG datatracker page, the source of truth from where those documents will be generated will reside in the kernel documentation tree (originating in the bpf-next tree). Because these documents will be used to generate the I-D and WG documents which will be standardized with the IETF, they are a bit special as far as kernel-tree documentation goes: - They will be dual licensed with LGPL-2.1 OR BSD-2-Clause - IETF I-D and WG documents (the documents which will actually be standardized) will be auto-generated from these documents. In order to keep things clearly organized in the BPF documentation tree, and to make it abundantly clear where standards-related documentation needs to go, we should move standards-relevant documents into a separate standardization/ subdirectory. Signed-off-by: David Vernet Link: https://lore.kernel.org/r/20230710183027.15132-1-void@manifault.com Signed-off-by: Alexei Starovoitov --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index acbe54087d1c..99d8dc9b2850 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3746,7 +3746,7 @@ R: David Vernet L: bpf@vger.kernel.org L: bpf@ietf.org S: Maintained -F: Documentation/bpf/instruction-set.rst +F: Documentation/bpf/standardization/ BPF [GENERAL] (Safe Dynamic Programs and Tools) M: Alexei Starovoitov -- cgit v1.2.3 From 3abf3d15ffff62f906e6c7626efb69d136141556 Mon Sep 17 00:00:00 2001 From: Justin Chen Date: Thu, 13 Jul 2023 15:19:06 -0700 Subject: MAINTAINERS: ASP 2.0 Ethernet driver maintainers Add maintainers entry for ASP 2.0 Ethernet driver. Reviewed-by: Simon Horman Signed-off-by: Florian Fainelli Signed-off-by: Justin Chen Signed-off-by: David S. Miller --- MAINTAINERS | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index d1fc5d26c699..9a5863f1b016 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4145,6 +4145,15 @@ F: drivers/net/mdio/mdio-bcm-unimac.c F: include/linux/platform_data/bcmgenet.h F: include/linux/platform_data/mdio-bcm-unimac.h +BROADCOM ASP 2.0 ETHERNET DRIVER +M: Justin Chen +M: Florian Fainelli +L: bcm-kernel-feedback-list@broadcom.com +L: netdev@vger.kernel.org +S: Supported +F: Documentation/devicetree/bindings/net/brcm,asp-v2.0.yaml +F: drivers/net/ethernet/broadcom/asp2/ + BROADCOM IPROC ARM ARCHITECTURE M: Ray Jui M: Scott Branden -- cgit v1.2.3 From 053c8e1f235dc3f69d13375b32f4209228e1cb96 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Wed, 19 Jul 2023 16:08:51 +0200 Subject: bpf: Add generic attach/detach/query API for multi-progs This adds a generic layer called bpf_mprog which can be reused by different attachment layers to enable multi-program attachment and dependency resolution. In-kernel users of the bpf_mprog don't need to care about the dependency resolution internals, they can just consume it with few API calls. The initial idea of having a generic API sparked out of discussion [0] from an earlier revision of this work where tc's priority was reused and exposed via BPF uapi as a way to coordinate dependencies among tc BPF programs, similar as-is for classic tc BPF. The feedback was that priority provides a bad user experience and is hard to use [1], e.g.: I cannot help but feel that priority logic copy-paste from old tc, netfilter and friends is done because "that's how things were done in the past". [...] Priority gets exposed everywhere in uapi all the way to bpftool when it's right there for users to understand. And that's the main problem with it. The user don't want to and don't need to be aware of it, but uapi forces them to pick the priority. [...] Your cover letter [0] example proves that in real life different service pick the same priority. They simply don't know any better. Priority is an unnecessary magic that apps _have_ to pick, so they just copy-paste and everyone ends up using the same. The course of the discussion showed more and more the need for a generic, reusable API where the "same look and feel" can be applied for various other program types beyond just tc BPF, for example XDP today does not have multi- program support in kernel, but also there was interest around this API for improving management of cgroup program types. Such common multi-program management concept is useful for BPF management daemons or user space BPF applications coordinating internally about their attachments. Both from Cilium and Meta side [2], we've collected the following requirements for a generic attach/detach/query API for multi-progs which has been implemented as part of this work: - Support prog-based attach/detach and link API - Dependency directives (can also be combined): - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none} - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user space application does not need CAP_SYS_ADMIN to retrieve foreign fds via bpf_*_get_fd_by_id() - BPF_F_LINK flag as {prog,link} toggle - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and BPF_F_AFTER will just append for attaching - Enforced only at attach time - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their own infra for replacing their internal prog - If no flags are set, then it's default append behavior for attaching - Internal revision counter and optionally being able to pass expected_revision - User space application can query current state with revision, and pass it along for attachment to assert current state before doing updates - Query also gets extension for link_ids array and link_attach_flags: - prog_ids are always filled with program IDs - link_ids are filled with link IDs when link was used, otherwise 0 - {prog,link}_attach_flags for holding {prog,link}-specific flags - Must be easy to integrate/reuse for in-kernel users The uapi-side changes needed for supporting bpf_mprog are rather minimal, consisting of the additions of the attachment flags, revision counter, and expanding existing union with relative_{fd,id} member. The bpf_mprog framework consists of an bpf_mprog_entry object which holds an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path structure) is part of bpf_mprog_bundle. Both have been separated, so that fast-path gets efficient packing of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen instead of linked list or other structures to remove unnecessary indirections for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry is populated and then just swapped which avoids additional allocations that could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could be converted to dynamic allocation if necessary at a point in future. Locking is deferred to the in-kernel user of bpf_mprog, for example, in case of tcx which uses this API in the next patch, it piggybacks on rtnl. An extensive test suite for checking all aspects of this API for prog-based attach/detach and link API comes as BPF selftests in this series. Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog management. [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov --- MAINTAINERS | 1 + include/linux/bpf_mprog.h | 318 +++++++++++++++++++++++++++++ include/uapi/linux/bpf.h | 36 +++- kernel/bpf/Makefile | 2 +- kernel/bpf/mprog.c | 445 +++++++++++++++++++++++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 36 +++- 6 files changed, 821 insertions(+), 17 deletions(-) create mode 100644 include/linux/bpf_mprog.h create mode 100644 kernel/bpf/mprog.c (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index 9a5863f1b016..678bef9f60b4 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3684,6 +3684,7 @@ F: include/linux/filter.h F: include/linux/tnum.h F: kernel/bpf/core.c F: kernel/bpf/dispatcher.c +F: kernel/bpf/mprog.c F: kernel/bpf/syscall.c F: kernel/bpf/tnum.c F: kernel/bpf/trampoline.c diff --git a/include/linux/bpf_mprog.h b/include/linux/bpf_mprog.h new file mode 100644 index 000000000000..6feefec43422 --- /dev/null +++ b/include/linux/bpf_mprog.h @@ -0,0 +1,318 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (c) 2023 Isovalent */ +#ifndef __BPF_MPROG_H +#define __BPF_MPROG_H + +#include + +/* bpf_mprog framework: + * + * bpf_mprog is a generic layer for multi-program attachment. In-kernel users + * of the bpf_mprog don't need to care about the dependency resolution + * internals, they can just consume it with few API calls. Currently available + * dependency directives are BPF_F_{BEFORE,AFTER} which enable insertion of + * a BPF program or BPF link relative to an existing BPF program or BPF link + * inside the multi-program array as well as prepend and append behavior if + * no relative object was specified, see corresponding selftests for concrete + * examples (e.g. tc_links and tc_opts test cases of test_progs). + * + * Usage of bpf_mprog_{attach,detach,query}() core APIs with pseudo code: + * + * Attach case: + * + * struct bpf_mprog_entry *entry, *entry_new; + * int ret; + * + * // bpf_mprog user-side lock + * // fetch active @entry from attach location + * [...] + * ret = bpf_mprog_attach(entry, &entry_new, [...]); + * if (!ret) { + * if (entry != entry_new) { + * // swap @entry to @entry_new at attach location + * // ensure there are no inflight users of @entry: + * synchronize_rcu(); + * } + * bpf_mprog_commit(entry); + * } else { + * // error path, bail out, propagate @ret + * } + * // bpf_mprog user-side unlock + * + * Detach case: + * + * struct bpf_mprog_entry *entry, *entry_new; + * int ret; + * + * // bpf_mprog user-side lock + * // fetch active @entry from attach location + * [...] + * ret = bpf_mprog_detach(entry, &entry_new, [...]); + * if (!ret) { + * // all (*) marked is optional and depends on the use-case + * // whether bpf_mprog_bundle should be freed or not + * if (!bpf_mprog_total(entry_new)) (*) + * entry_new = NULL (*) + * // swap @entry to @entry_new at attach location + * // ensure there are no inflight users of @entry: + * synchronize_rcu(); + * bpf_mprog_commit(entry); + * if (!entry_new) (*) + * // free bpf_mprog_bundle (*) + * } else { + * // error path, bail out, propagate @ret + * } + * // bpf_mprog user-side unlock + * + * Query case: + * + * struct bpf_mprog_entry *entry; + * int ret; + * + * // bpf_mprog user-side lock + * // fetch active @entry from attach location + * [...] + * ret = bpf_mprog_query(attr, uattr, entry); + * // bpf_mprog user-side unlock + * + * Data/fast path: + * + * struct bpf_mprog_entry *entry; + * struct bpf_mprog_fp *fp; + * struct bpf_prog *prog; + * int ret = [...]; + * + * rcu_read_lock(); + * // fetch active @entry from attach location + * [...] + * bpf_mprog_foreach_prog(entry, fp, prog) { + * ret = bpf_prog_run(prog, [...]); + * // process @ret from program + * } + * [...] + * rcu_read_unlock(); + * + * bpf_mprog locking considerations: + * + * bpf_mprog_{attach,detach,query}() must be protected by an external lock + * (like RTNL in case of tcx). + * + * bpf_mprog_entry pointer can be an __rcu annotated pointer (in case of tcx + * the netdevice has tcx_ingress and tcx_egress __rcu pointer) which gets + * updated via rcu_assign_pointer() pointing to the active bpf_mprog_entry of + * the bpf_mprog_bundle. + * + * Fast path accesses the active bpf_mprog_entry within RCU critical section + * (in case of tcx it runs in NAPI which provides RCU protection there, + * other users might need explicit rcu_read_lock()). The bpf_mprog_commit() + * assumes that for the old bpf_mprog_entry there are no inflight users + * anymore. + * + * The READ_ONCE()/WRITE_ONCE() pairing for bpf_mprog_fp's prog access is for + * the replacement case where we don't swap the bpf_mprog_entry. + */ + +#define bpf_mprog_foreach_tuple(entry, fp, cp, t) \ + for (fp = &entry->fp_items[0], cp = &entry->parent->cp_items[0];\ + ({ \ + t.prog = READ_ONCE(fp->prog); \ + t.link = cp->link; \ + t.prog; \ + }); \ + fp++, cp++) + +#define bpf_mprog_foreach_prog(entry, fp, p) \ + for (fp = &entry->fp_items[0]; \ + (p = READ_ONCE(fp->prog)); \ + fp++) + +#define BPF_MPROG_MAX 64 + +struct bpf_mprog_fp { + struct bpf_prog *prog; +}; + +struct bpf_mprog_cp { + struct bpf_link *link; +}; + +struct bpf_mprog_entry { + struct bpf_mprog_fp fp_items[BPF_MPROG_MAX]; + struct bpf_mprog_bundle *parent; +}; + +struct bpf_mprog_bundle { + struct bpf_mprog_entry a; + struct bpf_mprog_entry b; + struct bpf_mprog_cp cp_items[BPF_MPROG_MAX]; + struct bpf_prog *ref; + atomic64_t revision; + u32 count; +}; + +struct bpf_tuple { + struct bpf_prog *prog; + struct bpf_link *link; +}; + +static inline struct bpf_mprog_entry * +bpf_mprog_peer(const struct bpf_mprog_entry *entry) +{ + if (entry == &entry->parent->a) + return &entry->parent->b; + else + return &entry->parent->a; +} + +static inline void bpf_mprog_bundle_init(struct bpf_mprog_bundle *bundle) +{ + BUILD_BUG_ON(sizeof(bundle->a.fp_items[0]) > sizeof(u64)); + BUILD_BUG_ON(ARRAY_SIZE(bundle->a.fp_items) != + ARRAY_SIZE(bundle->cp_items)); + + memset(bundle, 0, sizeof(*bundle)); + atomic64_set(&bundle->revision, 1); + bundle->a.parent = bundle; + bundle->b.parent = bundle; +} + +static inline void bpf_mprog_inc(struct bpf_mprog_entry *entry) +{ + entry->parent->count++; +} + +static inline void bpf_mprog_dec(struct bpf_mprog_entry *entry) +{ + entry->parent->count--; +} + +static inline int bpf_mprog_max(void) +{ + return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1; +} + +static inline int bpf_mprog_total(struct bpf_mprog_entry *entry) +{ + int total = entry->parent->count; + + WARN_ON_ONCE(total > bpf_mprog_max()); + return total; +} + +static inline bool bpf_mprog_exists(struct bpf_mprog_entry *entry, + struct bpf_prog *prog) +{ + const struct bpf_mprog_fp *fp; + const struct bpf_prog *tmp; + + bpf_mprog_foreach_prog(entry, fp, tmp) { + if (tmp == prog) + return true; + } + return false; +} + +static inline void bpf_mprog_mark_for_release(struct bpf_mprog_entry *entry, + struct bpf_tuple *tuple) +{ + WARN_ON_ONCE(entry->parent->ref); + if (!tuple->link) + entry->parent->ref = tuple->prog; +} + +static inline void bpf_mprog_complete_release(struct bpf_mprog_entry *entry) +{ + /* In the non-link case prog deletions can only drop the reference + * to the prog after the bpf_mprog_entry got swapped and the + * bpf_mprog ensured that there are no inflight users anymore. + * + * Paired with bpf_mprog_mark_for_release(). + */ + if (entry->parent->ref) { + bpf_prog_put(entry->parent->ref); + entry->parent->ref = NULL; + } +} + +static inline void bpf_mprog_revision_new(struct bpf_mprog_entry *entry) +{ + atomic64_inc(&entry->parent->revision); +} + +static inline void bpf_mprog_commit(struct bpf_mprog_entry *entry) +{ + bpf_mprog_complete_release(entry); + bpf_mprog_revision_new(entry); +} + +static inline u64 bpf_mprog_revision(struct bpf_mprog_entry *entry) +{ + return atomic64_read(&entry->parent->revision); +} + +static inline void bpf_mprog_entry_copy(struct bpf_mprog_entry *dst, + struct bpf_mprog_entry *src) +{ + memcpy(dst->fp_items, src->fp_items, sizeof(src->fp_items)); +} + +static inline void bpf_mprog_entry_grow(struct bpf_mprog_entry *entry, int idx) +{ + int total = bpf_mprog_total(entry); + + memmove(entry->fp_items + idx + 1, + entry->fp_items + idx, + (total - idx) * sizeof(struct bpf_mprog_fp)); + + memmove(entry->parent->cp_items + idx + 1, + entry->parent->cp_items + idx, + (total - idx) * sizeof(struct bpf_mprog_cp)); +} + +static inline void bpf_mprog_entry_shrink(struct bpf_mprog_entry *entry, int idx) +{ + /* Total array size is needed in this case to enure the NULL + * entry is copied at the end. + */ + int total = ARRAY_SIZE(entry->fp_items); + + memmove(entry->fp_items + idx, + entry->fp_items + idx + 1, + (total - idx - 1) * sizeof(struct bpf_mprog_fp)); + + memmove(entry->parent->cp_items + idx, + entry->parent->cp_items + idx + 1, + (total - idx - 1) * sizeof(struct bpf_mprog_cp)); +} + +static inline void bpf_mprog_read(struct bpf_mprog_entry *entry, u32 idx, + struct bpf_mprog_fp **fp, + struct bpf_mprog_cp **cp) +{ + *fp = &entry->fp_items[idx]; + *cp = &entry->parent->cp_items[idx]; +} + +static inline void bpf_mprog_write(struct bpf_mprog_fp *fp, + struct bpf_mprog_cp *cp, + struct bpf_tuple *tuple) +{ + WRITE_ONCE(fp->prog, tuple->prog); + cp->link = tuple->link; +} + +int bpf_mprog_attach(struct bpf_mprog_entry *entry, + struct bpf_mprog_entry **entry_new, + struct bpf_prog *prog_new, struct bpf_link *link, + struct bpf_prog *prog_old, + u32 flags, u32 id_or_fd, u64 revision); + +int bpf_mprog_detach(struct bpf_mprog_entry *entry, + struct bpf_mprog_entry **entry_new, + struct bpf_prog *prog, struct bpf_link *link, + u32 flags, u32 id_or_fd, u64 revision); + +int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr, + struct bpf_mprog_entry *entry); + +#endif /* __BPF_MPROG_H */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 9ed59896ebc5..d4c07e435336 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1113,7 +1113,12 @@ enum bpf_perf_event_type { */ #define BPF_F_ALLOW_OVERRIDE (1U << 0) #define BPF_F_ALLOW_MULTI (1U << 1) +/* Generic attachment flags. */ #define BPF_F_REPLACE (1U << 2) +#define BPF_F_BEFORE (1U << 3) +#define BPF_F_AFTER (1U << 4) +#define BPF_F_ID (1U << 5) +#define BPF_F_LINK BPF_F_LINK /* 1 << 13 */ /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the * verifier will perform strict alignment checking as if the kernel @@ -1444,14 +1449,19 @@ union bpf_attr { }; struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ - __u32 target_fd; /* container object to attach to */ - __u32 attach_bpf_fd; /* eBPF program to attach */ + union { + __u32 target_fd; /* target object to attach to or ... */ + __u32 target_ifindex; /* target ifindex */ + }; + __u32 attach_bpf_fd; __u32 attach_type; __u32 attach_flags; - __u32 replace_bpf_fd; /* previously attached eBPF - * program to replace if - * BPF_F_REPLACE is used - */ + __u32 replace_bpf_fd; + union { + __u32 relative_fd; + __u32 relative_id; + }; + __u64 expected_revision; }; struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */ @@ -1497,16 +1507,26 @@ union bpf_attr { } info; struct { /* anonymous struct used by BPF_PROG_QUERY command */ - __u32 target_fd; /* container object to query */ + union { + __u32 target_fd; /* target object to query or ... */ + __u32 target_ifindex; /* target ifindex */ + }; __u32 attach_type; __u32 query_flags; __u32 attach_flags; __aligned_u64 prog_ids; - __u32 prog_cnt; + union { + __u32 prog_cnt; + __u32 count; + }; + __u32 :32; /* output: per-program attach_flags. * not allowed to be set during effective query. */ __aligned_u64 prog_attach_flags; + __aligned_u64 link_ids; + __aligned_u64 link_attach_flags; + __u64 revision; } query; struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */ diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 1d3892168d32..1bea2eb912cd 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -12,7 +12,7 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o -obj-$(CONFIG_BPF_SYSCALL) += disasm.o +obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o obj-$(CONFIG_BPF_JIT) += trampoline.o obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o obj-$(CONFIG_BPF_JIT) += dispatcher.o diff --git a/kernel/bpf/mprog.c b/kernel/bpf/mprog.c new file mode 100644 index 000000000000..f7816d2bc3e4 --- /dev/null +++ b/kernel/bpf/mprog.c @@ -0,0 +1,445 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2023 Isovalent */ + +#include +#include + +static int bpf_mprog_link(struct bpf_tuple *tuple, + u32 id_or_fd, u32 flags, + enum bpf_prog_type type) +{ + struct bpf_link *link = ERR_PTR(-EINVAL); + bool id = flags & BPF_F_ID; + + if (id) + link = bpf_link_by_id(id_or_fd); + else if (id_or_fd) + link = bpf_link_get_from_fd(id_or_fd); + if (IS_ERR(link)) + return PTR_ERR(link); + if (type && link->prog->type != type) { + bpf_link_put(link); + return -EINVAL; + } + + tuple->link = link; + tuple->prog = link->prog; + return 0; +} + +static int bpf_mprog_prog(struct bpf_tuple *tuple, + u32 id_or_fd, u32 flags, + enum bpf_prog_type type) +{ + struct bpf_prog *prog = ERR_PTR(-EINVAL); + bool id = flags & BPF_F_ID; + + if (id) + prog = bpf_prog_by_id(id_or_fd); + else if (id_or_fd) + prog = bpf_prog_get(id_or_fd); + if (IS_ERR(prog)) + return PTR_ERR(prog); + if (type && prog->type != type) { + bpf_prog_put(prog); + return -EINVAL; + } + + tuple->link = NULL; + tuple->prog = prog; + return 0; +} + +static int bpf_mprog_tuple_relative(struct bpf_tuple *tuple, + u32 id_or_fd, u32 flags, + enum bpf_prog_type type) +{ + bool link = flags & BPF_F_LINK; + bool id = flags & BPF_F_ID; + + memset(tuple, 0, sizeof(*tuple)); + if (link) + return bpf_mprog_link(tuple, id_or_fd, flags, type); + /* If no relevant flag is set and no id_or_fd was passed, then + * tuple link/prog is just NULLed. This is the case when before/ + * after selects first/last position without passing fd. + */ + if (!id && !id_or_fd) + return 0; + return bpf_mprog_prog(tuple, id_or_fd, flags, type); +} + +static void bpf_mprog_tuple_put(struct bpf_tuple *tuple) +{ + if (tuple->link) + bpf_link_put(tuple->link); + else if (tuple->prog) + bpf_prog_put(tuple->prog); +} + +/* The bpf_mprog_{replace,delete}() operate on exact idx position with the + * one exception that for deletion we support delete from front/back. In + * case of front idx is -1, in case of back idx is bpf_mprog_total(entry). + * Adjustment to first and last entry is trivial. The bpf_mprog_insert() + * we have to deal with the following cases: + * + * idx + before: + * + * Insert P4 before P3: idx for old array is 1, idx for new array is 2, + * hence we adjust target idx for the new array, so that memmove copies + * P1 and P2 to the new entry, and we insert P4 into idx 2. Inserting + * before P1 would have old idx -1 and new idx 0. + * + * +--+--+--+ +--+--+--+--+ +--+--+--+--+ + * |P1|P2|P3| ==> |P1|P2| |P3| ==> |P1|P2|P4|P3| + * +--+--+--+ +--+--+--+--+ +--+--+--+--+ + * + * idx + after: + * + * Insert P4 after P2: idx for old array is 2, idx for new array is 2. + * Again, memmove copies P1 and P2 to the new entry, and we insert P4 + * into idx 2. Inserting after P3 would have both old/new idx at 4 aka + * bpf_mprog_total(entry). + * + * +--+--+--+ +--+--+--+--+ +--+--+--+--+ + * |P1|P2|P3| ==> |P1|P2| |P3| ==> |P1|P2|P4|P3| + * +--+--+--+ +--+--+--+--+ +--+--+--+--+ + */ +static int bpf_mprog_replace(struct bpf_mprog_entry *entry, + struct bpf_mprog_entry **entry_new, + struct bpf_tuple *ntuple, int idx) +{ + struct bpf_mprog_fp *fp; + struct bpf_mprog_cp *cp; + struct bpf_prog *oprog; + + bpf_mprog_read(entry, idx, &fp, &cp); + oprog = READ_ONCE(fp->prog); + bpf_mprog_write(fp, cp, ntuple); + if (!ntuple->link) { + WARN_ON_ONCE(cp->link); + bpf_prog_put(oprog); + } + *entry_new = entry; + return 0; +} + +static int bpf_mprog_insert(struct bpf_mprog_entry *entry, + struct bpf_mprog_entry **entry_new, + struct bpf_tuple *ntuple, int idx, u32 flags) +{ + int total = bpf_mprog_total(entry); + struct bpf_mprog_entry *peer; + struct bpf_mprog_fp *fp; + struct bpf_mprog_cp *cp; + + peer = bpf_mprog_peer(entry); + bpf_mprog_entry_copy(peer, entry); + if (idx == total) + goto insert; + else if (flags & BPF_F_BEFORE) + idx += 1; + bpf_mprog_entry_grow(peer, idx); +insert: + bpf_mprog_read(peer, idx, &fp, &cp); + bpf_mprog_write(fp, cp, ntuple); + bpf_mprog_inc(peer); + *entry_new = peer; + return 0; +} + +static int bpf_mprog_delete(struct bpf_mprog_entry *entry, + struct bpf_mprog_entry **entry_new, + struct bpf_tuple *dtuple, int idx) +{ + int total = bpf_mprog_total(entry); + struct bpf_mprog_entry *peer; + + peer = bpf_mprog_peer(entry); + bpf_mprog_entry_copy(peer, entry); + if (idx == -1) + idx = 0; + else if (idx == total) + idx = total - 1; + bpf_mprog_entry_shrink(peer, idx); + bpf_mprog_dec(peer); + bpf_mprog_mark_for_release(peer, dtuple); + *entry_new = peer; + return 0; +} + +/* In bpf_mprog_pos_*() we evaluate the target position for the BPF + * program/link that needs to be replaced, inserted or deleted for + * each "rule" independently. If all rules agree on that position + * or existing element, then enact replacement, addition or deletion. + * If this is not the case, then the request cannot be satisfied and + * we bail out with an error. + */ +static int bpf_mprog_pos_exact(struct bpf_mprog_entry *entry, + struct bpf_tuple *tuple) +{ + struct bpf_mprog_fp *fp; + struct bpf_mprog_cp *cp; + int i; + + for (i = 0; i < bpf_mprog_total(entry); i++) { + bpf_mprog_read(entry, i, &fp, &cp); + if (tuple->prog == READ_ONCE(fp->prog)) + return tuple->link == cp->link ? i : -EBUSY; + } + return -ENOENT; +} + +static int bpf_mprog_pos_before(struct bpf_mprog_entry *entry, + struct bpf_tuple *tuple) +{ + struct bpf_mprog_fp *fp; + struct bpf_mprog_cp *cp; + int i; + + for (i = 0; i < bpf_mprog_total(entry); i++) { + bpf_mprog_read(entry, i, &fp, &cp); + if (tuple->prog == READ_ONCE(fp->prog) && + (!tuple->link || tuple->link == cp->link)) + return i - 1; + } + return tuple->prog ? -ENOENT : -1; +} + +static int bpf_mprog_pos_after(struct bpf_mprog_entry *entry, + struct bpf_tuple *tuple) +{ + struct bpf_mprog_fp *fp; + struct bpf_mprog_cp *cp; + int i; + + for (i = 0; i < bpf_mprog_total(entry); i++) { + bpf_mprog_read(entry, i, &fp, &cp); + if (tuple->prog == READ_ONCE(fp->prog) && + (!tuple->link || tuple->link == cp->link)) + return i + 1; + } + return tuple->prog ? -ENOENT : bpf_mprog_total(entry); +} + +int bpf_mprog_attach(struct bpf_mprog_entry *entry, + struct bpf_mprog_entry **entry_new, + struct bpf_prog *prog_new, struct bpf_link *link, + struct bpf_prog *prog_old, + u32 flags, u32 id_or_fd, u64 revision) +{ + struct bpf_tuple rtuple, ntuple = { + .prog = prog_new, + .link = link, + }, otuple = { + .prog = prog_old, + .link = link, + }; + int ret, idx = -ERANGE, tidx; + + if (revision && revision != bpf_mprog_revision(entry)) + return -ESTALE; + if (bpf_mprog_exists(entry, prog_new)) + return -EEXIST; + ret = bpf_mprog_tuple_relative(&rtuple, id_or_fd, + flags & ~BPF_F_REPLACE, + prog_new->type); + if (ret) + return ret; + if (flags & BPF_F_REPLACE) { + tidx = bpf_mprog_pos_exact(entry, &otuple); + if (tidx < 0) { + ret = tidx; + goto out; + } + idx = tidx; + } + if (flags & BPF_F_BEFORE) { + tidx = bpf_mprog_pos_before(entry, &rtuple); + if (tidx < -1 || (idx >= -1 && tidx != idx)) { + ret = tidx < -1 ? tidx : -ERANGE; + goto out; + } + idx = tidx; + } + if (flags & BPF_F_AFTER) { + tidx = bpf_mprog_pos_after(entry, &rtuple); + if (tidx < -1 || (idx >= -1 && tidx != idx)) { + ret = tidx < 0 ? tidx : -ERANGE; + goto out; + } + idx = tidx; + } + if (idx < -1) { + if (rtuple.prog || flags) { + ret = -EINVAL; + goto out; + } + idx = bpf_mprog_total(entry); + flags = BPF_F_AFTER; + } + if (idx >= bpf_mprog_max()) { + ret = -ERANGE; + goto out; + } + if (flags & BPF_F_REPLACE) + ret = bpf_mprog_replace(entry, entry_new, &ntuple, idx); + else + ret = bpf_mprog_insert(entry, entry_new, &ntuple, idx, flags); +out: + bpf_mprog_tuple_put(&rtuple); + return ret; +} + +static int bpf_mprog_fetch(struct bpf_mprog_entry *entry, + struct bpf_tuple *tuple, int idx) +{ + int total = bpf_mprog_total(entry); + struct bpf_mprog_cp *cp; + struct bpf_mprog_fp *fp; + struct bpf_prog *prog; + struct bpf_link *link; + + if (idx == -1) + idx = 0; + else if (idx == total) + idx = total - 1; + bpf_mprog_read(entry, idx, &fp, &cp); + prog = READ_ONCE(fp->prog); + link = cp->link; + /* The deletion request can either be without filled tuple in which + * case it gets populated here based on idx, or with filled tuple + * where the only thing we end up doing is the WARN_ON_ONCE() assert. + * If we hit a BPF link at the given index, it must not be removed + * from opts path. + */ + if (link && !tuple->link) + return -EBUSY; + WARN_ON_ONCE(tuple->prog && tuple->prog != prog); + WARN_ON_ONCE(tuple->link && tuple->link != link); + tuple->prog = prog; + tuple->link = link; + return 0; +} + +int bpf_mprog_detach(struct bpf_mprog_entry *entry, + struct bpf_mprog_entry **entry_new, + struct bpf_prog *prog, struct bpf_link *link, + u32 flags, u32 id_or_fd, u64 revision) +{ + struct bpf_tuple rtuple, dtuple = { + .prog = prog, + .link = link, + }; + int ret, idx = -ERANGE, tidx; + + if (flags & BPF_F_REPLACE) + return -EINVAL; + if (revision && revision != bpf_mprog_revision(entry)) + return -ESTALE; + ret = bpf_mprog_tuple_relative(&rtuple, id_or_fd, flags, + prog ? prog->type : + BPF_PROG_TYPE_UNSPEC); + if (ret) + return ret; + if (dtuple.prog) { + tidx = bpf_mprog_pos_exact(entry, &dtuple); + if (tidx < 0) { + ret = tidx; + goto out; + } + idx = tidx; + } + if (flags & BPF_F_BEFORE) { + tidx = bpf_mprog_pos_before(entry, &rtuple); + if (tidx < -1 || (idx >= -1 && tidx != idx)) { + ret = tidx < -1 ? tidx : -ERANGE; + goto out; + } + idx = tidx; + } + if (flags & BPF_F_AFTER) { + tidx = bpf_mprog_pos_after(entry, &rtuple); + if (tidx < -1 || (idx >= -1 && tidx != idx)) { + ret = tidx < 0 ? tidx : -ERANGE; + goto out; + } + idx = tidx; + } + if (idx < -1) { + if (rtuple.prog || flags) { + ret = -EINVAL; + goto out; + } + idx = bpf_mprog_total(entry); + flags = BPF_F_AFTER; + } + if (idx >= bpf_mprog_max()) { + ret = -ERANGE; + goto out; + } + ret = bpf_mprog_fetch(entry, &dtuple, idx); + if (ret) + goto out; + ret = bpf_mprog_delete(entry, entry_new, &dtuple, idx); +out: + bpf_mprog_tuple_put(&rtuple); + return ret; +} + +int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr, + struct bpf_mprog_entry *entry) +{ + u32 __user *uprog_flags, *ulink_flags; + u32 __user *uprog_id, *ulink_id; + struct bpf_mprog_fp *fp; + struct bpf_mprog_cp *cp; + struct bpf_prog *prog; + const u32 flags = 0; + int i, ret = 0; + u32 id, count; + u64 revision; + + if (attr->query.query_flags || attr->query.attach_flags) + return -EINVAL; + revision = bpf_mprog_revision(entry); + count = bpf_mprog_total(entry); + if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags))) + return -EFAULT; + if (copy_to_user(&uattr->query.revision, &revision, sizeof(revision))) + return -EFAULT; + if (copy_to_user(&uattr->query.count, &count, sizeof(count))) + return -EFAULT; + uprog_id = u64_to_user_ptr(attr->query.prog_ids); + uprog_flags = u64_to_user_ptr(attr->query.prog_attach_flags); + ulink_id = u64_to_user_ptr(attr->query.link_ids); + ulink_flags = u64_to_user_ptr(attr->query.link_attach_flags); + if (attr->query.count == 0 || !uprog_id || !count) + return 0; + if (attr->query.count < count) { + count = attr->query.count; + ret = -ENOSPC; + } + for (i = 0; i < bpf_mprog_max(); i++) { + bpf_mprog_read(entry, i, &fp, &cp); + prog = READ_ONCE(fp->prog); + if (!prog) + break; + id = prog->aux->id; + if (copy_to_user(uprog_id + i, &id, sizeof(id))) + return -EFAULT; + if (uprog_flags && + copy_to_user(uprog_flags + i, &flags, sizeof(flags))) + return -EFAULT; + id = cp->link ? cp->link->id : 0; + if (ulink_id && + copy_to_user(ulink_id + i, &id, sizeof(id))) + return -EFAULT; + if (ulink_flags && + copy_to_user(ulink_flags + i, &flags, sizeof(flags))) + return -EFAULT; + if (i + 1 == count) + break; + } + return ret; +} diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 600d0caebbd8..1c166870cdf3 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1113,7 +1113,12 @@ enum bpf_perf_event_type { */ #define BPF_F_ALLOW_OVERRIDE (1U << 0) #define BPF_F_ALLOW_MULTI (1U << 1) +/* Generic attachment flags. */ #define BPF_F_REPLACE (1U << 2) +#define BPF_F_BEFORE (1U << 3) +#define BPF_F_AFTER (1U << 4) +#define BPF_F_ID (1U << 5) +#define BPF_F_LINK BPF_F_LINK /* 1 << 13 */ /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the * verifier will perform strict alignment checking as if the kernel @@ -1444,14 +1449,19 @@ union bpf_attr { }; struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ - __u32 target_fd; /* container object to attach to */ - __u32 attach_bpf_fd; /* eBPF program to attach */ + union { + __u32 target_fd; /* target object to attach to or ... */ + __u32 target_ifindex; /* target ifindex */ + }; + __u32 attach_bpf_fd; __u32 attach_type; __u32 attach_flags; - __u32 replace_bpf_fd; /* previously attached eBPF - * program to replace if - * BPF_F_REPLACE is used - */ + __u32 replace_bpf_fd; + union { + __u32 relative_fd; + __u32 relative_id; + }; + __u64 expected_revision; }; struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */ @@ -1497,16 +1507,26 @@ union bpf_attr { } info; struct { /* anonymous struct used by BPF_PROG_QUERY command */ - __u32 target_fd; /* container object to query */ + union { + __u32 target_fd; /* target object to query or ... */ + __u32 target_ifindex; /* target ifindex */ + }; __u32 attach_type; __u32 query_flags; __u32 attach_flags; __aligned_u64 prog_ids; - __u32 prog_cnt; + union { + __u32 prog_cnt; + __u32 count; + }; + __u32 :32; /* output: per-program attach_flags. * not allowed to be set during effective query. */ __aligned_u64 prog_attach_flags; + __aligned_u64 link_ids; + __aligned_u64 link_attach_flags; + __u64 revision; } query; struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */ -- cgit v1.2.3 From e420bed025071a623d2720a92bc2245c84757ecb Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Wed, 19 Jul 2023 16:08:52 +0200 Subject: bpf: Add fd-based tcx multi-prog infra with link support This work refactors and adds a lightweight extension ("tcx") to the tc BPF ingress and egress data path side for allowing BPF program management based on fds via bpf() syscall through the newly added generic multi-prog API. The main goal behind this work which we also presented at LPC [0] last year and a recent update at LSF/MM/BPF this year [3] is to support long-awaited BPF link functionality for tc BPF programs, which allows for a model of safe ownership and program detachment. Given the rise in tc BPF users in cloud native environments, this becomes necessary to avoid hard to debug incidents either through stale leftover programs or 3rd party applications accidentally stepping on each others toes. As a recap, a BPF link represents the attachment of a BPF program to a BPF hook point. The BPF link holds a single reference to keep BPF program alive. Moreover, hook points do not reference a BPF link, only the application's fd or pinning does. A BPF link holds meta-data specific to attachment and implements operations for link creation, (atomic) BPF program update, detachment and introspection. The motivation for BPF links for tc BPF programs is multi-fold, for example: - From Meta: "It's especially important for applications that are deployed fleet-wide and that don't "control" hosts they are deployed to. If such application crashes and no one notices and does anything about that, BPF program will keep running draining resources or even just, say, dropping packets. We at FB had outages due to such permanent BPF attachment semantics. With fd-based BPF link we are getting a framework, which allows safe, auto-detachable behavior by default, unless application explicitly opts in by pinning the BPF link." [1] - From Cilium-side the tc BPF programs we attach to host-facing veth devices and phys devices build the core datapath for Kubernetes Pods, and they implement forwarding, load-balancing, policy, EDT-management, etc, within BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently experienced hard-to-debug issues in a user's staging environment where another Kubernetes application using tc BPF attached to the same prio/handle of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath it. The goal is to establish a clear/safe ownership model via links which cannot accidentally be overridden. [0,2] BPF links for tc can co-exist with non-link attachments, and the semantics are in line also with XDP links: BPF links cannot replace other BPF links, BPF links cannot replace non-BPF links, non-BPF links cannot replace BPF links and lastly only non-BPF links can replace non-BPF links. In case of Cilium, this would solve mentioned issue of safe ownership model as 3rd party applications would not be able to accidentally wipe Cilium programs, even if they are not BPF link aware. Earlier attempts [4] have tried to integrate BPF links into core tc machinery to solve cls_bpf, which has been intrusive to the generic tc kernel API with extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could be wiped from the qdisc also. Locking a tc BPF program in place this way, is getting into layering hacks given the two object models are vastly different. We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF attach API, so that the BPF link implementation blends in naturally similar to other link types which are fd-based and without the need for changing core tc internal APIs. BPF programs for tc can then be successively migrated from classic cls_bpf to the new tc BPF link without needing to change the program's source code, just the BPF loader mechanics for attaching is sufficient. For the current tc framework, there is no change in behavior with this change and neither does this change touch on tc core kernel APIs. The gist of this patch is that the ingress and egress hook have a lightweight, qdisc-less extension for BPF to attach its tc BPF programs, in other words, a minimal entry point for tc BPF. The name tcx has been suggested from discussion of earlier revisions of this work as a good fit, and to more easily differ between the classic cls_bpf attachment and the fd-based one. For the ingress and egress tcx points, the device holds a cache-friendly array with program pointers which is separated from control plane (slow-path) data. Earlier versions of this work used priority to determine ordering and expression of dependencies similar as with classic tc, but it was challenged that for something more future-proof a better user experience is required. Hence this resulted in the design and development of the generic attach/detach/query API for multi-progs. See prior patch with its discussion on the API design. tcx is the first user and later we plan to integrate also others, for example, one candidate is multi-prog support for XDP which would benefit and have the same 'look and feel' from API perspective. The goal with tcx is to have maximum compatibility to existing tc BPF programs, so they don't need to be rewritten specifically. Compatibility to call into classic tcf_classify() is also provided in order to allow successive migration or both to cleanly co-exist where needed given its all one logical tc layer and the tcx plus classic tc cls/act build one logical overall processing pipeline. tcx supports the simplified return codes TCX_NEXT which is non-terminating (go to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT. The fd-based API is behind a static key, so that when unused the code is also not entered. The struct tcx_entry's program array is currently static, but could be made dynamic if necessary at a point in future. The a/b pair swap design has been chosen so that for detachment there are no allocations which otherwise could fail. The work has been tested with tc-testing selftest suite which all passes, as well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB. Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews of this work. [0] https://lpc.events/event/16/contributions/1353/ [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com Signed-off-by: Daniel Borkmann Acked-by: Jakub Kicinski Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov --- MAINTAINERS | 4 +- include/linux/bpf_mprog.h | 9 ++ include/linux/netdevice.h | 15 +- include/linux/skbuff.h | 4 +- include/net/sch_generic.h | 2 +- include/net/tcx.h | 206 ++++++++++++++++++++++++ include/uapi/linux/bpf.h | 34 +++- kernel/bpf/Kconfig | 1 + kernel/bpf/Makefile | 1 + kernel/bpf/syscall.c | 82 ++++++++-- kernel/bpf/tcx.c | 348 +++++++++++++++++++++++++++++++++++++++++ net/Kconfig | 5 + net/core/dev.c | 265 +++++++++++++++++++------------ net/core/filter.c | 4 +- net/sched/Kconfig | 4 +- net/sched/sch_ingress.c | 61 +++++++- tools/include/uapi/linux/bpf.h | 34 +++- 17 files changed, 935 insertions(+), 144 deletions(-) create mode 100644 include/net/tcx.h create mode 100644 kernel/bpf/tcx.c (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index 678bef9f60b4..990e3fce753c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3778,13 +3778,15 @@ L: netdev@vger.kernel.org S: Maintained F: kernel/bpf/bpf_struct* -BPF [NETWORKING] (tc BPF, sock_addr) +BPF [NETWORKING] (tcx & tc BPF, sock_addr) M: Martin KaFai Lau M: Daniel Borkmann R: John Fastabend L: bpf@vger.kernel.org L: netdev@vger.kernel.org S: Maintained +F: include/net/tcx.h +F: kernel/bpf/tcx.c F: net/core/filter.c F: net/sched/act_bpf.c F: net/sched/cls_bpf.c diff --git a/include/linux/bpf_mprog.h b/include/linux/bpf_mprog.h index 6feefec43422..2b429488f840 100644 --- a/include/linux/bpf_mprog.h +++ b/include/linux/bpf_mprog.h @@ -315,4 +315,13 @@ int bpf_mprog_detach(struct bpf_mprog_entry *entry, int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr, struct bpf_mprog_entry *entry); +static inline bool bpf_mprog_supported(enum bpf_prog_type type) +{ + switch (type) { + case BPF_PROG_TYPE_SCHED_CLS: + return true; + default: + return false; + } +} #endif /* __BPF_MPROG_H */ diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b12477ea4032..3800d0479698 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1930,8 +1930,7 @@ enum netdev_ml_priv_type { * * @rx_handler: handler for received packets * @rx_handler_data: XXX: need comments on this one - * @miniq_ingress: ingress/clsact qdisc specific data for - * ingress processing + * @tcx_ingress: BPF & clsact qdisc specific data for ingress processing * @ingress_queue: XXX: need comments on this one * @nf_hooks_ingress: netfilter hooks executed for ingress packets * @broadcast: hw bcast address @@ -1952,8 +1951,7 @@ enum netdev_ml_priv_type { * @xps_maps: all CPUs/RXQs maps for XPS device * * @xps_maps: XXX: need comments on this one - * @miniq_egress: clsact qdisc specific data for - * egress processing + * @tcx_egress: BPF & clsact qdisc specific data for egress processing * @nf_hooks_egress: netfilter hooks executed for egress packets * @qdisc_hash: qdisc hash table * @watchdog_timeo: Represents the timeout that is used by @@ -2253,9 +2251,8 @@ struct net_device { unsigned int xdp_zc_max_segs; rx_handler_func_t __rcu *rx_handler; void __rcu *rx_handler_data; - -#ifdef CONFIG_NET_CLS_ACT - struct mini_Qdisc __rcu *miniq_ingress; +#ifdef CONFIG_NET_XGRESS + struct bpf_mprog_entry __rcu *tcx_ingress; #endif struct netdev_queue __rcu *ingress_queue; #ifdef CONFIG_NETFILTER_INGRESS @@ -2283,8 +2280,8 @@ struct net_device { #ifdef CONFIG_XPS struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX]; #endif -#ifdef CONFIG_NET_CLS_ACT - struct mini_Qdisc __rcu *miniq_egress; +#ifdef CONFIG_NET_XGRESS + struct bpf_mprog_entry __rcu *tcx_egress; #endif #ifdef CONFIG_NETFILTER_EGRESS struct nf_hook_entries __rcu *nf_hooks_egress; diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 91ed66952580..ed83f1c5fc1f 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -944,7 +944,7 @@ struct sk_buff { __u8 __mono_tc_offset[0]; /* public: */ __u8 mono_delivery_time:1; /* See SKB_MONO_DELIVERY_TIME_MASK */ -#ifdef CONFIG_NET_CLS_ACT +#ifdef CONFIG_NET_XGRESS __u8 tc_at_ingress:1; /* See TC_AT_INGRESS_MASK */ __u8 tc_skip_classify:1; #endif @@ -993,7 +993,7 @@ struct sk_buff { __u8 csum_not_inet:1; #endif -#ifdef CONFIG_NET_SCHED +#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS) __u16 tc_index; /* traffic control index */ #endif diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index e92f73bb3198..15be2d96b06d 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -703,7 +703,7 @@ int skb_do_redirect(struct sk_buff *); static inline bool skb_at_tc_ingress(const struct sk_buff *skb) { -#ifdef CONFIG_NET_CLS_ACT +#ifdef CONFIG_NET_XGRESS return skb->tc_at_ingress; #else return false; diff --git a/include/net/tcx.h b/include/net/tcx.h new file mode 100644 index 000000000000..264f147953ba --- /dev/null +++ b/include/net/tcx.h @@ -0,0 +1,206 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (c) 2023 Isovalent */ +#ifndef __NET_TCX_H +#define __NET_TCX_H + +#include +#include + +#include + +struct mini_Qdisc; + +struct tcx_entry { + struct mini_Qdisc __rcu *miniq; + struct bpf_mprog_bundle bundle; + bool miniq_active; + struct rcu_head rcu; +}; + +struct tcx_link { + struct bpf_link link; + struct net_device *dev; + u32 location; +}; + +static inline void tcx_set_ingress(struct sk_buff *skb, bool ingress) +{ +#ifdef CONFIG_NET_XGRESS + skb->tc_at_ingress = ingress; +#endif +} + +#ifdef CONFIG_NET_XGRESS +static inline struct tcx_entry *tcx_entry(struct bpf_mprog_entry *entry) +{ + struct bpf_mprog_bundle *bundle = entry->parent; + + return container_of(bundle, struct tcx_entry, bundle); +} + +static inline struct tcx_link *tcx_link(struct bpf_link *link) +{ + return container_of(link, struct tcx_link, link); +} + +static inline const struct tcx_link *tcx_link_const(const struct bpf_link *link) +{ + return tcx_link((struct bpf_link *)link); +} + +void tcx_inc(void); +void tcx_dec(void); + +static inline void tcx_entry_sync(void) +{ + /* bpf_mprog_entry got a/b swapped, therefore ensure that + * there are no inflight users on the old one anymore. + */ + synchronize_rcu(); +} + +static inline void +tcx_entry_update(struct net_device *dev, struct bpf_mprog_entry *entry, + bool ingress) +{ + ASSERT_RTNL(); + if (ingress) + rcu_assign_pointer(dev->tcx_ingress, entry); + else + rcu_assign_pointer(dev->tcx_egress, entry); +} + +static inline struct bpf_mprog_entry * +tcx_entry_fetch(struct net_device *dev, bool ingress) +{ + ASSERT_RTNL(); + if (ingress) + return rcu_dereference_rtnl(dev->tcx_ingress); + else + return rcu_dereference_rtnl(dev->tcx_egress); +} + +static inline struct bpf_mprog_entry *tcx_entry_create(void) +{ + struct tcx_entry *tcx = kzalloc(sizeof(*tcx), GFP_KERNEL); + + if (tcx) { + bpf_mprog_bundle_init(&tcx->bundle); + return &tcx->bundle.a; + } + return NULL; +} + +static inline void tcx_entry_free(struct bpf_mprog_entry *entry) +{ + kfree_rcu(tcx_entry(entry), rcu); +} + +static inline struct bpf_mprog_entry * +tcx_entry_fetch_or_create(struct net_device *dev, bool ingress, bool *created) +{ + struct bpf_mprog_entry *entry = tcx_entry_fetch(dev, ingress); + + *created = false; + if (!entry) { + entry = tcx_entry_create(); + if (!entry) + return NULL; + *created = true; + } + return entry; +} + +static inline void tcx_skeys_inc(bool ingress) +{ + tcx_inc(); + if (ingress) + net_inc_ingress_queue(); + else + net_inc_egress_queue(); +} + +static inline void tcx_skeys_dec(bool ingress) +{ + if (ingress) + net_dec_ingress_queue(); + else + net_dec_egress_queue(); + tcx_dec(); +} + +static inline void tcx_miniq_set_active(struct bpf_mprog_entry *entry, + const bool active) +{ + ASSERT_RTNL(); + tcx_entry(entry)->miniq_active = active; +} + +static inline bool tcx_entry_is_active(struct bpf_mprog_entry *entry) +{ + ASSERT_RTNL(); + return bpf_mprog_total(entry) || tcx_entry(entry)->miniq_active; +} + +static inline enum tcx_action_base tcx_action_code(struct sk_buff *skb, + int code) +{ + switch (code) { + case TCX_PASS: + skb->tc_index = qdisc_skb_cb(skb)->tc_classid; + fallthrough; + case TCX_DROP: + case TCX_REDIRECT: + return code; + case TCX_NEXT: + default: + return TCX_NEXT; + } +} +#endif /* CONFIG_NET_XGRESS */ + +#if defined(CONFIG_NET_XGRESS) && defined(CONFIG_BPF_SYSCALL) +int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog); +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog); +int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog); +void tcx_uninstall(struct net_device *dev, bool ingress); + +int tcx_prog_query(const union bpf_attr *attr, + union bpf_attr __user *uattr); + +static inline void dev_tcx_uninstall(struct net_device *dev) +{ + ASSERT_RTNL(); + tcx_uninstall(dev, true); + tcx_uninstall(dev, false); +} +#else +static inline int tcx_prog_attach(const union bpf_attr *attr, + struct bpf_prog *prog) +{ + return -EINVAL; +} + +static inline int tcx_link_attach(const union bpf_attr *attr, + struct bpf_prog *prog) +{ + return -EINVAL; +} + +static inline int tcx_prog_detach(const union bpf_attr *attr, + struct bpf_prog *prog) +{ + return -EINVAL; +} + +static inline int tcx_prog_query(const union bpf_attr *attr, + union bpf_attr __user *uattr) +{ + return -EINVAL; +} + +static inline void dev_tcx_uninstall(struct net_device *dev) +{ +} +#endif /* CONFIG_NET_XGRESS && CONFIG_BPF_SYSCALL */ +#endif /* __NET_TCX_H */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index d4c07e435336..739c15906a65 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1036,6 +1036,8 @@ enum bpf_attach_type { BPF_LSM_CGROUP, BPF_STRUCT_OPS, BPF_NETFILTER, + BPF_TCX_INGRESS, + BPF_TCX_EGRESS, __MAX_BPF_ATTACH_TYPE }; @@ -1053,7 +1055,7 @@ enum bpf_link_type { BPF_LINK_TYPE_KPROBE_MULTI = 8, BPF_LINK_TYPE_STRUCT_OPS = 9, BPF_LINK_TYPE_NETFILTER = 10, - + BPF_LINK_TYPE_TCX = 11, MAX_BPF_LINK_TYPE, }; @@ -1569,13 +1571,13 @@ union bpf_attr { __u32 map_fd; /* struct_ops to attach */ }; union { - __u32 target_fd; /* object to attach to */ - __u32 target_ifindex; /* target ifindex */ + __u32 target_fd; /* target object to attach to or ... */ + __u32 target_ifindex; /* target ifindex */ }; __u32 attach_type; /* attach type */ __u32 flags; /* extra flags */ union { - __u32 target_btf_id; /* btf_id of target to attach to */ + __u32 target_btf_id; /* btf_id of target to attach to */ struct { __aligned_u64 iter_info; /* extra bpf_iter_link_info */ __u32 iter_info_len; /* iter_info length */ @@ -1609,6 +1611,13 @@ union bpf_attr { __s32 priority; __u32 flags; } netfilter; + struct { + union { + __u32 relative_fd; + __u32 relative_id; + }; + __u64 expected_revision; + } tcx; }; } link_create; @@ -6217,6 +6226,19 @@ struct bpf_sock_tuple { }; }; +/* (Simplified) user return codes for tcx prog type. + * A valid tcx program must return one of these defined values. All other + * return codes are reserved for future use. Must remain compatible with + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown + * return codes are mapped to TCX_NEXT. + */ +enum tcx_action_base { + TCX_NEXT = -1, + TCX_PASS = 0, + TCX_DROP = 2, + TCX_REDIRECT = 7, +}; + struct bpf_xdp_sock { __u32 queue_id; }; @@ -6499,6 +6521,10 @@ struct bpf_link_info { } event; /* BPF_PERF_EVENT_EVENT */ }; } perf_event; + struct { + __u32 ifindex; + __u32 attach_type; + } tcx; }; } __attribute__((aligned(8))); diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig index 2dfe1079f772..6a906ff93006 100644 --- a/kernel/bpf/Kconfig +++ b/kernel/bpf/Kconfig @@ -31,6 +31,7 @@ config BPF_SYSCALL select TASKS_TRACE_RCU select BINARY_PRINTF select NET_SOCK_MSG if NET + select NET_XGRESS if NET select PAGE_POOL if NET default n help diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 1bea2eb912cd..f526b7573e97 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -21,6 +21,7 @@ obj-$(CONFIG_BPF_SYSCALL) += devmap.o obj-$(CONFIG_BPF_SYSCALL) += cpumap.o obj-$(CONFIG_BPF_SYSCALL) += offload.o obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o +obj-$(CONFIG_BPF_SYSCALL) += tcx.o endif ifeq ($(CONFIG_PERF_EVENTS),y) obj-$(CONFIG_BPF_SYSCALL) += stackmap.o diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index ee8cb1a174aa..7f4e8c357a6a 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -37,6 +37,8 @@ #include #include +#include + #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \ (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \ (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) @@ -3740,31 +3742,45 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type) return BPF_PROG_TYPE_XDP; case BPF_LSM_CGROUP: return BPF_PROG_TYPE_LSM; + case BPF_TCX_INGRESS: + case BPF_TCX_EGRESS: + return BPF_PROG_TYPE_SCHED_CLS; default: return BPF_PROG_TYPE_UNSPEC; } } -#define BPF_PROG_ATTACH_LAST_FIELD replace_bpf_fd +#define BPF_PROG_ATTACH_LAST_FIELD expected_revision + +#define BPF_F_ATTACH_MASK_BASE \ + (BPF_F_ALLOW_OVERRIDE | \ + BPF_F_ALLOW_MULTI | \ + BPF_F_REPLACE) -#define BPF_F_ATTACH_MASK \ - (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI | BPF_F_REPLACE) +#define BPF_F_ATTACH_MASK_MPROG \ + (BPF_F_REPLACE | \ + BPF_F_BEFORE | \ + BPF_F_AFTER | \ + BPF_F_ID | \ + BPF_F_LINK) static int bpf_prog_attach(const union bpf_attr *attr) { enum bpf_prog_type ptype; struct bpf_prog *prog; + u32 mask; int ret; if (CHECK_ATTR(BPF_PROG_ATTACH)) return -EINVAL; - if (attr->attach_flags & ~BPF_F_ATTACH_MASK) - return -EINVAL; - ptype = attach_type_to_prog_type(attr->attach_type); if (ptype == BPF_PROG_TYPE_UNSPEC) return -EINVAL; + mask = bpf_mprog_supported(ptype) ? + BPF_F_ATTACH_MASK_MPROG : BPF_F_ATTACH_MASK_BASE; + if (attr->attach_flags & ~mask) + return -EINVAL; prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype); if (IS_ERR(prog)) @@ -3800,6 +3816,9 @@ static int bpf_prog_attach(const union bpf_attr *attr) else ret = cgroup_bpf_prog_attach(attr, ptype, prog); break; + case BPF_PROG_TYPE_SCHED_CLS: + ret = tcx_prog_attach(attr, prog); + break; default: ret = -EINVAL; } @@ -3809,25 +3828,41 @@ static int bpf_prog_attach(const union bpf_attr *attr) return ret; } -#define BPF_PROG_DETACH_LAST_FIELD attach_type +#define BPF_PROG_DETACH_LAST_FIELD expected_revision static int bpf_prog_detach(const union bpf_attr *attr) { + struct bpf_prog *prog = NULL; enum bpf_prog_type ptype; + int ret; if (CHECK_ATTR(BPF_PROG_DETACH)) return -EINVAL; ptype = attach_type_to_prog_type(attr->attach_type); + if (bpf_mprog_supported(ptype)) { + if (ptype == BPF_PROG_TYPE_UNSPEC) + return -EINVAL; + if (attr->attach_flags & ~BPF_F_ATTACH_MASK_MPROG) + return -EINVAL; + if (attr->attach_bpf_fd) { + prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype); + if (IS_ERR(prog)) + return PTR_ERR(prog); + } + } switch (ptype) { case BPF_PROG_TYPE_SK_MSG: case BPF_PROG_TYPE_SK_SKB: - return sock_map_prog_detach(attr, ptype); + ret = sock_map_prog_detach(attr, ptype); + break; case BPF_PROG_TYPE_LIRC_MODE2: - return lirc_prog_detach(attr); + ret = lirc_prog_detach(attr); + break; case BPF_PROG_TYPE_FLOW_DISSECTOR: - return netns_bpf_prog_detach(attr, ptype); + ret = netns_bpf_prog_detach(attr, ptype); + break; case BPF_PROG_TYPE_CGROUP_DEVICE: case BPF_PROG_TYPE_CGROUP_SKB: case BPF_PROG_TYPE_CGROUP_SOCK: @@ -3836,13 +3871,21 @@ static int bpf_prog_detach(const union bpf_attr *attr) case BPF_PROG_TYPE_CGROUP_SYSCTL: case BPF_PROG_TYPE_SOCK_OPS: case BPF_PROG_TYPE_LSM: - return cgroup_bpf_prog_detach(attr, ptype); + ret = cgroup_bpf_prog_detach(attr, ptype); + break; + case BPF_PROG_TYPE_SCHED_CLS: + ret = tcx_prog_detach(attr, prog); + break; default: - return -EINVAL; + ret = -EINVAL; } + + if (prog) + bpf_prog_put(prog); + return ret; } -#define BPF_PROG_QUERY_LAST_FIELD query.prog_attach_flags +#define BPF_PROG_QUERY_LAST_FIELD query.link_attach_flags static int bpf_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) @@ -3890,6 +3933,9 @@ static int bpf_prog_query(const union bpf_attr *attr, case BPF_SK_MSG_VERDICT: case BPF_SK_SKB_VERDICT: return sock_map_bpf_prog_query(attr, uattr); + case BPF_TCX_INGRESS: + case BPF_TCX_EGRESS: + return tcx_prog_query(attr, uattr); default: return -EINVAL; } @@ -4852,6 +4898,13 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr) goto out; } break; + case BPF_PROG_TYPE_SCHED_CLS: + if (attr->link_create.attach_type != BPF_TCX_INGRESS && + attr->link_create.attach_type != BPF_TCX_EGRESS) { + ret = -EINVAL; + goto out; + } + break; default: ptype = attach_type_to_prog_type(attr->link_create.attach_type); if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) { @@ -4903,6 +4956,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr) case BPF_PROG_TYPE_XDP: ret = bpf_xdp_link_attach(attr, prog); break; + case BPF_PROG_TYPE_SCHED_CLS: + ret = tcx_link_attach(attr, prog); + break; case BPF_PROG_TYPE_NETFILTER: ret = bpf_nf_link_attach(attr, prog); break; diff --git a/kernel/bpf/tcx.c b/kernel/bpf/tcx.c new file mode 100644 index 000000000000..69a272712b29 --- /dev/null +++ b/kernel/bpf/tcx.c @@ -0,0 +1,348 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2023 Isovalent */ + +#include +#include +#include + +#include + +int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog) +{ + bool created, ingress = attr->attach_type == BPF_TCX_INGRESS; + struct net *net = current->nsproxy->net_ns; + struct bpf_mprog_entry *entry, *entry_new; + struct bpf_prog *replace_prog = NULL; + struct net_device *dev; + int ret; + + rtnl_lock(); + dev = __dev_get_by_index(net, attr->target_ifindex); + if (!dev) { + ret = -ENODEV; + goto out; + } + if (attr->attach_flags & BPF_F_REPLACE) { + replace_prog = bpf_prog_get_type(attr->replace_bpf_fd, + prog->type); + if (IS_ERR(replace_prog)) { + ret = PTR_ERR(replace_prog); + replace_prog = NULL; + goto out; + } + } + entry = tcx_entry_fetch_or_create(dev, ingress, &created); + if (!entry) { + ret = -ENOMEM; + goto out; + } + ret = bpf_mprog_attach(entry, &entry_new, prog, NULL, replace_prog, + attr->attach_flags, attr->relative_fd, + attr->expected_revision); + if (!ret) { + if (entry != entry_new) { + tcx_entry_update(dev, entry_new, ingress); + tcx_entry_sync(); + tcx_skeys_inc(ingress); + } + bpf_mprog_commit(entry); + } else if (created) { + tcx_entry_free(entry); + } +out: + if (replace_prog) + bpf_prog_put(replace_prog); + rtnl_unlock(); + return ret; +} + +int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog) +{ + bool ingress = attr->attach_type == BPF_TCX_INGRESS; + struct net *net = current->nsproxy->net_ns; + struct bpf_mprog_entry *entry, *entry_new; + struct net_device *dev; + int ret; + + rtnl_lock(); + dev = __dev_get_by_index(net, attr->target_ifindex); + if (!dev) { + ret = -ENODEV; + goto out; + } + entry = tcx_entry_fetch(dev, ingress); + if (!entry) { + ret = -ENOENT; + goto out; + } + ret = bpf_mprog_detach(entry, &entry_new, prog, NULL, attr->attach_flags, + attr->relative_fd, attr->expected_revision); + if (!ret) { + if (!tcx_entry_is_active(entry_new)) + entry_new = NULL; + tcx_entry_update(dev, entry_new, ingress); + tcx_entry_sync(); + tcx_skeys_dec(ingress); + bpf_mprog_commit(entry); + if (!entry_new) + tcx_entry_free(entry); + } +out: + rtnl_unlock(); + return ret; +} + +void tcx_uninstall(struct net_device *dev, bool ingress) +{ + struct bpf_tuple tuple = {}; + struct bpf_mprog_entry *entry; + struct bpf_mprog_fp *fp; + struct bpf_mprog_cp *cp; + + entry = tcx_entry_fetch(dev, ingress); + if (!entry) + return; + tcx_entry_update(dev, NULL, ingress); + tcx_entry_sync(); + bpf_mprog_foreach_tuple(entry, fp, cp, tuple) { + if (tuple.link) + tcx_link(tuple.link)->dev = NULL; + else + bpf_prog_put(tuple.prog); + tcx_skeys_dec(ingress); + } + WARN_ON_ONCE(tcx_entry(entry)->miniq_active); + tcx_entry_free(entry); +} + +int tcx_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) +{ + bool ingress = attr->query.attach_type == BPF_TCX_INGRESS; + struct net *net = current->nsproxy->net_ns; + struct bpf_mprog_entry *entry; + struct net_device *dev; + int ret; + + rtnl_lock(); + dev = __dev_get_by_index(net, attr->query.target_ifindex); + if (!dev) { + ret = -ENODEV; + goto out; + } + entry = tcx_entry_fetch(dev, ingress); + if (!entry) { + ret = -ENOENT; + goto out; + } + ret = bpf_mprog_query(attr, uattr, entry); +out: + rtnl_unlock(); + return ret; +} + +static int tcx_link_prog_attach(struct bpf_link *link, u32 flags, u32 id_or_fd, + u64 revision) +{ + struct tcx_link *tcx = tcx_link(link); + bool created, ingress = tcx->location == BPF_TCX_INGRESS; + struct bpf_mprog_entry *entry, *entry_new; + struct net_device *dev = tcx->dev; + int ret; + + ASSERT_RTNL(); + entry = tcx_entry_fetch_or_create(dev, ingress, &created); + if (!entry) + return -ENOMEM; + ret = bpf_mprog_attach(entry, &entry_new, link->prog, link, NULL, flags, + id_or_fd, revision); + if (!ret) { + if (entry != entry_new) { + tcx_entry_update(dev, entry_new, ingress); + tcx_entry_sync(); + tcx_skeys_inc(ingress); + } + bpf_mprog_commit(entry); + } else if (created) { + tcx_entry_free(entry); + } + return ret; +} + +static void tcx_link_release(struct bpf_link *link) +{ + struct tcx_link *tcx = tcx_link(link); + bool ingress = tcx->location == BPF_TCX_INGRESS; + struct bpf_mprog_entry *entry, *entry_new; + struct net_device *dev; + int ret = 0; + + rtnl_lock(); + dev = tcx->dev; + if (!dev) + goto out; + entry = tcx_entry_fetch(dev, ingress); + if (!entry) { + ret = -ENOENT; + goto out; + } + ret = bpf_mprog_detach(entry, &entry_new, link->prog, link, 0, 0, 0); + if (!ret) { + if (!tcx_entry_is_active(entry_new)) + entry_new = NULL; + tcx_entry_update(dev, entry_new, ingress); + tcx_entry_sync(); + tcx_skeys_dec(ingress); + bpf_mprog_commit(entry); + if (!entry_new) + tcx_entry_free(entry); + tcx->dev = NULL; + } +out: + WARN_ON_ONCE(ret); + rtnl_unlock(); +} + +static int tcx_link_update(struct bpf_link *link, struct bpf_prog *nprog, + struct bpf_prog *oprog) +{ + struct tcx_link *tcx = tcx_link(link); + bool ingress = tcx->location == BPF_TCX_INGRESS; + struct bpf_mprog_entry *entry, *entry_new; + struct net_device *dev; + int ret = 0; + + rtnl_lock(); + dev = tcx->dev; + if (!dev) { + ret = -ENOLINK; + goto out; + } + if (oprog && link->prog != oprog) { + ret = -EPERM; + goto out; + } + oprog = link->prog; + if (oprog == nprog) { + bpf_prog_put(nprog); + goto out; + } + entry = tcx_entry_fetch(dev, ingress); + if (!entry) { + ret = -ENOENT; + goto out; + } + ret = bpf_mprog_attach(entry, &entry_new, nprog, link, oprog, + BPF_F_REPLACE | BPF_F_ID, + link->prog->aux->id, 0); + if (!ret) { + WARN_ON_ONCE(entry != entry_new); + oprog = xchg(&link->prog, nprog); + bpf_prog_put(oprog); + bpf_mprog_commit(entry); + } +out: + rtnl_unlock(); + return ret; +} + +static void tcx_link_dealloc(struct bpf_link *link) +{ + kfree(tcx_link(link)); +} + +static void tcx_link_fdinfo(const struct bpf_link *link, struct seq_file *seq) +{ + const struct tcx_link *tcx = tcx_link_const(link); + u32 ifindex = 0; + + rtnl_lock(); + if (tcx->dev) + ifindex = tcx->dev->ifindex; + rtnl_unlock(); + + seq_printf(seq, "ifindex:\t%u\n", ifindex); + seq_printf(seq, "attach_type:\t%u (%s)\n", + tcx->location, + tcx->location == BPF_TCX_INGRESS ? "ingress" : "egress"); +} + +static int tcx_link_fill_info(const struct bpf_link *link, + struct bpf_link_info *info) +{ + const struct tcx_link *tcx = tcx_link_const(link); + u32 ifindex = 0; + + rtnl_lock(); + if (tcx->dev) + ifindex = tcx->dev->ifindex; + rtnl_unlock(); + + info->tcx.ifindex = ifindex; + info->tcx.attach_type = tcx->location; + return 0; +} + +static int tcx_link_detach(struct bpf_link *link) +{ + tcx_link_release(link); + return 0; +} + +static const struct bpf_link_ops tcx_link_lops = { + .release = tcx_link_release, + .detach = tcx_link_detach, + .dealloc = tcx_link_dealloc, + .update_prog = tcx_link_update, + .show_fdinfo = tcx_link_fdinfo, + .fill_link_info = tcx_link_fill_info, +}; + +static int tcx_link_init(struct tcx_link *tcx, + struct bpf_link_primer *link_primer, + const union bpf_attr *attr, + struct net_device *dev, + struct bpf_prog *prog) +{ + bpf_link_init(&tcx->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog); + tcx->location = attr->link_create.attach_type; + tcx->dev = dev; + return bpf_link_prime(&tcx->link, link_primer); +} + +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog) +{ + struct net *net = current->nsproxy->net_ns; + struct bpf_link_primer link_primer; + struct net_device *dev; + struct tcx_link *tcx; + int ret; + + rtnl_lock(); + dev = __dev_get_by_index(net, attr->link_create.target_ifindex); + if (!dev) { + ret = -ENODEV; + goto out; + } + tcx = kzalloc(sizeof(*tcx), GFP_USER); + if (!tcx) { + ret = -ENOMEM; + goto out; + } + ret = tcx_link_init(tcx, &link_primer, attr, dev, prog); + if (ret) { + kfree(tcx); + goto out; + } + ret = tcx_link_prog_attach(&tcx->link, attr->link_create.flags, + attr->link_create.tcx.relative_fd, + attr->link_create.tcx.expected_revision); + if (ret) { + tcx->dev = NULL; + bpf_link_cleanup(&link_primer); + goto out; + } + ret = bpf_link_settle(&link_primer); +out: + rtnl_unlock(); + return ret; +} diff --git a/net/Kconfig b/net/Kconfig index 2fb25b534df5..d532ec33f1fe 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -52,6 +52,11 @@ config NET_INGRESS config NET_EGRESS bool +config NET_XGRESS + select NET_INGRESS + select NET_EGRESS + bool + config NET_REDIRECT bool diff --git a/net/core/dev.c b/net/core/dev.c index dd4f114a7cbf..8e7d0cb540cd 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -107,6 +107,7 @@ #include #include #include +#include #include #include #include @@ -154,7 +155,6 @@ #include "dev.h" #include "net-sysfs.h" - static DEFINE_SPINLOCK(ptype_lock); struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly; struct list_head ptype_all __read_mostly; /* Taps */ @@ -3882,69 +3882,198 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb) EXPORT_SYMBOL(dev_loopback_xmit); #ifdef CONFIG_NET_EGRESS -static struct sk_buff * -sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) +static struct netdev_queue * +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb) +{ + int qm = skb_get_queue_mapping(skb); + + return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm)); +} + +static bool netdev_xmit_txqueue_skipped(void) { + return __this_cpu_read(softnet_data.xmit.skip_txqueue); +} + +void netdev_xmit_skip_txqueue(bool skip) +{ + __this_cpu_write(softnet_data.xmit.skip_txqueue, skip); +} +EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue); +#endif /* CONFIG_NET_EGRESS */ + +#ifdef CONFIG_NET_XGRESS +static int tc_run(struct tcx_entry *entry, struct sk_buff *skb) +{ + int ret = TC_ACT_UNSPEC; #ifdef CONFIG_NET_CLS_ACT - struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress); - struct tcf_result cl_res; + struct mini_Qdisc *miniq = rcu_dereference_bh(entry->miniq); + struct tcf_result res; if (!miniq) - return skb; + return ret; - /* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */ tc_skb_cb(skb)->mru = 0; tc_skb_cb(skb)->post_ct = false; - mini_qdisc_bstats_cpu_update(miniq, skb); - switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) { + mini_qdisc_bstats_cpu_update(miniq, skb); + ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false); + /* Only tcf related quirks below. */ + switch (ret) { + case TC_ACT_SHOT: + mini_qdisc_qstats_cpu_drop(miniq); + break; case TC_ACT_OK: case TC_ACT_RECLASSIFY: - skb->tc_index = TC_H_MIN(cl_res.classid); + skb->tc_index = TC_H_MIN(res.classid); break; + } +#endif /* CONFIG_NET_CLS_ACT */ + return ret; +} + +static DEFINE_STATIC_KEY_FALSE(tcx_needed_key); + +void tcx_inc(void) +{ + static_branch_inc(&tcx_needed_key); +} + +void tcx_dec(void) +{ + static_branch_dec(&tcx_needed_key); +} + +static __always_inline enum tcx_action_base +tcx_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb, + const bool needs_mac) +{ + const struct bpf_mprog_fp *fp; + const struct bpf_prog *prog; + int ret = TCX_NEXT; + + if (needs_mac) + __skb_push(skb, skb->mac_len); + bpf_mprog_foreach_prog(entry, fp, prog) { + bpf_compute_data_pointers(skb); + ret = bpf_prog_run(prog, skb); + if (ret != TCX_NEXT) + break; + } + if (needs_mac) + __skb_pull(skb, skb->mac_len); + return tcx_action_code(skb, ret); +} + +static __always_inline struct sk_buff * +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, + struct net_device *orig_dev, bool *another) +{ + struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress); + int sch_ret; + + if (!entry) + return skb; + if (*pt_prev) { + *ret = deliver_skb(skb, *pt_prev, orig_dev); + *pt_prev = NULL; + } + + qdisc_skb_cb(skb)->pkt_len = skb->len; + tcx_set_ingress(skb, true); + + if (static_branch_unlikely(&tcx_needed_key)) { + sch_ret = tcx_run(entry, skb, true); + if (sch_ret != TC_ACT_UNSPEC) + goto ingress_verdict; + } + sch_ret = tc_run(tcx_entry(entry), skb); +ingress_verdict: + switch (sch_ret) { + case TC_ACT_REDIRECT: + /* skb_mac_header check was done by BPF, so we can safely + * push the L2 header back before redirecting to another + * netdev. + */ + __skb_push(skb, skb->mac_len); + if (skb_do_redirect(skb) == -EAGAIN) { + __skb_pull(skb, skb->mac_len); + *another = true; + break; + } + *ret = NET_RX_SUCCESS; + return NULL; case TC_ACT_SHOT: - mini_qdisc_qstats_cpu_drop(miniq); - *ret = NET_XMIT_DROP; - kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS); + kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS); + *ret = NET_RX_DROP; return NULL; + /* used by tc_run */ case TC_ACT_STOLEN: case TC_ACT_QUEUED: case TC_ACT_TRAP: - *ret = NET_XMIT_SUCCESS; consume_skb(skb); + fallthrough; + case TC_ACT_CONSUMED: + *ret = NET_RX_SUCCESS; return NULL; + } + + return skb; +} + +static __always_inline struct sk_buff * +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) +{ + struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress); + int sch_ret; + + if (!entry) + return skb; + + /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was + * already set by the caller. + */ + if (static_branch_unlikely(&tcx_needed_key)) { + sch_ret = tcx_run(entry, skb, false); + if (sch_ret != TC_ACT_UNSPEC) + goto egress_verdict; + } + sch_ret = tc_run(tcx_entry(entry), skb); +egress_verdict: + switch (sch_ret) { case TC_ACT_REDIRECT: /* No need to push/pop skb's mac_header here on egress! */ skb_do_redirect(skb); *ret = NET_XMIT_SUCCESS; return NULL; - default: - break; + case TC_ACT_SHOT: + kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS); + *ret = NET_XMIT_DROP; + return NULL; + /* used by tc_run */ + case TC_ACT_STOLEN: + case TC_ACT_QUEUED: + case TC_ACT_TRAP: + *ret = NET_XMIT_SUCCESS; + return NULL; } -#endif /* CONFIG_NET_CLS_ACT */ return skb; } - -static struct netdev_queue * -netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb) -{ - int qm = skb_get_queue_mapping(skb); - - return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm)); -} - -static bool netdev_xmit_txqueue_skipped(void) +#else +static __always_inline struct sk_buff * +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, + struct net_device *orig_dev, bool *another) { - return __this_cpu_read(softnet_data.xmit.skip_txqueue); + return skb; } -void netdev_xmit_skip_txqueue(bool skip) +static __always_inline struct sk_buff * +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) { - __this_cpu_write(softnet_data.xmit.skip_txqueue, skip); + return skb; } -EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue); -#endif /* CONFIG_NET_EGRESS */ +#endif /* CONFIG_NET_XGRESS */ #ifdef CONFIG_XPS static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb, @@ -4128,9 +4257,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) skb_update_prio(skb); qdisc_pkt_len_init(skb); -#ifdef CONFIG_NET_CLS_ACT - skb->tc_at_ingress = 0; -#endif + tcx_set_ingress(skb, false); #ifdef CONFIG_NET_EGRESS if (static_branch_unlikely(&egress_needed_key)) { if (nf_hook_egress_active()) { @@ -5064,72 +5191,6 @@ int (*br_fdb_test_addr_hook)(struct net_device *dev, EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook); #endif -static inline struct sk_buff * -sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, - struct net_device *orig_dev, bool *another) -{ -#ifdef CONFIG_NET_CLS_ACT - struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress); - struct tcf_result cl_res; - - /* If there's at least one ingress present somewhere (so - * we get here via enabled static key), remaining devices - * that are not configured with an ingress qdisc will bail - * out here. - */ - if (!miniq) - return skb; - - if (*pt_prev) { - *ret = deliver_skb(skb, *pt_prev, orig_dev); - *pt_prev = NULL; - } - - qdisc_skb_cb(skb)->pkt_len = skb->len; - tc_skb_cb(skb)->mru = 0; - tc_skb_cb(skb)->post_ct = false; - skb->tc_at_ingress = 1; - mini_qdisc_bstats_cpu_update(miniq, skb); - - switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) { - case TC_ACT_OK: - case TC_ACT_RECLASSIFY: - skb->tc_index = TC_H_MIN(cl_res.classid); - break; - case TC_ACT_SHOT: - mini_qdisc_qstats_cpu_drop(miniq); - kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS); - *ret = NET_RX_DROP; - return NULL; - case TC_ACT_STOLEN: - case TC_ACT_QUEUED: - case TC_ACT_TRAP: - consume_skb(skb); - *ret = NET_RX_SUCCESS; - return NULL; - case TC_ACT_REDIRECT: - /* skb_mac_header check was done by cls/act_bpf, so - * we can safely push the L2 header back before - * redirecting to another netdev - */ - __skb_push(skb, skb->mac_len); - if (skb_do_redirect(skb) == -EAGAIN) { - __skb_pull(skb, skb->mac_len); - *another = true; - break; - } - *ret = NET_RX_SUCCESS; - return NULL; - case TC_ACT_CONSUMED: - *ret = NET_RX_SUCCESS; - return NULL; - default: - break; - } -#endif /* CONFIG_NET_CLS_ACT */ - return skb; -} - /** * netdev_is_rx_handler_busy - check if receive handler is registered * @dev: device to check @@ -10835,7 +10896,7 @@ void unregister_netdevice_many_notify(struct list_head *head, /* Shutdown queueing discipline. */ dev_shutdown(dev); - + dev_tcx_uninstall(dev); dev_xdp_uninstall(dev); bpf_dev_bound_netdev_unregister(dev); diff --git a/net/core/filter.c b/net/core/filter.c index b4410dc841a0..797e8f039696 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -9307,7 +9307,7 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog, __u8 value_reg = si->dst_reg; __u8 skb_reg = si->src_reg; -#ifdef CONFIG_NET_CLS_ACT +#ifdef CONFIG_NET_XGRESS /* If the tstamp_type is read, * the bpf prog is aware the tstamp could have delivery time. * Thus, read skb->tstamp as is if tstamp_type_access is true. @@ -9341,7 +9341,7 @@ static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog, __u8 value_reg = si->src_reg; __u8 skb_reg = si->dst_reg; -#ifdef CONFIG_NET_CLS_ACT +#ifdef CONFIG_NET_XGRESS /* If the tstamp_type is read, * the bpf prog is aware the tstamp could have delivery time. * Thus, write skb->tstamp as is if tstamp_type_access is true. diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 4b95cb1ac435..470c70deffe2 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -347,8 +347,7 @@ config NET_SCH_FQ_PIE config NET_SCH_INGRESS tristate "Ingress/classifier-action Qdisc" depends on NET_CLS_ACT - select NET_INGRESS - select NET_EGRESS + select NET_XGRESS help Say Y here if you want to use classifiers for incoming and/or outgoing packets. This qdisc doesn't do anything else besides running classifiers, @@ -679,6 +678,7 @@ config NET_EMATCH_IPT config NET_CLS_ACT bool "Actions" select NET_CLS + select NET_XGRESS help Say Y here if you want to use traffic control actions. Actions get attached to classifiers and are invoked after a successful diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c index e43a45499372..04e886f6cee4 100644 --- a/net/sched/sch_ingress.c +++ b/net/sched/sch_ingress.c @@ -13,6 +13,7 @@ #include #include #include +#include struct ingress_sched_data { struct tcf_block *block; @@ -78,6 +79,8 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt, { struct ingress_sched_data *q = qdisc_priv(sch); struct net_device *dev = qdisc_dev(sch); + struct bpf_mprog_entry *entry; + bool created; int err; if (sch->parent != TC_H_INGRESS) @@ -85,7 +88,13 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt, net_inc_ingress_queue(); - mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress); + entry = tcx_entry_fetch_or_create(dev, true, &created); + if (!entry) + return -ENOMEM; + tcx_miniq_set_active(entry, true); + mini_qdisc_pair_init(&q->miniqp, sch, &tcx_entry(entry)->miniq); + if (created) + tcx_entry_update(dev, entry, true); q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS; q->block_info.chain_head_change = clsact_chain_head_change; @@ -103,11 +112,22 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt, static void ingress_destroy(struct Qdisc *sch) { struct ingress_sched_data *q = qdisc_priv(sch); + struct net_device *dev = qdisc_dev(sch); + struct bpf_mprog_entry *entry = rtnl_dereference(dev->tcx_ingress); if (sch->parent != TC_H_INGRESS) return; tcf_block_put_ext(q->block, sch, &q->block_info); + + if (entry) { + tcx_miniq_set_active(entry, false); + if (!tcx_entry_is_active(entry)) { + tcx_entry_update(dev, NULL, false); + tcx_entry_free(entry); + } + } + net_dec_ingress_queue(); } @@ -223,6 +243,8 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt, { struct clsact_sched_data *q = qdisc_priv(sch); struct net_device *dev = qdisc_dev(sch); + struct bpf_mprog_entry *entry; + bool created; int err; if (sch->parent != TC_H_CLSACT) @@ -231,7 +253,13 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt, net_inc_ingress_queue(); net_inc_egress_queue(); - mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress); + entry = tcx_entry_fetch_or_create(dev, true, &created); + if (!entry) + return -ENOMEM; + tcx_miniq_set_active(entry, true); + mini_qdisc_pair_init(&q->miniqp_ingress, sch, &tcx_entry(entry)->miniq); + if (created) + tcx_entry_update(dev, entry, true); q->ingress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS; q->ingress_block_info.chain_head_change = clsact_chain_head_change; @@ -244,7 +272,13 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt, mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block); - mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress); + entry = tcx_entry_fetch_or_create(dev, false, &created); + if (!entry) + return -ENOMEM; + tcx_miniq_set_active(entry, true); + mini_qdisc_pair_init(&q->miniqp_egress, sch, &tcx_entry(entry)->miniq); + if (created) + tcx_entry_update(dev, entry, false); q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS; q->egress_block_info.chain_head_change = clsact_chain_head_change; @@ -256,12 +290,31 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt, static void clsact_destroy(struct Qdisc *sch) { struct clsact_sched_data *q = qdisc_priv(sch); + struct net_device *dev = qdisc_dev(sch); + struct bpf_mprog_entry *ingress_entry = rtnl_dereference(dev->tcx_ingress); + struct bpf_mprog_entry *egress_entry = rtnl_dereference(dev->tcx_egress); if (sch->parent != TC_H_CLSACT) return; - tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info); tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info); + tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info); + + if (ingress_entry) { + tcx_miniq_set_active(ingress_entry, false); + if (!tcx_entry_is_active(ingress_entry)) { + tcx_entry_update(dev, NULL, true); + tcx_entry_free(ingress_entry); + } + } + + if (egress_entry) { + tcx_miniq_set_active(egress_entry, false); + if (!tcx_entry_is_active(egress_entry)) { + tcx_entry_update(dev, NULL, false); + tcx_entry_free(egress_entry); + } + } net_dec_ingress_queue(); net_dec_egress_queue(); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 1c166870cdf3..47b76925189f 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1036,6 +1036,8 @@ enum bpf_attach_type { BPF_LSM_CGROUP, BPF_STRUCT_OPS, BPF_NETFILTER, + BPF_TCX_INGRESS, + BPF_TCX_EGRESS, __MAX_BPF_ATTACH_TYPE }; @@ -1053,7 +1055,7 @@ enum bpf_link_type { BPF_LINK_TYPE_KPROBE_MULTI = 8, BPF_LINK_TYPE_STRUCT_OPS = 9, BPF_LINK_TYPE_NETFILTER = 10, - + BPF_LINK_TYPE_TCX = 11, MAX_BPF_LINK_TYPE, }; @@ -1569,13 +1571,13 @@ union bpf_attr { __u32 map_fd; /* struct_ops to attach */ }; union { - __u32 target_fd; /* object to attach to */ - __u32 target_ifindex; /* target ifindex */ + __u32 target_fd; /* target object to attach to or ... */ + __u32 target_ifindex; /* target ifindex */ }; __u32 attach_type; /* attach type */ __u32 flags; /* extra flags */ union { - __u32 target_btf_id; /* btf_id of target to attach to */ + __u32 target_btf_id; /* btf_id of target to attach to */ struct { __aligned_u64 iter_info; /* extra bpf_iter_link_info */ __u32 iter_info_len; /* iter_info length */ @@ -1609,6 +1611,13 @@ union bpf_attr { __s32 priority; __u32 flags; } netfilter; + struct { + union { + __u32 relative_fd; + __u32 relative_id; + }; + __u64 expected_revision; + } tcx; }; } link_create; @@ -6217,6 +6226,19 @@ struct bpf_sock_tuple { }; }; +/* (Simplified) user return codes for tcx prog type. + * A valid tcx program must return one of these defined values. All other + * return codes are reserved for future use. Must remain compatible with + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown + * return codes are mapped to TCX_NEXT. + */ +enum tcx_action_base { + TCX_NEXT = -1, + TCX_PASS = 0, + TCX_DROP = 2, + TCX_REDIRECT = 7, +}; + struct bpf_xdp_sock { __u32 queue_id; }; @@ -6499,6 +6521,10 @@ struct bpf_link_info { } event; /* BPF_PERF_EVENT_EVENT */ }; } perf_event; + struct { + __u32 ifindex; + __u32 attach_type; + } tcx; }; } __attribute__((aligned(8))); -- cgit v1.2.3 From 070e8bd31b288c3aa3f965778417708c8bfedb06 Mon Sep 17 00:00:00 2001 From: Marc Kleine-Budde Date: Thu, 20 Jul 2023 17:11:07 +0200 Subject: MAINTAINERS: net: fix sort order Linus seems to like the MAINTAINERS file sorted, see c192ac735768 ("MAINTAINERS 2: Electric Boogaloo"). Since this is currently not the case, restore the sort order. Fixes: 3abf3d15ffff ("MAINTAINERS: ASP 2.0 Ethernet driver maintainers") Signed-off-by: Marc Kleine-Budde Acked-by: Justin Chen Reviewed-by: Randy Dunlap Acked-by: Florian Fainelli Link: https://lore.kernel.org/r/20230720151107.679668-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski --- MAINTAINERS | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index 019b8017ab3c..d0553ad37865 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3838,6 +3838,15 @@ S: Maintained F: kernel/bpf/stackmap.c F: kernel/trace/bpf_trace.c +BROADCOM ASP 2.0 ETHERNET DRIVER +M: Justin Chen +M: Florian Fainelli +L: bcm-kernel-feedback-list@broadcom.com +L: netdev@vger.kernel.org +S: Supported +F: Documentation/devicetree/bindings/net/brcm,asp-v2.0.yaml +F: drivers/net/ethernet/broadcom/asp2/ + BROADCOM B44 10/100 ETHERNET DRIVER M: Michael Chan L: netdev@vger.kernel.org @@ -4155,15 +4164,6 @@ F: drivers/net/mdio/mdio-bcm-unimac.c F: include/linux/platform_data/bcmgenet.h F: include/linux/platform_data/mdio-bcm-unimac.h -BROADCOM ASP 2.0 ETHERNET DRIVER -M: Justin Chen -M: Florian Fainelli -L: bcm-kernel-feedback-list@broadcom.com -L: netdev@vger.kernel.org -S: Supported -F: Documentation/devicetree/bindings/net/brcm,asp-v2.0.yaml -F: drivers/net/ethernet/broadcom/asp2/ - BROADCOM IPROC ARM ARCHITECTURE M: Ray Jui M: Scott Branden -- cgit v1.2.3 From 7b2b20125f1e9ec49e3d13c41264afef70e70463 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Mon, 24 Jul 2023 22:41:00 -0700 Subject: MAINTAINERS: Replace my email address Switch from corporate email address to linux.dev address. Signed-off-by: Yonghong Song Link: https://lore.kernel.org/r/20230725054100.1013421-1-yonghong.song@linux.dev Signed-off-by: Martin KaFai Lau --- MAINTAINERS | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index d0553ad37865..cfd7c06a4b2d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3703,7 +3703,7 @@ M: Daniel Borkmann M: Andrii Nakryiko R: Martin KaFai Lau R: Song Liu -R: Yonghong Song +R: Yonghong Song R: John Fastabend R: KP Singh R: Stanislav Fomichev @@ -3742,7 +3742,7 @@ F: tools/lib/bpf/ F: tools/testing/selftests/bpf/ BPF [ITERATOR] -M: Yonghong Song +M: Yonghong Song L: bpf@vger.kernel.org S: Maintained F: kernel/bpf/*iter.c -- cgit v1.2.3 From e58ee933c27a2ddd587752788dd843817008313f Mon Sep 17 00:00:00 2001 From: Gerhard Uttenthaler Date: Thu, 20 Jul 2023 16:40:32 +0200 Subject: MAINTAINERS: Add myself as maintainer of the ems_pci.c driver At the suggestion of Marc Kleine-Budde [1], I add myself as maintainer of the ems_pci.c driver. [1] https://lore.kernel.org/all/20230720-purplish-quizzical-247024e66671-mkl@pengutronix.de Signed-off-by: Gerhard Uttenthaler Link: https://lore.kernel.org/all/20230720144032.28960-1-uttenthaler@ems-wuensche.com Signed-off-by: Marc Kleine-Budde --- MAINTAINERS | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index 24eb7ed6211f..c4f95a9d03b9 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7606,6 +7606,13 @@ L: linux-mmc@vger.kernel.org S: Supported F: drivers/mmc/host/cqhci* +EMS CPC-PCI CAN DRIVER +M: Gerhard Uttenthaler +M: support@ems-wuensche.com +L: linux-can@vger.kernel.org +S: Maintained +F: drivers/net/can/sja1000/ems_pci.c + EMULEX 10Gbps iSCSI - OneConnect DRIVER M: Ketan Mukadam L: linux-scsi@vger.kernel.org -- cgit v1.2.3 From 60495b6622ca67f5180343b89bd932d28d23f63a Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Tue, 1 Aug 2023 17:28:23 +0300 Subject: net: phy: provide phylib stubs for hardware timestamping operations net/core/dev_ioctl.c (built-in code) will want to call phy_mii_ioctl() for hardware timestamping purposes. This is not directly possible, because phy_mii_ioctl() is a symbol provided under CONFIG_PHYLIB. Do something similar to what was done in DSA in commit 5a17818682cf ("net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub"), and arrange some indirect calls to phy_mii_ioctl() through a stub structure containing function pointers, that's provided by phylib as built-in even when CONFIG_PHYLIB=m, and which phy_init() populates at runtime (module insertion). Note: maybe the ownership of the ethtool_phy_ops singleton is backwards, and the methods exposed by that should be later merged into phylib_stubs. Signed-off-by: Vladimir Oltean Link: https://lore.kernel.org/r/20230801142824.1772134-12-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski --- MAINTAINERS | 1 + drivers/net/phy/Makefile | 2 ++ drivers/net/phy/phy.c | 34 ++++++++++++++++++++++ drivers/net/phy/phy_device.c | 19 +++++++++++++ drivers/net/phy/stubs.c | 10 +++++++ include/linux/phy.h | 7 +++++ include/linux/phylib_stubs.h | 68 ++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 141 insertions(+) create mode 100644 drivers/net/phy/stubs.c create mode 100644 include/linux/phylib_stubs.h (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index c4f95a9d03b9..069e176d607a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7752,6 +7752,7 @@ F: include/linux/mii.h F: include/linux/of_net.h F: include/linux/phy.h F: include/linux/phy_fixed.h +F: include/linux/phylib_stubs.h F: include/linux/platform_data/mdio-bcm-unimac.h F: include/linux/platform_data/mdio-gpio.h F: include/trace/events/mdio.h diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile index 35142780fc9d..c945ed9bd14b 100644 --- a/drivers/net/phy/Makefile +++ b/drivers/net/phy/Makefile @@ -14,6 +14,8 @@ endif # dedicated loadable module, so we bundle them all together into libphy.ko ifdef CONFIG_PHYLIB libphy-y += $(mdio-bus-y) +# the stubs are built-in whenever PHYLIB is built-in or module +obj-y += stubs.o else obj-$(CONFIG_MDIO_DEVICE) += mdio-bus.o endif diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index bdf00b2b2c1d..8aec8e83038c 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -455,6 +455,40 @@ int phy_do_ioctl_running(struct net_device *dev, struct ifreq *ifr, int cmd) } EXPORT_SYMBOL(phy_do_ioctl_running); +/** + * __phy_hwtstamp_get - Get hardware timestamping configuration from PHY + * + * @phydev: the PHY device structure + * @config: structure holding the timestamping configuration + * + * Query the PHY device for its current hardware timestamping configuration. + */ +int __phy_hwtstamp_get(struct phy_device *phydev, + struct kernel_hwtstamp_config *config) +{ + if (!phydev) + return -ENODEV; + + return phy_mii_ioctl(phydev, config->ifr, SIOCGHWTSTAMP); +} + +/** + * __phy_hwtstamp_set - Modify PHY hardware timestamping configuration + * + * @phydev: the PHY device structure + * @config: structure holding the timestamping configuration + * @extack: netlink extended ack structure, for error reporting + */ +int __phy_hwtstamp_set(struct phy_device *phydev, + struct kernel_hwtstamp_config *config, + struct netlink_ext_ack *extack) +{ + if (!phydev) + return -ENODEV; + + return phy_mii_ioctl(phydev, config->ifr, SIOCSHWTSTAMP); +} + /** * phy_queue_state_machine - Trigger the state machine to run soon * diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 98b8ac28e5a1..e19c4fee8d22 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -3448,12 +3449,28 @@ static const struct ethtool_phy_ops phy_ethtool_phy_ops = { .start_cable_test_tdr = phy_start_cable_test_tdr, }; +static const struct phylib_stubs __phylib_stubs = { + .hwtstamp_get = __phy_hwtstamp_get, + .hwtstamp_set = __phy_hwtstamp_set, +}; + +static void phylib_register_stubs(void) +{ + phylib_stubs = &__phylib_stubs; +} + +static void phylib_unregister_stubs(void) +{ + phylib_stubs = NULL; +} + static int __init phy_init(void) { int rc; rtnl_lock(); ethtool_set_ethtool_phy_ops(&phy_ethtool_phy_ops); + phylib_register_stubs(); rtnl_unlock(); rc = mdio_bus_init(); @@ -3478,6 +3495,7 @@ err_mdio_bus: mdio_bus_exit(); err_ethtool_phy_ops: rtnl_lock(); + phylib_unregister_stubs(); ethtool_set_ethtool_phy_ops(NULL); rtnl_unlock(); @@ -3490,6 +3508,7 @@ static void __exit phy_exit(void) phy_driver_unregister(&genphy_driver); mdio_bus_exit(); rtnl_lock(); + phylib_unregister_stubs(); ethtool_set_ethtool_phy_ops(NULL); rtnl_unlock(); } diff --git a/drivers/net/phy/stubs.c b/drivers/net/phy/stubs.c new file mode 100644 index 000000000000..cfb9f275eb18 --- /dev/null +++ b/drivers/net/phy/stubs.c @@ -0,0 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Stubs for PHY library functionality called by the core network stack. + * These are necessary because CONFIG_PHYLIB can be a module, and built-in + * code cannot directly call symbols exported by modules. + */ +#include + +const struct phylib_stubs *phylib_stubs; +EXPORT_SYMBOL_GPL(phylib_stubs); diff --git a/include/linux/phy.h b/include/linux/phy.h index b254848a9c99..ba08b0e60279 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -298,6 +298,7 @@ static inline const char *phy_modes(phy_interface_t interface) #define MII_BUS_ID_SIZE 61 struct device; +struct kernel_hwtstamp_config; struct phylink; struct sfp_bus; struct sfp_upstream_ops; @@ -1955,6 +1956,12 @@ int phy_ethtool_set_plca_cfg(struct phy_device *phydev, int phy_ethtool_get_plca_status(struct phy_device *phydev, struct phy_plca_status *plca_st); +int __phy_hwtstamp_get(struct phy_device *phydev, + struct kernel_hwtstamp_config *config); +int __phy_hwtstamp_set(struct phy_device *phydev, + struct kernel_hwtstamp_config *config, + struct netlink_ext_ack *extack); + static inline int phy_package_read(struct phy_device *phydev, u32 regnum) { struct phy_package_shared *shared = phydev->shared; diff --git a/include/linux/phylib_stubs.h b/include/linux/phylib_stubs.h new file mode 100644 index 000000000000..1279f48c8a70 --- /dev/null +++ b/include/linux/phylib_stubs.h @@ -0,0 +1,68 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Stubs for the Network PHY library + */ + +#include + +struct kernel_hwtstamp_config; +struct netlink_ext_ack; +struct phy_device; + +#if IS_ENABLED(CONFIG_PHYLIB) + +extern const struct phylib_stubs *phylib_stubs; + +struct phylib_stubs { + int (*hwtstamp_get)(struct phy_device *phydev, + struct kernel_hwtstamp_config *config); + int (*hwtstamp_set)(struct phy_device *phydev, + struct kernel_hwtstamp_config *config, + struct netlink_ext_ack *extack); +}; + +static inline int phy_hwtstamp_get(struct phy_device *phydev, + struct kernel_hwtstamp_config *config) +{ + /* phylib_register_stubs() and phylib_unregister_stubs() + * also run under rtnl_lock(). + */ + ASSERT_RTNL(); + + if (!phylib_stubs) + return -EOPNOTSUPP; + + return phylib_stubs->hwtstamp_get(phydev, config); +} + +static inline int phy_hwtstamp_set(struct phy_device *phydev, + struct kernel_hwtstamp_config *config, + struct netlink_ext_ack *extack) +{ + /* phylib_register_stubs() and phylib_unregister_stubs() + * also run under rtnl_lock(). + */ + ASSERT_RTNL(); + + if (!phylib_stubs) + return -EOPNOTSUPP; + + return phylib_stubs->hwtstamp_set(phydev, config, extack); +} + +#else + +static inline int phy_hwtstamp_get(struct phy_device *phydev, + struct kernel_hwtstamp_config *config) +{ + return -EOPNOTSUPP; +} + +static inline int phy_hwtstamp_set(struct phy_device *phydev, + struct kernel_hwtstamp_config *config, + struct netlink_ext_ack *extack) +{ + return -EOPNOTSUPP; +} + +#endif -- cgit v1.2.3 From a9ca9f9ceff382b58b488248f0c0da9e157f5d06 Mon Sep 17 00:00:00 2001 From: Yunsheng Lin Date: Fri, 4 Aug 2023 20:05:24 +0200 Subject: page_pool: split types and declarations from page_pool.h Split types and pure function declarations from page_pool.h and add them in page_page/types.h, so that C sources can include page_pool.h and headers should generally only include page_pool/types.h as suggested by jakub. Rename page_pool.h to page_pool/helpers.h to have both in one place. Signed-off-by: Yunsheng Lin Suggested-by: Jakub Kicinski Signed-off-by: Alexander Lobakin Reviewed-by: Alexander Duyck Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com [Jakub: change microsoft/mana, fix kdoc paths in Documentation] Signed-off-by: Jakub Kicinski --- Documentation/networking/page_pool.rst | 6 +- MAINTAINERS | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 2 +- drivers/net/ethernet/engleder/tsnep_main.c | 1 + drivers/net/ethernet/freescale/fec_main.c | 1 + drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 1 + drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 2 +- drivers/net/ethernet/marvell/mvneta.c | 2 +- drivers/net/ethernet/marvell/mvpp2/mvpp2.h | 2 +- drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 1 + .../ethernet/marvell/octeontx2/nic/otx2_common.c | 1 + .../net/ethernet/marvell/octeontx2/nic/otx2_pf.c | 1 + drivers/net/ethernet/mediatek/mtk_eth_soc.c | 1 + drivers/net/ethernet/mediatek/mtk_eth_soc.h | 2 +- .../net/ethernet/mellanox/mlx5/core/en/params.c | 1 + drivers/net/ethernet/mellanox/mlx5/core/en/trap.c | 1 - drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 + drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 2 +- .../net/ethernet/microchip/lan966x/lan966x_fdma.c | 1 + .../net/ethernet/microchip/lan966x/lan966x_main.h | 2 +- drivers/net/ethernet/microsoft/mana/mana_en.c | 1 + drivers/net/ethernet/socionext/netsec.c | 2 +- drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 +- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 1 + drivers/net/ethernet/ti/cpsw.c | 2 +- drivers/net/ethernet/ti/cpsw_new.c | 2 +- drivers/net/ethernet/ti/cpsw_priv.c | 2 +- drivers/net/ethernet/wangxun/libwx/wx_lib.c | 2 +- drivers/net/veth.c | 2 +- drivers/net/wireless/mediatek/mt76/mac80211.c | 1 - drivers/net/wireless/mediatek/mt76/mt76.h | 1 + drivers/net/xen-netfront.c | 2 +- include/linux/skbuff.h | 2 +- include/net/page_pool.h | 470 --------------------- include/net/page_pool/helpers.h | 238 +++++++++++ include/net/page_pool/types.h | 238 +++++++++++ include/trace/events/page_pool.h | 2 +- net/bpf/test_run.c | 2 +- net/core/page_pool.c | 2 +- net/core/skbuff.c | 2 +- net/core/xdp.c | 2 +- 44 files changed, 517 insertions(+), 500 deletions(-) delete mode 100644 include/net/page_pool.h create mode 100644 include/net/page_pool/helpers.h create mode 100644 include/net/page_pool/types.h (limited to 'MAINTAINERS') diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst index 53b5448cc0f1..68b82cea13e4 100644 --- a/Documentation/networking/page_pool.rst +++ b/Documentation/networking/page_pool.rst @@ -67,10 +67,10 @@ a page will cause no race conditions is enough. .. kernel-doc:: net/core/page_pool.c :identifiers: page_pool_create -.. kernel-doc:: include/net/page_pool.h +.. kernel-doc:: include/net/page_pool/types.h :identifiers: struct page_pool_params -.. kernel-doc:: include/net/page_pool.h +.. kernel-doc:: include/net/page_pool/helpers.h :identifiers: page_pool_put_page page_pool_put_full_page page_pool_recycle_direct page_pool_dev_alloc_pages page_pool_get_dma_addr page_pool_get_dma_dir @@ -122,7 +122,7 @@ page_pool_stats allocated by the caller. The API will fill in the provided struct page_pool_stats with statistics about the page_pool. -.. kernel-doc:: include/net/page_pool.h +.. kernel-doc:: include/net/page_pool/types.h :identifiers: struct page_pool_recycle_stats struct page_pool_alloc_stats struct page_pool_stats diff --git a/MAINTAINERS b/MAINTAINERS index 5e2bb1059ab6..08bcf3a7c482 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16020,7 +16020,7 @@ M: Ilias Apalodimas L: netdev@vger.kernel.org S: Supported F: Documentation/networking/page_pool.rst -F: include/net/page_pool.h +F: include/net/page_pool/ F: include/trace/events/page_pool.h F: net/core/page_pool.c diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 6a643aae7802..eb168ca983b7 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -54,7 +54,7 @@ #include #include #include -#include +#include #include #include diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index 2ce46d7affe4..96f5ca778c67 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -15,7 +15,7 @@ #include #include #include -#include +#include #include "bnxt_hsi.h" #include "bnxt.h" #include "bnxt_xdp.h" diff --git a/drivers/net/ethernet/engleder/tsnep_main.c b/drivers/net/ethernet/engleder/tsnep_main.c index 079f9f6ae21a..f61bd89734c5 100644 --- a/drivers/net/ethernet/engleder/tsnep_main.c +++ b/drivers/net/ethernet/engleder/tsnep_main.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #define TSNEP_RX_OFFSET (max(NET_SKB_PAD, XDP_PACKET_HEADROOM) + NET_IP_ALIGN) diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c index 43f14cec91e9..3bd0bf03aedb 100644 --- a/drivers/net/ethernet/freescale/fec_main.c +++ b/drivers/net/ethernet/freescale/fec_main.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 9f6890059666..e5e37a33fd81 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h index 88af34bbee34..acd756b0c7c9 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h @@ -6,7 +6,7 @@ #include #include -#include +#include #include #include "hnae3.h" diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index acf4f6ba73a6..d483b8c00ec0 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -37,7 +37,7 @@ #include #include #include -#include +#include #include #include diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h index 11e603686a27..e809f91c08fb 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h @@ -16,7 +16,7 @@ #include #include #include -#include +#include #include #include diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 9e1b596c8f08..eb74ccddb440 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c index 8cdd92dd9762..8336cea16aff 100644 --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c @@ -7,6 +7,7 @@ #include #include +#include #include #include diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c index 61f62a6ec662..70b9065f7d10 100644 --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c @@ -16,6 +16,7 @@ #include #include #include +#include #include "otx2_reg.h" #include "otx2_common.h" diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c index 1b89f800f6df..fe05c9020269 100644 --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c @@ -26,6 +26,7 @@ #include #include #include +#include #include "mtk_eth_soc.h" #include "mtk_wed.h" diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h index 80d17729e557..4a2470fbad2c 100644 --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h @@ -18,7 +18,7 @@ #include #include #include -#include +#include #include #include "mtk_ppe.h" diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c index 5ce28ff7685f..e097f336e1c4 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c @@ -6,6 +6,7 @@ #include "en/port.h" #include "en_accel/en_accel.h" #include "en_accel/ipsec.h" +#include #include static u8 mlx5e_mpwrq_min_page_shift(struct mlx5_core_dev *mdev) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c b/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c index 201ac7dd338f..698647cc8c0f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c @@ -1,7 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB /* Copyright (c) 2020 Mellanox Technologies */ -#include #include "en/txrx.h" #include "en/params.h" #include "en/trap.h" diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c index 40589cebb773..12f56d0db0af 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c @@ -35,6 +35,7 @@ #include "en/xdp.h" #include "en/params.h" #include +#include int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk) { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 1c820119e438..c8ec6467d4d1 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -38,7 +38,7 @@ #include #include #include -#include +#include #include #include #include "eswitch.h" diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index f7bb5f4aaaca..3fd11b0761e0 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -36,7 +36,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c index 4d77055abd4b..07b84d668fcc 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c @@ -38,7 +38,7 @@ #include "en/port.h" #ifdef CONFIG_PAGE_POOL_STATS -#include +#include #endif static unsigned int stats_grps_num(struct mlx5e_priv *priv) diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c b/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c index bd72fbc2220f..3960534ac2ad 100644 --- a/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c +++ b/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c @@ -2,6 +2,7 @@ #include #include +#include #include "lan966x_main.h" diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h index aebc9154693a..caa9e0533c96 100644 --- a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h +++ b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h @@ -10,7 +10,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c index a08023c57e25..31e2f2c74e15 100644 --- a/drivers/net/ethernet/microsoft/mana/mana_en.c +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c @@ -11,6 +11,7 @@ #include #include +#include #include #include diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c index 0dcd6a568b06..f358ea003193 100644 --- a/drivers/net/ethernet/socionext/netsec.c +++ b/drivers/net/ethernet/socionext/netsec.c @@ -15,7 +15,7 @@ #include #include -#include +#include #include #define NETSEC_REG_SOFT_RST 0x104 diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h index a6d034968654..3401e888a9f6 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h @@ -21,7 +21,7 @@ #include #include #include -#include +#include #include #include diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 99aa5360b3ff..fcab363d8dfa 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include #include "stmmac_ptp.h" diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c index f9cd566d1c9b..ca4d4548f85e 100644 --- a/drivers/net/ethernet/ti/cpsw.c +++ b/drivers/net/ethernet/ti/cpsw.c @@ -31,7 +31,7 @@ #include #include #include -#include +#include #include #include diff --git a/drivers/net/ethernet/ti/cpsw_new.c b/drivers/net/ethernet/ti/cpsw_new.c index c61e4e44a78f..0e4f526b1753 100644 --- a/drivers/net/ethernet/ti/cpsw_new.c +++ b/drivers/net/ethernet/ti/cpsw_new.c @@ -30,7 +30,7 @@ #include #include -#include +#include #include #include diff --git a/drivers/net/ethernet/ti/cpsw_priv.c b/drivers/net/ethernet/ti/cpsw_priv.c index ae52cdbcf8cc..0ec85635dfd6 100644 --- a/drivers/net/ethernet/ti/cpsw_priv.c +++ b/drivers/net/ethernet/ti/cpsw_priv.c @@ -18,7 +18,7 @@ #include #include #include -#include +#include #include #include diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c index 2c3f08be8c37..e04d4a5eed7b 100644 --- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c +++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c @@ -3,7 +3,7 @@ #include #include -#include +#include #include #include #include diff --git a/drivers/net/veth.c b/drivers/net/veth.c index 614f3e3efab0..953f6d8f8db0 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -26,7 +26,7 @@ #include #include #include -#include +#include #define DRV_NAME "veth" #define DRV_VERSION "1.0" diff --git a/drivers/net/wireless/mediatek/mt76/mac80211.c b/drivers/net/wireless/mediatek/mt76/mac80211.c index c0ff36a98bed..d158320bc15d 100644 --- a/drivers/net/wireless/mediatek/mt76/mac80211.c +++ b/drivers/net/wireless/mediatek/mt76/mac80211.c @@ -4,7 +4,6 @@ */ #include #include -#include #include "mt76.h" #define CHAN2G(_idx, _freq) { \ diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h index 878087257ea7..e8757865a3d0 100644 --- a/drivers/net/wireless/mediatek/mt76/mt76.h +++ b/drivers/net/wireless/mediatek/mt76/mt76.h @@ -15,6 +15,7 @@ #include #include #include +#include #include "util.h" #include "testmode.h" diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c index 47d54d8ea59d..ad29f370034e 100644 --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -45,7 +45,7 @@ #include #include #include -#include +#include #include #include diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 16a49ba534e4..888e3d7e74c1 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -32,7 +32,7 @@ #include #include #include -#include +#include #if IS_ENABLED(CONFIG_NF_CONNTRACK) #include #endif diff --git a/include/net/page_pool.h b/include/net/page_pool.h deleted file mode 100644 index 73d4f786418d..000000000000 --- a/include/net/page_pool.h +++ /dev/null @@ -1,470 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 - * - * page_pool.h - * Author: Jesper Dangaard Brouer - * Copyright (C) 2016 Red Hat, Inc. - */ - -/** - * DOC: page_pool allocator - * - * This page_pool allocator is optimized for the XDP mode that - * uses one-frame-per-page, but have fallbacks that act like the - * regular page allocator APIs. - * - * Basic use involve replacing alloc_pages() calls with the - * page_pool_alloc_pages() call. Drivers should likely use - * page_pool_dev_alloc_pages() replacing dev_alloc_pages(). - * - * API keeps track of in-flight pages, in-order to let API user know - * when it is safe to dealloactor page_pool object. Thus, API users - * must call page_pool_put_page() where appropriate and only attach - * the page to a page_pool-aware objects, like skbs marked for recycling. - * - * API user must only call page_pool_put_page() once on a page, as it - * will either recycle the page, or in case of elevated refcnt, it - * will release the DMA mapping and in-flight state accounting. We - * hope to lift this requirement in the future. - */ -#ifndef _NET_PAGE_POOL_H -#define _NET_PAGE_POOL_H - -#include /* Needed by ptr_ring */ -#include -#include - -#define PP_FLAG_DMA_MAP BIT(0) /* Should page_pool do the DMA - * map/unmap - */ -#define PP_FLAG_DMA_SYNC_DEV BIT(1) /* If set all pages that the driver gets - * from page_pool will be - * DMA-synced-for-device according to - * the length provided by the device - * driver. - * Please note DMA-sync-for-CPU is still - * device driver responsibility - */ -#define PP_FLAG_PAGE_FRAG BIT(2) /* for page frag feature */ -#define PP_FLAG_ALL (PP_FLAG_DMA_MAP |\ - PP_FLAG_DMA_SYNC_DEV |\ - PP_FLAG_PAGE_FRAG) - -/* - * Fast allocation side cache array/stack - * - * The cache size and refill watermark is related to the network - * use-case. The NAPI budget is 64 packets. After a NAPI poll the RX - * ring is usually refilled and the max consumed elements will be 64, - * thus a natural max size of objects needed in the cache. - * - * Keeping room for more objects, is due to XDP_DROP use-case. As - * XDP_DROP allows the opportunity to recycle objects directly into - * this array, as it shares the same softirq/NAPI protection. If - * cache is already full (or partly full) then the XDP_DROP recycles - * would have to take a slower code path. - */ -#define PP_ALLOC_CACHE_SIZE 128 -#define PP_ALLOC_CACHE_REFILL 64 -struct pp_alloc_cache { - u32 count; - struct page *cache[PP_ALLOC_CACHE_SIZE]; -}; - -/** - * struct page_pool_params - page pool parameters - * @flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV, PP_FLAG_PAGE_FRAG - * @order: 2^order pages on allocation - * @pool_size: size of the ptr_ring - * @nid: NUMA node id to allocate from pages from - * @dev: device, for DMA pre-mapping purposes - * @napi: NAPI which is the sole consumer of pages, otherwise NULL - * @dma_dir: DMA mapping direction - * @max_len: max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV - * @offset: DMA sync address offset for PP_FLAG_DMA_SYNC_DEV - */ -struct page_pool_params { - unsigned int flags; - unsigned int order; - unsigned int pool_size; - int nid; - struct device *dev; - struct napi_struct *napi; - enum dma_data_direction dma_dir; - unsigned int max_len; - unsigned int offset; -/* private: used by test code only */ - void (*init_callback)(struct page *page, void *arg); - void *init_arg; -}; - -#ifdef CONFIG_PAGE_POOL_STATS -/** - * struct page_pool_alloc_stats - allocation statistics - * @fast: successful fast path allocations - * @slow: slow path order-0 allocations - * @slow_high_order: slow path high order allocations - * @empty: ptr ring is empty, so a slow path allocation was forced - * @refill: an allocation which triggered a refill of the cache - * @waive: pages obtained from the ptr ring that cannot be added to - * the cache due to a NUMA mismatch - */ -struct page_pool_alloc_stats { - u64 fast; - u64 slow; - u64 slow_high_order; - u64 empty; - u64 refill; - u64 waive; -}; - -/** - * struct page_pool_recycle_stats - recycling (freeing) statistics - * @cached: recycling placed page in the page pool cache - * @cache_full: page pool cache was full - * @ring: page placed into the ptr ring - * @ring_full: page released from page pool because the ptr ring was full - * @released_refcnt: page released (and not recycled) because refcnt > 1 - */ -struct page_pool_recycle_stats { - u64 cached; - u64 cache_full; - u64 ring; - u64 ring_full; - u64 released_refcnt; -}; - -/** - * struct page_pool_stats - combined page pool use statistics - * @alloc_stats: see struct page_pool_alloc_stats - * @recycle_stats: see struct page_pool_recycle_stats - * - * Wrapper struct for combining page pool stats with different storage - * requirements. - */ -struct page_pool_stats { - struct page_pool_alloc_stats alloc_stats; - struct page_pool_recycle_stats recycle_stats; -}; - -int page_pool_ethtool_stats_get_count(void); -u8 *page_pool_ethtool_stats_get_strings(u8 *data); -u64 *page_pool_ethtool_stats_get(u64 *data, void *stats); - -/* - * Drivers that wish to harvest page pool stats and report them to users - * (perhaps via ethtool, debugfs, or another mechanism) can allocate a - * struct page_pool_stats call page_pool_get_stats to get stats for the specified pool. - */ -bool page_pool_get_stats(struct page_pool *pool, - struct page_pool_stats *stats); -#else - -static inline int page_pool_ethtool_stats_get_count(void) -{ - return 0; -} - -static inline u8 *page_pool_ethtool_stats_get_strings(u8 *data) -{ - return data; -} - -static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats) -{ - return data; -} - -#endif - -struct page_pool { - struct page_pool_params p; - - struct delayed_work release_dw; - void (*disconnect)(void *); - unsigned long defer_start; - unsigned long defer_warn; - - u32 pages_state_hold_cnt; - unsigned int frag_offset; - struct page *frag_page; - long frag_users; - -#ifdef CONFIG_PAGE_POOL_STATS - /* these stats are incremented while in softirq context */ - struct page_pool_alloc_stats alloc_stats; -#endif - u32 xdp_mem_id; - - /* - * Data structure for allocation side - * - * Drivers allocation side usually already perform some kind - * of resource protection. Piggyback on this protection, and - * require driver to protect allocation side. - * - * For NIC drivers this means, allocate a page_pool per - * RX-queue. As the RX-queue is already protected by - * Softirq/BH scheduling and napi_schedule. NAPI schedule - * guarantee that a single napi_struct will only be scheduled - * on a single CPU (see napi_schedule). - */ - struct pp_alloc_cache alloc ____cacheline_aligned_in_smp; - - /* Data structure for storing recycled pages. - * - * Returning/freeing pages is more complicated synchronization - * wise, because free's can happen on remote CPUs, with no - * association with allocation resource. - * - * Use ptr_ring, as it separates consumer and producer - * effeciently, it a way that doesn't bounce cache-lines. - * - * TODO: Implement bulk return pages into this structure. - */ - struct ptr_ring ring; - -#ifdef CONFIG_PAGE_POOL_STATS - /* recycle stats are per-cpu to avoid locking */ - struct page_pool_recycle_stats __percpu *recycle_stats; -#endif - atomic_t pages_state_release_cnt; - - /* A page_pool is strictly tied to a single RX-queue being - * protected by NAPI, due to above pp_alloc_cache. This - * refcnt serves purpose is to simplify drivers error handling. - */ - refcount_t user_cnt; - - u64 destroy_cnt; -}; - -struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp); - -/** - * page_pool_dev_alloc_pages() - allocate a page. - * @pool: pool from which to allocate - * - * Get a page from the page allocator or page_pool caches. - */ -static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool) -{ - gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN); - - return page_pool_alloc_pages(pool, gfp); -} - -struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset, - unsigned int size, gfp_t gfp); - -static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool, - unsigned int *offset, - unsigned int size) -{ - gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN); - - return page_pool_alloc_frag(pool, offset, size, gfp); -} - -/** - * page_pool_get_dma_dir() - Retrieve the stored DMA direction. - * @pool: pool from which page was allocated - * - * Get the stored dma direction. A driver might decide to store this locally - * and avoid the extra cache line from page_pool to determine the direction. - */ -static -inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool) -{ - return pool->p.dma_dir; -} - -bool page_pool_return_skb_page(struct page *page, bool napi_safe); - -struct page_pool *page_pool_create(const struct page_pool_params *params); - -struct xdp_mem_info; - -#ifdef CONFIG_PAGE_POOL -void page_pool_unlink_napi(struct page_pool *pool); -void page_pool_destroy(struct page_pool *pool); -void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *), - struct xdp_mem_info *mem); -void page_pool_put_page_bulk(struct page_pool *pool, void **data, - int count); -#else -static inline void page_pool_unlink_napi(struct page_pool *pool) -{ -} - -static inline void page_pool_destroy(struct page_pool *pool) -{ -} - -static inline void page_pool_use_xdp_mem(struct page_pool *pool, - void (*disconnect)(void *), - struct xdp_mem_info *mem) -{ -} - -static inline void page_pool_put_page_bulk(struct page_pool *pool, void **data, - int count) -{ -} -#endif - -void page_pool_put_defragged_page(struct page_pool *pool, struct page *page, - unsigned int dma_sync_size, - bool allow_direct); - -/* pp_frag_count represents the number of writers who can update the page - * either by updating skb->data or via DMA mappings for the device. - * We can't rely on the page refcnt for that as we don't know who might be - * holding page references and we can't reliably destroy or sync DMA mappings - * of the fragments. - * - * When pp_frag_count reaches 0 we can either recycle the page if the page - * refcnt is 1 or return it back to the memory allocator and destroy any - * mappings we have. - */ -static inline void page_pool_fragment_page(struct page *page, long nr) -{ - atomic_long_set(&page->pp_frag_count, nr); -} - -static inline long page_pool_defrag_page(struct page *page, long nr) -{ - long ret; - - /* If nr == pp_frag_count then we have cleared all remaining - * references to the page. No need to actually overwrite it, instead - * we can leave this to be overwritten by the calling function. - * - * The main advantage to doing this is that an atomic_read is - * generally a much cheaper operation than an atomic update, - * especially when dealing with a page that may be partitioned - * into only 2 or 3 pieces. - */ - if (atomic_long_read(&page->pp_frag_count) == nr) - return 0; - - ret = atomic_long_sub_return(nr, &page->pp_frag_count); - WARN_ON(ret < 0); - return ret; -} - -static inline bool page_pool_is_last_frag(struct page_pool *pool, - struct page *page) -{ - /* If fragments aren't enabled or count is 0 we were the last user */ - return !(pool->p.flags & PP_FLAG_PAGE_FRAG) || - (page_pool_defrag_page(page, 1) == 0); -} - -/** - * page_pool_put_page() - release a reference to a page pool page - * @pool: pool from which page was allocated - * @page: page to release a reference on - * @dma_sync_size: how much of the page may have been touched by the device - * @allow_direct: released by the consumer, allow lockless caching - * - * The outcome of this depends on the page refcnt. If the driver bumps - * the refcnt > 1 this will unmap the page. If the page refcnt is 1 - * the allocator owns the page and will try to recycle it in one of the pool - * caches. If PP_FLAG_DMA_SYNC_DEV is set, the page will be synced for_device - * using dma_sync_single_range_for_device(). - */ -static inline void page_pool_put_page(struct page_pool *pool, - struct page *page, - unsigned int dma_sync_size, - bool allow_direct) -{ - /* When page_pool isn't compiled-in, net/core/xdp.c doesn't - * allow registering MEM_TYPE_PAGE_POOL, but shield linker. - */ -#ifdef CONFIG_PAGE_POOL - if (!page_pool_is_last_frag(pool, page)) - return; - - page_pool_put_defragged_page(pool, page, dma_sync_size, allow_direct); -#endif -} - -/** - * page_pool_put_full_page() - release a reference on a page pool page - * @pool: pool from which page was allocated - * @page: page to release a reference on - * @allow_direct: released by the consumer, allow lockless caching - * - * Similar to page_pool_put_page(), but will DMA sync the entire memory area - * as configured in &page_pool_params.max_len. - */ -static inline void page_pool_put_full_page(struct page_pool *pool, - struct page *page, bool allow_direct) -{ - page_pool_put_page(pool, page, -1, allow_direct); -} - -/** - * page_pool_recycle_direct() - release a reference on a page pool page - * @pool: pool from which page was allocated - * @page: page to release a reference on - * - * Similar to page_pool_put_full_page() but caller must guarantee safe context - * (e.g NAPI), since it will recycle the page directly into the pool fast cache. - */ -static inline void page_pool_recycle_direct(struct page_pool *pool, - struct page *page) -{ - page_pool_put_full_page(pool, page, true); -} - -#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT \ - (sizeof(dma_addr_t) > sizeof(unsigned long)) - -/** - * page_pool_get_dma_addr() - Retrieve the stored DMA address. - * @page: page allocated from a page pool - * - * Fetch the DMA address of the page. The page pool to which the page belongs - * must had been created with PP_FLAG_DMA_MAP. - */ -static inline dma_addr_t page_pool_get_dma_addr(struct page *page) -{ - dma_addr_t ret = page->dma_addr; - - if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT) - ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16; - - return ret; -} - -static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr) -{ - page->dma_addr = addr; - if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT) - page->dma_addr_upper = upper_32_bits(addr); -} - -static inline bool is_page_pool_compiled_in(void) -{ -#ifdef CONFIG_PAGE_POOL - return true; -#else - return false; -#endif -} - -static inline bool page_pool_put(struct page_pool *pool) -{ - return refcount_dec_and_test(&pool->user_cnt); -} - -/* Caller must provide appropriate safe context, e.g. NAPI. */ -void page_pool_update_nid(struct page_pool *pool, int new_nid); -static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid) -{ - if (unlikely(pool->p.nid != new_nid)) - page_pool_update_nid(pool, new_nid); -} - -#endif /* _NET_PAGE_POOL_H */ diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h new file mode 100644 index 000000000000..78df91804c87 --- /dev/null +++ b/include/net/page_pool/helpers.h @@ -0,0 +1,238 @@ +/* SPDX-License-Identifier: GPL-2.0 + * + * page_pool/helpers.h + * Author: Jesper Dangaard Brouer + * Copyright (C) 2016 Red Hat, Inc. + */ + +/** + * DOC: page_pool allocator + * + * This page_pool allocator is optimized for the XDP mode that + * uses one-frame-per-page, but have fallbacks that act like the + * regular page allocator APIs. + * + * Basic use involve replacing alloc_pages() calls with the + * page_pool_alloc_pages() call. Drivers should likely use + * page_pool_dev_alloc_pages() replacing dev_alloc_pages(). + * + * API keeps track of in-flight pages, in-order to let API user know + * when it is safe to dealloactor page_pool object. Thus, API users + * must call page_pool_put_page() where appropriate and only attach + * the page to a page_pool-aware objects, like skbs marked for recycling. + * + * API user must only call page_pool_put_page() once on a page, as it + * will either recycle the page, or in case of elevated refcnt, it + * will release the DMA mapping and in-flight state accounting. We + * hope to lift this requirement in the future. + */ +#ifndef _NET_PAGE_POOL_HELPERS_H +#define _NET_PAGE_POOL_HELPERS_H + +#include + +#ifdef CONFIG_PAGE_POOL_STATS +int page_pool_ethtool_stats_get_count(void); +u8 *page_pool_ethtool_stats_get_strings(u8 *data); +u64 *page_pool_ethtool_stats_get(u64 *data, void *stats); + +/* + * Drivers that wish to harvest page pool stats and report them to users + * (perhaps via ethtool, debugfs, or another mechanism) can allocate a + * struct page_pool_stats call page_pool_get_stats to get stats for the specified pool. + */ +bool page_pool_get_stats(struct page_pool *pool, + struct page_pool_stats *stats); +#else +static inline int page_pool_ethtool_stats_get_count(void) +{ + return 0; +} + +static inline u8 *page_pool_ethtool_stats_get_strings(u8 *data) +{ + return data; +} + +static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats) +{ + return data; +} +#endif + +/** + * page_pool_dev_alloc_pages() - allocate a page. + * @pool: pool from which to allocate + * + * Get a page from the page allocator or page_pool caches. + */ +static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool) +{ + gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN); + + return page_pool_alloc_pages(pool, gfp); +} + +static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool, + unsigned int *offset, + unsigned int size) +{ + gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN); + + return page_pool_alloc_frag(pool, offset, size, gfp); +} + +/** + * page_pool_get_dma_dir() - Retrieve the stored DMA direction. + * @pool: pool from which page was allocated + * + * Get the stored dma direction. A driver might decide to store this locally + * and avoid the extra cache line from page_pool to determine the direction. + */ +static +inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool) +{ + return pool->p.dma_dir; +} + +/* pp_frag_count represents the number of writers who can update the page + * either by updating skb->data or via DMA mappings for the device. + * We can't rely on the page refcnt for that as we don't know who might be + * holding page references and we can't reliably destroy or sync DMA mappings + * of the fragments. + * + * When pp_frag_count reaches 0 we can either recycle the page if the page + * refcnt is 1 or return it back to the memory allocator and destroy any + * mappings we have. + */ +static inline void page_pool_fragment_page(struct page *page, long nr) +{ + atomic_long_set(&page->pp_frag_count, nr); +} + +static inline long page_pool_defrag_page(struct page *page, long nr) +{ + long ret; + + /* If nr == pp_frag_count then we have cleared all remaining + * references to the page. No need to actually overwrite it, instead + * we can leave this to be overwritten by the calling function. + * + * The main advantage to doing this is that an atomic_read is + * generally a much cheaper operation than an atomic update, + * especially when dealing with a page that may be partitioned + * into only 2 or 3 pieces. + */ + if (atomic_long_read(&page->pp_frag_count) == nr) + return 0; + + ret = atomic_long_sub_return(nr, &page->pp_frag_count); + WARN_ON(ret < 0); + return ret; +} + +static inline bool page_pool_is_last_frag(struct page_pool *pool, + struct page *page) +{ + /* If fragments aren't enabled or count is 0 we were the last user */ + return !(pool->p.flags & PP_FLAG_PAGE_FRAG) || + (page_pool_defrag_page(page, 1) == 0); +} + +/** + * page_pool_put_page() - release a reference to a page pool page + * @pool: pool from which page was allocated + * @page: page to release a reference on + * @dma_sync_size: how much of the page may have been touched by the device + * @allow_direct: released by the consumer, allow lockless caching + * + * The outcome of this depends on the page refcnt. If the driver bumps + * the refcnt > 1 this will unmap the page. If the page refcnt is 1 + * the allocator owns the page and will try to recycle it in one of the pool + * caches. If PP_FLAG_DMA_SYNC_DEV is set, the page will be synced for_device + * using dma_sync_single_range_for_device(). + */ +static inline void page_pool_put_page(struct page_pool *pool, + struct page *page, + unsigned int dma_sync_size, + bool allow_direct) +{ + /* When page_pool isn't compiled-in, net/core/xdp.c doesn't + * allow registering MEM_TYPE_PAGE_POOL, but shield linker. + */ +#ifdef CONFIG_PAGE_POOL + if (!page_pool_is_last_frag(pool, page)) + return; + + page_pool_put_defragged_page(pool, page, dma_sync_size, allow_direct); +#endif +} + +/** + * page_pool_put_full_page() - release a reference on a page pool page + * @pool: pool from which page was allocated + * @page: page to release a reference on + * @allow_direct: released by the consumer, allow lockless caching + * + * Similar to page_pool_put_page(), but will DMA sync the entire memory area + * as configured in &page_pool_params.max_len. + */ +static inline void page_pool_put_full_page(struct page_pool *pool, + struct page *page, bool allow_direct) +{ + page_pool_put_page(pool, page, -1, allow_direct); +} + +/** + * page_pool_recycle_direct() - release a reference on a page pool page + * @pool: pool from which page was allocated + * @page: page to release a reference on + * + * Similar to page_pool_put_full_page() but caller must guarantee safe context + * (e.g NAPI), since it will recycle the page directly into the pool fast cache. + */ +static inline void page_pool_recycle_direct(struct page_pool *pool, + struct page *page) +{ + page_pool_put_full_page(pool, page, true); +} + +#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT \ + (sizeof(dma_addr_t) > sizeof(unsigned long)) + +/** + * page_pool_get_dma_addr() - Retrieve the stored DMA address. + * @page: page allocated from a page pool + * + * Fetch the DMA address of the page. The page pool to which the page belongs + * must had been created with PP_FLAG_DMA_MAP. + */ +static inline dma_addr_t page_pool_get_dma_addr(struct page *page) +{ + dma_addr_t ret = page->dma_addr; + + if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT) + ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16; + + return ret; +} + +static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr) +{ + page->dma_addr = addr; + if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT) + page->dma_addr_upper = upper_32_bits(addr); +} + +static inline bool page_pool_put(struct page_pool *pool) +{ + return refcount_dec_and_test(&pool->user_cnt); +} + +static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid) +{ + if (unlikely(pool->p.nid != new_nid)) + page_pool_update_nid(pool, new_nid); +} + +#endif /* _NET_PAGE_POOL_HELPERS_H */ diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h new file mode 100644 index 000000000000..9ac39191bed7 --- /dev/null +++ b/include/net/page_pool/types.h @@ -0,0 +1,238 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _NET_PAGE_POOL_TYPES_H +#define _NET_PAGE_POOL_TYPES_H + +#include +#include + +#define PP_FLAG_DMA_MAP BIT(0) /* Should page_pool do the DMA + * map/unmap + */ +#define PP_FLAG_DMA_SYNC_DEV BIT(1) /* If set all pages that the driver gets + * from page_pool will be + * DMA-synced-for-device according to + * the length provided by the device + * driver. + * Please note DMA-sync-for-CPU is still + * device driver responsibility + */ +#define PP_FLAG_PAGE_FRAG BIT(2) /* for page frag feature */ +#define PP_FLAG_ALL (PP_FLAG_DMA_MAP |\ + PP_FLAG_DMA_SYNC_DEV |\ + PP_FLAG_PAGE_FRAG) + +/* + * Fast allocation side cache array/stack + * + * The cache size and refill watermark is related to the network + * use-case. The NAPI budget is 64 packets. After a NAPI poll the RX + * ring is usually refilled and the max consumed elements will be 64, + * thus a natural max size of objects needed in the cache. + * + * Keeping room for more objects, is due to XDP_DROP use-case. As + * XDP_DROP allows the opportunity to recycle objects directly into + * this array, as it shares the same softirq/NAPI protection. If + * cache is already full (or partly full) then the XDP_DROP recycles + * would have to take a slower code path. + */ +#define PP_ALLOC_CACHE_SIZE 128 +#define PP_ALLOC_CACHE_REFILL 64 +struct pp_alloc_cache { + u32 count; + struct page *cache[PP_ALLOC_CACHE_SIZE]; +}; + +/** + * struct page_pool_params - page pool parameters + * @flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV, PP_FLAG_PAGE_FRAG + * @order: 2^order pages on allocation + * @pool_size: size of the ptr_ring + * @nid: NUMA node id to allocate from pages from + * @dev: device, for DMA pre-mapping purposes + * @napi: NAPI which is the sole consumer of pages, otherwise NULL + * @dma_dir: DMA mapping direction + * @max_len: max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV + * @offset: DMA sync address offset for PP_FLAG_DMA_SYNC_DEV + */ +struct page_pool_params { + unsigned int flags; + unsigned int order; + unsigned int pool_size; + int nid; + struct device *dev; + struct napi_struct *napi; + enum dma_data_direction dma_dir; + unsigned int max_len; + unsigned int offset; +/* private: used by test code only */ + void (*init_callback)(struct page *page, void *arg); + void *init_arg; +}; + +#ifdef CONFIG_PAGE_POOL_STATS +/** + * struct page_pool_alloc_stats - allocation statistics + * @fast: successful fast path allocations + * @slow: slow path order-0 allocations + * @slow_high_order: slow path high order allocations + * @empty: ptr ring is empty, so a slow path allocation was forced + * @refill: an allocation which triggered a refill of the cache + * @waive: pages obtained from the ptr ring that cannot be added to + * the cache due to a NUMA mismatch + */ +struct page_pool_alloc_stats { + u64 fast; + u64 slow; + u64 slow_high_order; + u64 empty; + u64 refill; + u64 waive; +}; + +/** + * struct page_pool_recycle_stats - recycling (freeing) statistics + * @cached: recycling placed page in the page pool cache + * @cache_full: page pool cache was full + * @ring: page placed into the ptr ring + * @ring_full: page released from page pool because the ptr ring was full + * @released_refcnt: page released (and not recycled) because refcnt > 1 + */ +struct page_pool_recycle_stats { + u64 cached; + u64 cache_full; + u64 ring; + u64 ring_full; + u64 released_refcnt; +}; + +/** + * struct page_pool_stats - combined page pool use statistics + * @alloc_stats: see struct page_pool_alloc_stats + * @recycle_stats: see struct page_pool_recycle_stats + * + * Wrapper struct for combining page pool stats with different storage + * requirements. + */ +struct page_pool_stats { + struct page_pool_alloc_stats alloc_stats; + struct page_pool_recycle_stats recycle_stats; +}; +#endif + +struct page_pool { + struct page_pool_params p; + + struct delayed_work release_dw; + void (*disconnect)(void *pool); + unsigned long defer_start; + unsigned long defer_warn; + + u32 pages_state_hold_cnt; + unsigned int frag_offset; + struct page *frag_page; + long frag_users; + +#ifdef CONFIG_PAGE_POOL_STATS + /* these stats are incremented while in softirq context */ + struct page_pool_alloc_stats alloc_stats; +#endif + u32 xdp_mem_id; + + /* + * Data structure for allocation side + * + * Drivers allocation side usually already perform some kind + * of resource protection. Piggyback on this protection, and + * require driver to protect allocation side. + * + * For NIC drivers this means, allocate a page_pool per + * RX-queue. As the RX-queue is already protected by + * Softirq/BH scheduling and napi_schedule. NAPI schedule + * guarantee that a single napi_struct will only be scheduled + * on a single CPU (see napi_schedule). + */ + struct pp_alloc_cache alloc ____cacheline_aligned_in_smp; + + /* Data structure for storing recycled pages. + * + * Returning/freeing pages is more complicated synchronization + * wise, because free's can happen on remote CPUs, with no + * association with allocation resource. + * + * Use ptr_ring, as it separates consumer and producer + * efficiently, it a way that doesn't bounce cache-lines. + * + * TODO: Implement bulk return pages into this structure. + */ + struct ptr_ring ring; + +#ifdef CONFIG_PAGE_POOL_STATS + /* recycle stats are per-cpu to avoid locking */ + struct page_pool_recycle_stats __percpu *recycle_stats; +#endif + atomic_t pages_state_release_cnt; + + /* A page_pool is strictly tied to a single RX-queue being + * protected by NAPI, due to above pp_alloc_cache. This + * refcnt serves purpose is to simplify drivers error handling. + */ + refcount_t user_cnt; + + u64 destroy_cnt; +}; + +struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp); +struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset, + unsigned int size, gfp_t gfp); +bool page_pool_return_skb_page(struct page *page, bool napi_safe); + +struct page_pool *page_pool_create(const struct page_pool_params *params); + +struct xdp_mem_info; + +#ifdef CONFIG_PAGE_POOL +void page_pool_unlink_napi(struct page_pool *pool); +void page_pool_destroy(struct page_pool *pool); +void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *), + struct xdp_mem_info *mem); +void page_pool_put_page_bulk(struct page_pool *pool, void **data, + int count); +#else +static inline void page_pool_unlink_napi(struct page_pool *pool) +{ +} + +static inline void page_pool_destroy(struct page_pool *pool) +{ +} + +static inline void page_pool_use_xdp_mem(struct page_pool *pool, + void (*disconnect)(void *), + struct xdp_mem_info *mem) +{ +} + +static inline void page_pool_put_page_bulk(struct page_pool *pool, void **data, + int count) +{ +} +#endif + +void page_pool_put_defragged_page(struct page_pool *pool, struct page *page, + unsigned int dma_sync_size, + bool allow_direct); + +static inline bool is_page_pool_compiled_in(void) +{ +#ifdef CONFIG_PAGE_POOL + return true; +#else + return false; +#endif +} + +/* Caller must provide appropriate safe context, e.g. NAPI. */ +void page_pool_update_nid(struct page_pool *pool, int new_nid); + +#endif /* _NET_PAGE_POOL_H */ diff --git a/include/trace/events/page_pool.h b/include/trace/events/page_pool.h index ca534501158b..6834356b2d2a 100644 --- a/include/trace/events/page_pool.h +++ b/include/trace/events/page_pool.h @@ -9,7 +9,7 @@ #include #include -#include +#include TRACE_EVENT(page_pool_release, diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 0aac76c13fd4..57a7a64b84ed 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -15,7 +15,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 5d615a169718..cd28c1f14002 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -10,7 +10,7 @@ #include #include -#include +#include #include #include diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c6f98245582c..d3bed964123c 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -73,7 +73,7 @@ #include #include #include -#include +#include #include #include diff --git a/net/core/xdp.c b/net/core/xdp.c index 8362130bf085..a70670fe9a2d 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -14,7 +14,7 @@ #include #include #include -#include +#include #include #include /* struct xdp_mem_allocator */ -- cgit v1.2.3 From 7149b38dc7cbc77548afd65e4bb4d4b11dca8670 Mon Sep 17 00:00:00 2001 From: Christophe Leroy Date: Fri, 4 Aug 2023 15:30:19 +0200 Subject: net: fs_enet: Remove linux/fs_enet_pd.h linux/fs_enet_pd.h is not used anymore. Remove it. Signed-off-by: Christophe Leroy Reviewed-by: Simon Horman Link: https://lore.kernel.org/r/5be102791c987792ad127b15543ee6715394cf67.1691155347.git.christophe.leroy@csgroup.eu Signed-off-by: Jakub Kicinski --- MAINTAINERS | 1 - include/linux/fs_enet_pd.h | 118 --------------------------------------------- 2 files changed, 119 deletions(-) delete mode 100644 include/linux/fs_enet_pd.h (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index 08bcf3a7c482..6efcd8713682 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8371,7 +8371,6 @@ L: linuxppc-dev@lists.ozlabs.org L: netdev@vger.kernel.org S: Maintained F: drivers/net/ethernet/freescale/fs_enet/ -F: include/linux/fs_enet_pd.h FREESCALE SOC SOUND DRIVERS M: Shengjiu Wang diff --git a/include/linux/fs_enet_pd.h b/include/linux/fs_enet_pd.h deleted file mode 100644 index 7c9897dab558..000000000000 --- a/include/linux/fs_enet_pd.h +++ /dev/null @@ -1,118 +0,0 @@ -/* - * Platform information definitions for the - * universal Freescale Ethernet driver. - * - * Copyright (c) 2003 Intracom S.A. - * by Pantelis Antoniou - * - * 2005 (c) MontaVista Software, Inc. - * Vitaly Bordug - * - * This file is licensed under the terms of the GNU General Public License - * version 2. This program is licensed "as is" without any warranty of any - * kind, whether express or implied. - */ - -#ifndef FS_ENET_PD_H -#define FS_ENET_PD_H - -#include -#include -#include -#include -#include - -#define FS_ENET_NAME "fs_enet" - -enum fs_id { - fsid_fec1, - fsid_fec2, - fsid_fcc1, - fsid_fcc2, - fsid_fcc3, - fsid_scc1, - fsid_scc2, - fsid_scc3, - fsid_scc4, -}; - -#define FS_MAX_INDEX 9 - -static inline int fs_get_fec_index(enum fs_id id) -{ - if (id >= fsid_fec1 && id <= fsid_fec2) - return id - fsid_fec1; - return -1; -} - -static inline int fs_get_fcc_index(enum fs_id id) -{ - if (id >= fsid_fcc1 && id <= fsid_fcc3) - return id - fsid_fcc1; - return -1; -} - -static inline int fs_get_scc_index(enum fs_id id) -{ - if (id >= fsid_scc1 && id <= fsid_scc4) - return id - fsid_scc1; - return -1; -} - -static inline int fs_fec_index2id(int index) -{ - int id = fsid_fec1 + index - 1; - if (id >= fsid_fec1 && id <= fsid_fec2) - return id; - return FS_MAX_INDEX; - } - -static inline int fs_fcc_index2id(int index) -{ - int id = fsid_fcc1 + index - 1; - if (id >= fsid_fcc1 && id <= fsid_fcc3) - return id; - return FS_MAX_INDEX; -} - -static inline int fs_scc_index2id(int index) -{ - int id = fsid_scc1 + index - 1; - if (id >= fsid_scc1 && id <= fsid_scc4) - return id; - return FS_MAX_INDEX; -} - -enum fs_mii_method { - fsmii_fixed, - fsmii_fec, - fsmii_bitbang, -}; - -enum fs_ioport { - fsiop_porta, - fsiop_portb, - fsiop_portc, - fsiop_portd, - fsiop_porte, -}; - -struct fs_mii_bit { - u32 offset; - u8 bit; - u8 polarity; -}; -struct fs_mii_bb_platform_info { - struct fs_mii_bit mdio_dir; - struct fs_mii_bit mdio_dat; - struct fs_mii_bit mdc_dat; - int delay; /* delay in us */ - int irq[32]; /* irqs per phy's */ -}; - -struct fs_mii_fec_platform_info { - u32 irq[32]; - u32 mii_speed; -}; - -#endif -- cgit v1.2.3 From 40b0425f8ba17c32cf7182975032a3999c364dfc Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Mon, 7 Aug 2023 22:33:19 +0300 Subject: net: ptp: create a mock-up PTP Hardware Clock driver There are several cases where virtual net devices may benefit from having a PTP clock, and these have to do with testing. I can see at least netdevsim and veth as potential users of a common mock-up PTP hardware clock driver. The proposed idea is to create an object which emulates PTP clock operations on top of the unadjustable CLOCK_MONOTONIC_RAW plus a software-controlled time domain via a timecounter/cyclecounter and then link that PHC to the netdevsim device. The driver is fully functional for its intended purpose, and it successfully passes the PTP selftests. $ cd tools/testing/selftests/ptp/ $ ./phc.sh /dev/ptp2 TEST: settime [ OK ] TEST: adjtime [ OK ] TEST: adjfreq [ OK ] Signed-off-by: Vladimir Oltean Link: https://lore.kernel.org/r/20230807193324.4128292-7-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski --- MAINTAINERS | 7 ++ drivers/ptp/Kconfig | 11 +++ drivers/ptp/Makefile | 1 + drivers/ptp/ptp_mock.c | 175 +++++++++++++++++++++++++++++++++++++++++++++++ include/linux/ptp_mock.h | 38 ++++++++++ 5 files changed, 232 insertions(+) create mode 100644 drivers/ptp/ptp_mock.c create mode 100644 include/linux/ptp_mock.h (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index 6efcd8713682..49f032556cf1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -17170,6 +17170,13 @@ F: drivers/ptp/* F: include/linux/ptp_cl* K: (?:\b|_)ptp(?:\b|_) +PTP MOCKUP CLOCK SUPPORT +M: Vladimir Oltean +L: netdev@vger.kernel.org +S: Maintained +F: drivers/ptp/ptp_mock.c +F: include/linux/ptp_mock.h + PTP VIRTUAL CLOCK SUPPORT M: Yangbo Lu L: netdev@vger.kernel.org diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig index 32dff1b4f891..ed9d97a032f1 100644 --- a/drivers/ptp/Kconfig +++ b/drivers/ptp/Kconfig @@ -155,6 +155,17 @@ config PTP_1588_CLOCK_IDTCM To compile this driver as a module, choose M here: the module will be called ptp_clockmatrix. +config PTP_1588_CLOCK_MOCK + tristate "Mock-up PTP clock" + depends on PTP_1588_CLOCK + help + This driver offers a set of PTP clock manipulation operations over + the system monotonic time. It can be used by virtual network device + drivers to emulate PTP capabilities. + + To compile this driver as a module, choose M here: the module + will be called ptp_mock. + config PTP_1588_CLOCK_VMW tristate "VMware virtual PTP clock" depends on ACPI && HYPERVISOR_GUEST && X86 diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile index 553f18bf3c83..dea0cebd2303 100644 --- a/drivers/ptp/Makefile +++ b/drivers/ptp/Makefile @@ -16,6 +16,7 @@ ptp-qoriq-y += ptp_qoriq.o ptp-qoriq-$(CONFIG_DEBUG_FS) += ptp_qoriq_debugfs.o obj-$(CONFIG_PTP_1588_CLOCK_IDTCM) += ptp_clockmatrix.o obj-$(CONFIG_PTP_1588_CLOCK_IDT82P33) += ptp_idt82p33.o +obj-$(CONFIG_PTP_1588_CLOCK_MOCK) += ptp_mock.o obj-$(CONFIG_PTP_1588_CLOCK_VMW) += ptp_vmw.o obj-$(CONFIG_PTP_1588_CLOCK_OCP) += ptp_ocp.o obj-$(CONFIG_PTP_DFL_TOD) += ptp_dfl_tod.o diff --git a/drivers/ptp/ptp_mock.c b/drivers/ptp/ptp_mock.c new file mode 100644 index 000000000000..e7b459c846a2 --- /dev/null +++ b/drivers/ptp/ptp_mock.c @@ -0,0 +1,175 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright 2023 NXP + * + * Mock-up PTP Hardware Clock driver for virtual network devices + * + * Create a PTP clock which offers PTP time manipulation operations + * using a timecounter/cyclecounter on top of CLOCK_MONOTONIC_RAW. + */ + +#include +#include +#include + +/* Clamp scaled_ppm between -2,097,152,000 and 2,097,152,000, + * and thus "adj" between -68,719,476 and 68,719,476 + */ +#define MOCK_PHC_MAX_ADJ_PPB 32000000 +/* Timestamps from ktime_get_raw() have 1 ns resolution, so the scale factor + * (MULT >> SHIFT) needs to be 1. Pick SHIFT as 31 bits, which translates + * MULT(freq 0) into 0x80000000. + */ +#define MOCK_PHC_CC_SHIFT 31 +#define MOCK_PHC_CC_MULT (1 << MOCK_PHC_CC_SHIFT) +#define MOCK_PHC_FADJ_SHIFT 9 +#define MOCK_PHC_FADJ_DENOMINATOR 15625ULL + +/* The largest cycle_delta that timecounter_read_delta() can handle without a + * 64-bit overflow during the multiplication with cc->mult, given the max "adj" + * we permit, is ~8.3 seconds. Make sure readouts are more frequent than that. + */ +#define MOCK_PHC_REFRESH_INTERVAL (HZ * 5) + +#define info_to_phc(d) container_of((d), struct mock_phc, info) + +struct mock_phc { + struct ptp_clock_info info; + struct ptp_clock *clock; + struct timecounter tc; + struct cyclecounter cc; + spinlock_t lock; +}; + +static u64 mock_phc_cc_read(const struct cyclecounter *cc) +{ + return ktime_get_raw_ns(); +} + +static int mock_phc_adjfine(struct ptp_clock_info *info, long scaled_ppm) +{ + struct mock_phc *phc = info_to_phc(info); + s64 adj; + + adj = (s64)scaled_ppm << MOCK_PHC_FADJ_SHIFT; + adj = div_s64(adj, MOCK_PHC_FADJ_DENOMINATOR); + + spin_lock(&phc->lock); + timecounter_read(&phc->tc); + phc->cc.mult = MOCK_PHC_CC_MULT + adj; + spin_unlock(&phc->lock); + + return 0; +} + +static int mock_phc_adjtime(struct ptp_clock_info *info, s64 delta) +{ + struct mock_phc *phc = info_to_phc(info); + + spin_lock(&phc->lock); + timecounter_adjtime(&phc->tc, delta); + spin_unlock(&phc->lock); + + return 0; +} + +static int mock_phc_settime64(struct ptp_clock_info *info, + const struct timespec64 *ts) +{ + struct mock_phc *phc = info_to_phc(info); + u64 ns = timespec64_to_ns(ts); + + spin_lock(&phc->lock); + timecounter_init(&phc->tc, &phc->cc, ns); + spin_unlock(&phc->lock); + + return 0; +} + +static int mock_phc_gettime64(struct ptp_clock_info *info, struct timespec64 *ts) +{ + struct mock_phc *phc = info_to_phc(info); + u64 ns; + + spin_lock(&phc->lock); + ns = timecounter_read(&phc->tc); + spin_unlock(&phc->lock); + + *ts = ns_to_timespec64(ns); + + return 0; +} + +static long mock_phc_refresh(struct ptp_clock_info *info) +{ + struct timespec64 ts; + + mock_phc_gettime64(info, &ts); + + return MOCK_PHC_REFRESH_INTERVAL; +} + +int mock_phc_index(struct mock_phc *phc) +{ + return ptp_clock_index(phc->clock); +} +EXPORT_SYMBOL_GPL(mock_phc_index); + +struct mock_phc *mock_phc_create(struct device *dev) +{ + struct mock_phc *phc; + int err; + + phc = kzalloc(sizeof(*phc), GFP_KERNEL); + if (!phc) { + err = -ENOMEM; + goto out; + } + + phc->info = (struct ptp_clock_info) { + .owner = THIS_MODULE, + .name = "Mock-up PTP clock", + .max_adj = MOCK_PHC_MAX_ADJ_PPB, + .adjfine = mock_phc_adjfine, + .adjtime = mock_phc_adjtime, + .gettime64 = mock_phc_gettime64, + .settime64 = mock_phc_settime64, + .do_aux_work = mock_phc_refresh, + }; + + phc->cc = (struct cyclecounter) { + .read = mock_phc_cc_read, + .mask = CYCLECOUNTER_MASK(64), + .mult = MOCK_PHC_CC_MULT, + .shift = MOCK_PHC_CC_SHIFT, + }; + + spin_lock_init(&phc->lock); + timecounter_init(&phc->tc, &phc->cc, 0); + + phc->clock = ptp_clock_register(&phc->info, dev); + if (IS_ERR(phc->clock)) { + err = PTR_ERR(phc->clock); + goto out_free_phc; + } + + ptp_schedule_worker(phc->clock, MOCK_PHC_REFRESH_INTERVAL); + + return phc; + +out_free_phc: + kfree(phc); +out: + return ERR_PTR(err); +} +EXPORT_SYMBOL_GPL(mock_phc_create); + +void mock_phc_destroy(struct mock_phc *phc) +{ + ptp_clock_unregister(phc->clock); + kfree(phc); +} +EXPORT_SYMBOL_GPL(mock_phc_destroy); + +MODULE_DESCRIPTION("Mock-up PTP Hardware Clock driver"); +MODULE_LICENSE("GPL"); diff --git a/include/linux/ptp_mock.h b/include/linux/ptp_mock.h new file mode 100644 index 000000000000..72eb401034d9 --- /dev/null +++ b/include/linux/ptp_mock.h @@ -0,0 +1,38 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Mock-up PTP Hardware Clock driver for virtual network devices + * + * Copyright 2023 NXP + */ + +#ifndef _PTP_MOCK_H_ +#define _PTP_MOCK_H_ + +struct device; +struct mock_phc; + +#if IS_ENABLED(CONFIG_PTP_1588_CLOCK_MOCK) + +struct mock_phc *mock_phc_create(struct device *dev); +void mock_phc_destroy(struct mock_phc *phc); +int mock_phc_index(struct mock_phc *phc); + +#else + +static inline struct mock_phc *mock_phc_create(struct device *dev) +{ + return NULL; +} + +static inline void mock_phc_destroy(struct mock_phc *phc) +{ +} + +static inline int mock_phc_index(struct mock_phc *phc) +{ + return -1; +} + +#endif + +#endif /* _PTP_MOCK_H_ */ -- cgit v1.2.3 From 7fd034bce6d276353a4e70c516789dfe6b2c32ae Mon Sep 17 00:00:00 2001 From: Louis Peens Date: Tue, 15 Aug 2023 14:43:25 +0200 Subject: nfp: update maintainer Take over maintainership of the nfp driver from Simon as he is moving away from Corigine. Signed-off-by: Louis Peens Acked-by: Simon Horman Signed-off-by: David S. Miller --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index d984c9a7b12c..f150561a396e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14658,7 +14658,7 @@ F: drivers/rtc/rtc-ntxec.c F: include/linux/mfd/ntxec.h NETRONOME ETHERNET DRIVERS -M: Simon Horman +M: Louis Peens R: Jakub Kicinski L: oss-drivers@corigine.com S: Maintained -- cgit v1.2.3 From f629acc6f21043fdc80ae93cdafa6713888db0fd Mon Sep 17 00:00:00 2001 From: Jiawen Wu Date: Wed, 23 Aug 2023 14:19:29 +0800 Subject: net: pcs: xpcs: support to switch mode for Wangxun NICs According to chapter 6 of DesignWare Cores Ethernet PCS (version 3.20a) and custom design manual, add a configuration flow for switching interface mode. If the interface changes, the following setting is required: 1. wait VR_XS_PCS_DIG_STS bit(4, 2) [PSEQ_STATE] = 100b (Power-Good) 2. write SR_XS_PCS_CTRL2 to select various PCS type 3. write SR_PMA_CTRL1 and/or SR_XS_PCS_CTRL1 for link speed 4. program PMA registers 5. write VR_XS_PCS_DIG_CTRL1 bit(15) [VR_RST] = 1b (Vendor-Specific Soft Reset) 6. wait for VR_XS_PCS_DIG_CTRL1 bit(15) [VR_RST] to get cleared Only 10GBASE-R/SGMII/1000BASE-X modes are planned for the current Wangxun devices. And there is a quirk for Wangxun devices to switch mode although the interface in phylink state has not changed, since PCS will change to default 10GBASE-R when the ethernet driver(txgbe) do LAN reset. Signed-off-by: Jiawen Wu Signed-off-by: David S. Miller --- MAINTAINERS | 1 + drivers/net/pcs/Makefile | 2 +- drivers/net/pcs/pcs-xpcs-wx.c | 208 ++++++++++++++++++++++++++++++++++++++++++ drivers/net/pcs/pcs-xpcs.c | 13 ++- drivers/net/pcs/pcs-xpcs.h | 8 ++ include/linux/pcs/pcs-xpcs.h | 1 + 6 files changed, 227 insertions(+), 6 deletions(-) create mode 100644 drivers/net/pcs/pcs-xpcs-wx.c (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index dfc2ba0c20f3..5f527a9ae4b2 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -22925,6 +22925,7 @@ S: Maintained W: https://www.net-swift.com F: Documentation/networking/device_drivers/ethernet/wangxun/* F: drivers/net/ethernet/wangxun/ +F: drivers/net/pcs/pcs-xpcs-wx.c WATCHDOG DEVICE DRIVERS M: Wim Van Sebroeck diff --git a/drivers/net/pcs/Makefile b/drivers/net/pcs/Makefile index ea662a7989b2..fb1694192ae6 100644 --- a/drivers/net/pcs/Makefile +++ b/drivers/net/pcs/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 # Makefile for Linux PCS drivers -pcs_xpcs-$(CONFIG_PCS_XPCS) := pcs-xpcs.o pcs-xpcs-nxp.o +pcs_xpcs-$(CONFIG_PCS_XPCS) := pcs-xpcs.o pcs-xpcs-nxp.o pcs-xpcs-wx.o obj-$(CONFIG_PCS_XPCS) += pcs_xpcs.o obj-$(CONFIG_PCS_LYNX) += pcs-lynx.o diff --git a/drivers/net/pcs/pcs-xpcs-wx.c b/drivers/net/pcs/pcs-xpcs-wx.c new file mode 100644 index 000000000000..1f228d5a1398 --- /dev/null +++ b/drivers/net/pcs/pcs-xpcs-wx.c @@ -0,0 +1,208 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2015 - 2023 Beijing WangXun Technology Co., Ltd. */ + +#include +#include +#include "pcs-xpcs.h" + +/* VR_XS_PMA_MMD */ +#define TXGBE_PMA_MMD 0x8020 +#define TXGBE_TX_GENCTL1 0x11 +#define TXGBE_TX_GENCTL1_VBOOST_LVL GENMASK(10, 8) +#define TXGBE_TX_GENCTL1_VBOOST_EN0 BIT(4) +#define TXGBE_TX_GEN_CTL2 0x12 +#define TXGBE_TX_GEN_CTL2_TX0_WIDTH(v) FIELD_PREP(GENMASK(9, 8), v) +#define TXGBE_TX_RATE_CTL 0x14 +#define TXGBE_TX_RATE_CTL_TX0_RATE(v) FIELD_PREP(GENMASK(2, 0), v) +#define TXGBE_RX_GEN_CTL2 0x32 +#define TXGBE_RX_GEN_CTL2_RX0_WIDTH(v) FIELD_PREP(GENMASK(9, 8), v) +#define TXGBE_RX_GEN_CTL3 0x33 +#define TXGBE_RX_GEN_CTL3_LOS_TRSHLD0 GENMASK(2, 0) +#define TXGBE_RX_RATE_CTL 0x34 +#define TXGBE_RX_RATE_CTL_RX0_RATE(v) FIELD_PREP(GENMASK(1, 0), v) +#define TXGBE_RX_EQ_ATTN_CTL 0x37 +#define TXGBE_RX_EQ_ATTN_LVL0 GENMASK(2, 0) +#define TXGBE_RX_EQ_CTL0 0x38 +#define TXGBE_RX_EQ_CTL0_VGA1_GAIN(v) FIELD_PREP(GENMASK(15, 12), v) +#define TXGBE_RX_EQ_CTL0_VGA2_GAIN(v) FIELD_PREP(GENMASK(11, 8), v) +#define TXGBE_RX_EQ_CTL0_CTLE_POLE(v) FIELD_PREP(GENMASK(7, 5), v) +#define TXGBE_RX_EQ_CTL0_CTLE_BOOST(v) FIELD_PREP(GENMASK(4, 0), v) +#define TXGBE_RX_EQ_CTL4 0x3C +#define TXGBE_RX_EQ_CTL4_CONT_OFF_CAN0 BIT(4) +#define TXGBE_RX_EQ_CTL4_CONT_ADAPT0 BIT(0) +#define TXGBE_AFE_DFE_ENABLE 0x3D +#define TXGBE_DFE_EN_0 BIT(4) +#define TXGBE_AFE_EN_0 BIT(0) +#define TXGBE_DFE_TAP_CTL0 0x3E +#define TXGBE_MPLLA_CTL0 0x51 +#define TXGBE_MPLLA_CTL2 0x53 +#define TXGBE_MPLLA_CTL2_DIV16P5_CLK_EN BIT(10) +#define TXGBE_MPLLA_CTL2_DIV10_CLK_EN BIT(9) +#define TXGBE_MPLLA_CTL3 0x57 +#define TXGBE_MISC_CTL0 0x70 +#define TXGBE_MISC_CTL0_PLL BIT(15) +#define TXGBE_MISC_CTL0_CR_PARA_SEL BIT(14) +#define TXGBE_MISC_CTL0_RX_VREF(v) FIELD_PREP(GENMASK(12, 8), v) +#define TXGBE_VCO_CAL_LD0 0x72 +#define TXGBE_VCO_CAL_REF0 0x76 + +static int txgbe_read_pma(struct dw_xpcs *xpcs, int reg) +{ + return xpcs_read(xpcs, MDIO_MMD_PMAPMD, TXGBE_PMA_MMD + reg); +} + +static int txgbe_write_pma(struct dw_xpcs *xpcs, int reg, u16 val) +{ + return xpcs_write(xpcs, MDIO_MMD_PMAPMD, TXGBE_PMA_MMD + reg, val); +} + +static void txgbe_pma_config_10gbaser(struct dw_xpcs *xpcs) +{ + int val; + + txgbe_write_pma(xpcs, TXGBE_MPLLA_CTL0, 0x21); + txgbe_write_pma(xpcs, TXGBE_MPLLA_CTL3, 0); + val = txgbe_read_pma(xpcs, TXGBE_TX_GENCTL1); + val = u16_replace_bits(val, 0x5, TXGBE_TX_GENCTL1_VBOOST_LVL); + txgbe_write_pma(xpcs, TXGBE_TX_GENCTL1, val); + txgbe_write_pma(xpcs, TXGBE_MISC_CTL0, TXGBE_MISC_CTL0_PLL | + TXGBE_MISC_CTL0_CR_PARA_SEL | TXGBE_MISC_CTL0_RX_VREF(0xF)); + txgbe_write_pma(xpcs, TXGBE_VCO_CAL_LD0, 0x549); + txgbe_write_pma(xpcs, TXGBE_VCO_CAL_REF0, 0x29); + txgbe_write_pma(xpcs, TXGBE_TX_RATE_CTL, 0); + txgbe_write_pma(xpcs, TXGBE_RX_RATE_CTL, 0); + txgbe_write_pma(xpcs, TXGBE_TX_GEN_CTL2, TXGBE_TX_GEN_CTL2_TX0_WIDTH(3)); + txgbe_write_pma(xpcs, TXGBE_RX_GEN_CTL2, TXGBE_RX_GEN_CTL2_RX0_WIDTH(3)); + txgbe_write_pma(xpcs, TXGBE_MPLLA_CTL2, TXGBE_MPLLA_CTL2_DIV16P5_CLK_EN | + TXGBE_MPLLA_CTL2_DIV10_CLK_EN); + + txgbe_write_pma(xpcs, TXGBE_RX_EQ_CTL0, TXGBE_RX_EQ_CTL0_CTLE_POLE(2) | + TXGBE_RX_EQ_CTL0_CTLE_BOOST(5)); + val = txgbe_read_pma(xpcs, TXGBE_RX_EQ_ATTN_CTL); + val &= ~TXGBE_RX_EQ_ATTN_LVL0; + txgbe_write_pma(xpcs, TXGBE_RX_EQ_ATTN_CTL, val); + txgbe_write_pma(xpcs, TXGBE_DFE_TAP_CTL0, 0xBE); + val = txgbe_read_pma(xpcs, TXGBE_AFE_DFE_ENABLE); + val &= ~(TXGBE_DFE_EN_0 | TXGBE_AFE_EN_0); + txgbe_write_pma(xpcs, TXGBE_AFE_DFE_ENABLE, val); + val = txgbe_read_pma(xpcs, TXGBE_RX_EQ_CTL4); + val &= ~TXGBE_RX_EQ_CTL4_CONT_ADAPT0; + txgbe_write_pma(xpcs, TXGBE_RX_EQ_CTL4, val); +} + +static void txgbe_pma_config_1g(struct dw_xpcs *xpcs) +{ + int val; + + val = txgbe_read_pma(xpcs, TXGBE_TX_GENCTL1); + val = u16_replace_bits(val, 0x5, TXGBE_TX_GENCTL1_VBOOST_LVL); + val &= ~TXGBE_TX_GENCTL1_VBOOST_EN0; + txgbe_write_pma(xpcs, TXGBE_TX_GENCTL1, val); + txgbe_write_pma(xpcs, TXGBE_MISC_CTL0, TXGBE_MISC_CTL0_PLL | + TXGBE_MISC_CTL0_CR_PARA_SEL | TXGBE_MISC_CTL0_RX_VREF(0xF)); + + txgbe_write_pma(xpcs, TXGBE_RX_EQ_CTL0, TXGBE_RX_EQ_CTL0_VGA1_GAIN(7) | + TXGBE_RX_EQ_CTL0_VGA2_GAIN(7) | TXGBE_RX_EQ_CTL0_CTLE_BOOST(6)); + val = txgbe_read_pma(xpcs, TXGBE_RX_EQ_ATTN_CTL); + val &= ~TXGBE_RX_EQ_ATTN_LVL0; + txgbe_write_pma(xpcs, TXGBE_RX_EQ_ATTN_CTL, val); + txgbe_write_pma(xpcs, TXGBE_DFE_TAP_CTL0, 0); + val = txgbe_read_pma(xpcs, TXGBE_RX_GEN_CTL3); + val = u16_replace_bits(val, 0x4, TXGBE_RX_GEN_CTL3_LOS_TRSHLD0); + txgbe_write_pma(xpcs, TXGBE_RX_EQ_ATTN_CTL, val); + + txgbe_write_pma(xpcs, TXGBE_MPLLA_CTL0, 0x20); + txgbe_write_pma(xpcs, TXGBE_MPLLA_CTL3, 0x46); + txgbe_write_pma(xpcs, TXGBE_VCO_CAL_LD0, 0x540); + txgbe_write_pma(xpcs, TXGBE_VCO_CAL_REF0, 0x2A); + txgbe_write_pma(xpcs, TXGBE_AFE_DFE_ENABLE, 0); + txgbe_write_pma(xpcs, TXGBE_RX_EQ_CTL4, TXGBE_RX_EQ_CTL4_CONT_OFF_CAN0); + txgbe_write_pma(xpcs, TXGBE_TX_RATE_CTL, TXGBE_TX_RATE_CTL_TX0_RATE(3)); + txgbe_write_pma(xpcs, TXGBE_RX_RATE_CTL, TXGBE_RX_RATE_CTL_RX0_RATE(3)); + txgbe_write_pma(xpcs, TXGBE_TX_GEN_CTL2, TXGBE_TX_GEN_CTL2_TX0_WIDTH(1)); + txgbe_write_pma(xpcs, TXGBE_RX_GEN_CTL2, TXGBE_RX_GEN_CTL2_RX0_WIDTH(1)); + txgbe_write_pma(xpcs, TXGBE_MPLLA_CTL2, TXGBE_MPLLA_CTL2_DIV10_CLK_EN); +} + +static int txgbe_pcs_poll_power_up(struct dw_xpcs *xpcs) +{ + int val, ret; + + /* Wait xpcs power-up good */ + ret = read_poll_timeout(xpcs_read_vpcs, val, + (val & DW_PSEQ_ST) == DW_PSEQ_ST_GOOD, + 10000, 1000000, false, + xpcs, DW_VR_XS_PCS_DIG_STS); + if (ret < 0) + dev_err(&xpcs->mdiodev->dev, "xpcs power-up timeout\n"); + + return ret; +} + +static int txgbe_pma_init_done(struct dw_xpcs *xpcs) +{ + int val, ret; + + xpcs_write_vpcs(xpcs, DW_VR_XS_PCS_DIG_CTRL1, DW_VR_RST | DW_EN_VSMMD1); + + /* wait pma initialization done */ + ret = read_poll_timeout(xpcs_read_vpcs, val, !(val & DW_VR_RST), + 100000, 10000000, false, + xpcs, DW_VR_XS_PCS_DIG_CTRL1); + if (ret < 0) + dev_err(&xpcs->mdiodev->dev, "xpcs pma initialization timeout\n"); + + return ret; +} + +static bool txgbe_xpcs_mode_quirk(struct dw_xpcs *xpcs) +{ + int ret; + + /* When txgbe do LAN reset, PCS will change to default 10GBASE-R mode */ + ret = xpcs_read(xpcs, MDIO_MMD_PCS, MDIO_CTRL2); + ret &= MDIO_PCS_CTRL2_TYPE; + if (ret == MDIO_PCS_CTRL2_10GBR && + xpcs->interface != PHY_INTERFACE_MODE_10GBASER) + return true; + + return false; +} + +int txgbe_xpcs_switch_mode(struct dw_xpcs *xpcs, phy_interface_t interface) +{ + int val, ret; + + switch (interface) { + case PHY_INTERFACE_MODE_10GBASER: + case PHY_INTERFACE_MODE_SGMII: + case PHY_INTERFACE_MODE_1000BASEX: + break; + default: + return 0; + } + + if (xpcs->interface == interface && !txgbe_xpcs_mode_quirk(xpcs)) + return 0; + + xpcs->interface = interface; + + ret = txgbe_pcs_poll_power_up(xpcs); + if (ret < 0) + return ret; + + if (interface == PHY_INTERFACE_MODE_10GBASER) { + xpcs_write(xpcs, MDIO_MMD_PCS, MDIO_CTRL2, MDIO_PCS_CTRL2_10GBR); + val = xpcs_read(xpcs, MDIO_MMD_PMAPMD, MDIO_CTRL1); + val |= MDIO_CTRL1_SPEED10G; + xpcs_write(xpcs, MDIO_MMD_PMAPMD, MDIO_CTRL1, val); + txgbe_pma_config_10gbaser(xpcs); + } else { + xpcs_write(xpcs, MDIO_MMD_PCS, MDIO_CTRL2, MDIO_PCS_CTRL2_10GBX); + xpcs_write(xpcs, MDIO_MMD_PMAPMD, MDIO_CTRL1, 0); + xpcs_write(xpcs, MDIO_MMD_PCS, MDIO_CTRL1, 0); + txgbe_pma_config_1g(xpcs); + } + + return txgbe_pma_init_done(xpcs); +} diff --git a/drivers/net/pcs/pcs-xpcs.c b/drivers/net/pcs/pcs-xpcs.c index 8b56b2f9f24d..4cd011405376 100644 --- a/drivers/net/pcs/pcs-xpcs.c +++ b/drivers/net/pcs/pcs-xpcs.c @@ -228,12 +228,12 @@ static int xpcs_write_vendor(struct dw_xpcs *xpcs, int dev, int reg, return xpcs_write(xpcs, dev, DW_VENDOR | reg, val); } -static int xpcs_read_vpcs(struct dw_xpcs *xpcs, int reg) +int xpcs_read_vpcs(struct dw_xpcs *xpcs, int reg) { return xpcs_read_vendor(xpcs, MDIO_MMD_PCS, reg); } -static int xpcs_write_vpcs(struct dw_xpcs *xpcs, int reg, u16 val) +int xpcs_write_vpcs(struct dw_xpcs *xpcs, int reg, u16 val) { return xpcs_write_vendor(xpcs, MDIO_MMD_PCS, reg, val); } @@ -841,6 +841,12 @@ int xpcs_do_config(struct dw_xpcs *xpcs, phy_interface_t interface, if (!compat) return -ENODEV; + if (xpcs->dev_flag == DW_DEV_TXGBE) { + ret = txgbe_xpcs_switch_mode(xpcs, interface); + if (ret) + return ret; + } + switch (compat->an_mode) { case DW_10GBASER: break; @@ -1313,9 +1319,6 @@ static struct dw_xpcs *xpcs_create(struct mdio_device *mdiodev, xpcs->pcs.ops = &xpcs_phylink_ops; xpcs->pcs.neg_mode = true; - if (compat->an_mode == DW_10GBASER) - return xpcs; - xpcs->pcs.poll = true; if (xpcs->dev_flag != DW_DEV_TXGBE) { diff --git a/drivers/net/pcs/pcs-xpcs.h b/drivers/net/pcs/pcs-xpcs.h index 68c6b5a62088..da61ad36946c 100644 --- a/drivers/net/pcs/pcs-xpcs.h +++ b/drivers/net/pcs/pcs-xpcs.h @@ -15,8 +15,13 @@ /* VR_XS_PCS */ #define DW_USXGMII_RST BIT(10) #define DW_USXGMII_EN BIT(9) +#define DW_VR_XS_PCS_DIG_CTRL1 0x0000 +#define DW_VR_RST BIT(15) +#define DW_EN_VSMMD1 BIT(13) #define DW_VR_XS_PCS_DIG_STS 0x0010 #define DW_RXFIFO_ERR GENMASK(6, 5) +#define DW_PSEQ_ST GENMASK(4, 2) +#define DW_PSEQ_ST_GOOD FIELD_PREP(GENMASK(4, 2), 0x4) /* SR_MII */ #define DW_USXGMII_FULL BIT(8) @@ -106,6 +111,9 @@ int xpcs_read(struct dw_xpcs *xpcs, int dev, u32 reg); int xpcs_write(struct dw_xpcs *xpcs, int dev, u32 reg, u16 val); +int xpcs_read_vpcs(struct dw_xpcs *xpcs, int reg); +int xpcs_write_vpcs(struct dw_xpcs *xpcs, int reg, u16 val); int nxp_sja1105_sgmii_pma_config(struct dw_xpcs *xpcs); int nxp_sja1110_sgmii_pma_config(struct dw_xpcs *xpcs); int nxp_sja1110_2500basex_pma_config(struct dw_xpcs *xpcs); +int txgbe_xpcs_switch_mode(struct dw_xpcs *xpcs, phy_interface_t interface); diff --git a/include/linux/pcs/pcs-xpcs.h b/include/linux/pcs/pcs-xpcs.h index f37e8acfd351..da3a6c30f6d2 100644 --- a/include/linux/pcs/pcs-xpcs.h +++ b/include/linux/pcs/pcs-xpcs.h @@ -32,6 +32,7 @@ struct dw_xpcs { struct mdio_device *mdiodev; const struct xpcs_id *id; struct phylink_pcs pcs; + phy_interface_t interface; int dev_flag; }; -- cgit v1.2.3